[PATCH] YAML: Implement block scalar parsing
Duncan P. N. Exon Smith
dexonsmith at apple.com
Wed May 13 15:29:17 PDT 2015
New patch LGTM. Thanks!
> On 2015 May 13, at 15:07, Alex L <arphaman at gmail.com> wrote:
>
>
> Thanks.
> Some comments below:
>
> 2015-05-13 11:37 GMT-07:00 Duncan P. N. Exon Smith <dexonsmith at apple.com>:
>
> > On 2015 May 13, at 09:53, Alex Lorenz <arphaman at gmail.com> wrote:
> >
> > This update uses a new helper function that consumes a line break in another part of code that I forgot to update last time.
> >
> >
> > REPOSITORY
> > rL LLVM
> >
> > http://reviews.llvm.org/D9503
> >
> > Files:
> > include/llvm/Support/YAMLParser.h
> > lib/Support/YAMLParser.cpp
> > test/YAMLParser/spec-09-14.test
> > test/YAMLParser/spec-09-18.test
> > test/YAMLParser/spec-09-19.test
> > test/YAMLParser/spec-09-20.test
> > test/YAMLParser/spec-09-21.test
> > test/YAMLParser/spec-09-22.test
> > test/YAMLParser/spec-09-24.test
> > test/YAMLParser/spec-09-25.test
> > test/YAMLParser/spec-09-26.test
> > test/YAMLParser/spec-09-27.test
> > test/YAMLParser/spec-09-28.test
> > unittests/Support/YAMLParserTest.cpp
> > utils/yaml-bench/YAMLBench.cpp
> >
> > EMAIL PREFERENCES
> > http://reviews.llvm.org/settings/panel/emailpreferences/
> > <D9503.25702.patch>
>
> It's a lot easier to see what's going on now; thanks.
>
> I wonder whether "returning IsDone" is the right thing for your
> helpers, or whether "returning IsError" would be cleaner. I also
> think in one spot you fail to return correctly on error (I pointed
> it out below somewhere).
>
> Many of the rest of the comments amount to: "skip `else` after a
> `return`".
>
> > Index: lib/Support/YAMLParser.cpp
> > ===================================================================
> > --- lib/Support/YAMLParser.cpp
> > +++ lib/Support/YAMLParser.cpp
> > @@ -101,6 +101,7 @@
> > void Node::anchor() {}
> > void NullNode::anchor() {}
> > void ScalarNode::anchor() {}
> > +void BlockScalarNode::anchor() {}
> > void KeyValueNode::anchor() {}
> > void MappingNode::anchor() {}
> > void SequenceNode::anchor() {}
> > @@ -128,6 +129,7 @@
> > TK_Key,
> > TK_Value,
> > TK_Scalar,
> > + TK_BlockScalar,
> > TK_Alias,
> > TK_Anchor,
> > TK_Tag
> > @@ -137,6 +139,9 @@
> > /// of the token in the input.
> > StringRef Range;
> >
> > + /// The value of a block scalar node.
> > + std::string Value;
> > +
> > Token() : Kind(TK_Error) {}
> > };
> > }
> > @@ -348,6 +353,14 @@
> > /// b-break.
> > StringRef::iterator skip_b_break(StringRef::iterator Position);
> >
> > + /// @brief Skip a single s-space[31] starting at Position.
> > + ///
> > + /// An s-space is 0x20
> > + ///
> > + /// @returns The code unit after the s-space, or Position if it's not a
> > + /// s-space.
> > + StringRef::iterator skip_s_space(StringRef::iterator Position);
> > +
> > /// @brief Skip a single s-white[33] starting at Position.
> > ///
> > /// A s-white is 0x20 | 0x9
> > @@ -373,6 +386,13 @@
> > StringRef::iterator skip_while( SkipWhileFunc Func
> > , StringRef::iterator Position);
> >
> > + /// @brief Skip minimal well-formed code unit subsequences until Func
> > + /// returns its input.
> > + ///
> > + /// @returns The code unit after the last minimal well-formed code unit
> > + /// subsequence that Func accepted.
> > + void advanceWhile(SkipWhileFunc Func);
> > +
> > /// @brief Scan ns-uri-char[39]s starting at Cur.
> > ///
> > /// This updates Cur and Column while scanning.
> > @@ -393,6 +413,11 @@
> > /// Pos is whitespace or a new line
> > bool isBlankOrBreak(StringRef::iterator Position);
> >
> > + /// @brief Consume a single b-break[28] if it's present at the current
> > + /// position. Return false if the code unit at the current position ins't a
> > + /// line break.
> > + bool consumeLineBreakIfPresent();
> > +
> > /// @brief If IsSimpleKeyAllowed, create and push_back a new SimpleKey.
> > void saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> > , unsigned AtColumn
> > @@ -466,6 +491,26 @@
> > /// @brief Scan a block scalar starting with | or >.
> > bool scanBlockScalar(bool IsLiteral);
> >
> > + /// @brief Scan a chomping indicator in a block scalar header.
> > + char scanBlockChompingIndicator();
> > +
> > + /// @brief Scan an indentation indicator in a block scalar header.
> > + unsigned scanBlockIndentationIndicator();
> > +
> > + /// @brief Scan a block scalar header.
> > + /// Return true if we've reached the end of the block scalar.
> > + bool scanBlockScalarHeader(char &ChompingIndicator, unsigned &IndentIndicator,
> > + bool &WasScanningSuccessful);
> > +
> > + /// @brief Look for the indentation level of a block scalar.
> > + /// Return true if we've reached the end of the block scalar.
> > + bool findBlockScalarIndent(unsigned &BlockIndent, unsigned BlockExitIndent,
> > + unsigned &LineBreaks);
> > +
> > + /// @brief Scan the indentation of a text line in a block scalar.
> > + /// Return true if we've reached the end of the block scalar.
> > + bool scanBlockScalarIndent(unsigned BlockIndent, unsigned BlockExitIndent);
> > +
> > /// @brief Scan a tag of the form !stuff.
> > bool scanTag();
> >
> > @@ -612,6 +657,9 @@
> > case Token::TK_Scalar:
> > OS << "Scalar: ";
> > break;
> > + case Token::TK_BlockScalar:
> > + OS << "Block Scalar: ";
> > + break;
> > case Token::TK_Alias:
> > OS << "Alias: ";
> > break;
> > @@ -816,6 +864,13 @@
> > return Position;
> > }
> >
> > +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {
> > + if (Position == End)
> > + return Position;
> > + if (*Position == ' ')
> > + return Position + 1;
> > + return Position;
> > +}
> >
> > StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {
> > if (Position == End)
> > @@ -844,6 +899,12 @@
> > return Position;
> > }
> >
> > +void Scanner::advanceWhile(SkipWhileFunc Func) {
> > + auto Final = skip_while(Func, Current);
> > + Column += Final - Current;
> > + Current = Final;
> > +}
> > +
> > static bool is_ns_hex_digit(const char C) {
> > return (C >= '0' && C <= '9')
> > || (C >= 'a' && C <= 'z')
> > @@ -906,6 +967,16 @@
> > return false;
> > }
> >
> > +bool Scanner::consumeLineBreakIfPresent() {
> > + auto Next = skip_b_break(Current);
> > + if (Next == Current)
> > + return false;
> > + Column = 0;
> > + ++Line;
> > + Current = Next;
> > + return true;
> > +}
> > +
> > void Scanner::saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> > , unsigned AtColumn
> > , bool IsRequired) {
> > @@ -1374,38 +1445,220 @@
> > return true;
> > }
> >
> > -bool Scanner::scanBlockScalar(bool IsLiteral) {
> > - StringRef::iterator Start = Current;
> > - skip(1); // Eat | or >
> > - while(true) {
> > - StringRef::iterator i = skip_nb_char(Current);
> > - if (i == Current) {
> > - if (Column == 0)
> > - break;
> > - i = skip_b_break(Current);
> > - if (i != Current) {
> > - // We got a line break.
> > - Column = 0;
> > - ++Line;
> > - Current = i;
> > - continue;
> > +char Scanner::scanBlockChompingIndicator() {
> > + char Indicator = ' ';
> > + if (Current != End && (*Current == '+' || *Current == '-')) {
> > + Indicator = *Current;
> > + skip(1);
> > + }
> > + return Indicator;
> > +}
> > +
> > +/// \brief Return the number of trailing line breaks that have to be emitted
> > +/// as determined by the chomping indicator.
>
> Summarize the function in one line. If that's not enough, add more
> after a blank line. You don't need the `\brief` anymore, since we've
> turned on auto-brief. Something like this:
>
> /// Get the number of line breaks after chomping.
> ///
> /// Return the number of trailing line breaks to emit, depending on
> /// \c ChompingIndicator.
>
> Please adjust the rest of your doxygen comments similarly.
>
> Fixed.
>
>
> > +static unsigned applyChompingIndicator(char ChompingIndicator,
>
> I think `getChompedLineBreaks()` is a better name.
>
> Done.
>
>
> > + unsigned LineBreaks, StringRef Str) {
>
> I think your comments should be more succinct. The code would be easier
> to read.
>
> > + // Don't output any trailing new lines if the stripping chomping behaviour is
> > + // specified.
>
> If you move the comment inside the `if`, you don't need to describe the
> condition. This could just be a comment on the same line as the
> `return` (or the line before) that said something like "// Strip all
> line breaks.".
>
> Fixed.
>
>
> > + if (ChompingIndicator == '-')
> > + return 0;
> > + // Keep trailing lines when the keep chomping behaviour is specified.
>
> Similarly, this comment could be "// Keep all line breaks.".
>
> > + else if (ChompingIndicator == '+')
>
> This should be just `if`.
>
> > + return LineBreaks;
> > + // Clip trailing lines (default chomping behaviour). The final line break
> > + // is preserved as long as the string isn't empty, but the other trailing
> > + // lines aren't kept.
>
> And this could be "Clip trailing lines."
>
> > + return Str.empty() ? 0 : 1;
>
> Resulting in something like:
>
> if (ChompingIndicator == '-') // Strip all line breaks.
> return 0;
>
> if (ChompingIndicator == '+') // Keep all line breaks.
> return LineBreaks;
>
> // Clip trailing lines.
> return Str.empty() ? 0 : 1;
>
> Done.
>
>
> > +}
> > +
> > +unsigned Scanner::scanBlockIndentationIndicator() {
> > + unsigned Indent = 0;
> > + if (Current != End && (*Current >= '1' && *Current <= '9')) {
> > + Indent = unsigned(*Current - '0');
> > + skip(1);
> > + }
> > + return Indent;
> > +}
> > +
> > +bool Scanner::scanBlockScalarHeader(char &ChompingIndicator,
> > + unsigned &IndentIndicator,
> > + bool &WasScanningSuccessful) {
> > + // Chomping indicator ([164] c-chomping-indicator)
>
> Remove this comment. Add this to the docs for
> `scanBlockChompingIndicator()` if it's useful.
>
> Fixed.
>
>
> > + ChompingIndicator = scanBlockChompingIndicator();
> > + // Indentation indicator ([163] c-indentation-indicator)
>
> Remove this comment. Add this to the docs for
> `scanBlockIndentationIndicator()` if it's useful.
>
> Fixed.
>
>
> > + IndentIndicator = scanBlockIndentationIndicator();
> > + // Check for the chomping indicator once again.
> > + if (ChompingIndicator == ' ')
> > + ChompingIndicator = scanBlockChompingIndicator();
> > + Current = skip_while(&Scanner::skip_s_white, Current);
> > + skipComment();
> > +
> > + if (Current == End) {
> > + // End of file after block header, we have an empty scalar.
> > + WasScanningSuccessful = true;
> > + return true;
> > + }
> > + auto I = skip_b_break(Current);
> > + if (I == Current) {
> > + setError("Expected a line break after block scalar header", Current);
> > + WasScanningSuccessful = false;
> > + return true;
> > + }
> > + Current = I;
> > + Column = 0;
> > + ++Line;
> > + return false;
> > +}
> > +
> > +bool Scanner::findBlockScalarIndent(unsigned &BlockIndent,
> > + unsigned BlockExitIndent,
> > + unsigned &LineBreaks) {
> > + unsigned MaxAllSpaceLineCharacters = 0;
> > + StringRef::iterator LongestAllSpaceLine;
> > +
> > + while (true) {
> > + advanceWhile(&Scanner::skip_s_space);
> > + // If there is a non breaking character next, then this is a non empty
> > + // line, thus we can use the discovered indentation as the block's
> > + // indentation.
>
> I'd move this...
>
> > + if (skip_nb_char(Current) != Current) {
>
> ... to here:
>
> // This line isn't empty, so find the indentation.
>
> Done.
>
>
> > + if (Column <= BlockExitIndent) {
> > + // This is the end of the block literal
> > + return true;
> > } else {
>
> Remove the `else`.
>
> Fixed.
>
>
> > - // There was an error, which should already have been printed out.
> > - return false;
> > + BlockIndent = Column;
> > + // So here we enforce the rule that the leading all space lines can't
> > + // have more characters than the block's indentation level
>
> Is this comment useful? It looks to me like it's obvious from the error
> message.
>
> Removed.
>
>
> > + if (MaxAllSpaceLineCharacters > BlockIndent) {
> > + setError("A leading all-space line must not have too many spaces",
>
> Do you have a test for this error?
>
> Yes, it’s tested in the YAML parser unit tests.
>
>
> "too many" is vague. I suggest something more like:
>
> "Leading all-space lines must be smaller than the block indent"
>
> Fixed.
>
>
> > + LongestAllSpaceLine);
> > + }
> > + // We found the block's indentation.
> > + break;
>
> This should just be `return false`.
>
> > + }
> > + } else if (skip_b_break(Current) != Current) {
>
> Remove the `else`.
>
> Fixed.
>
>
> > + // This is an all space line, so we have to record the amount of
> > + // space characters it has so that later when we find the first
> > + // text line that defines the indentation level we can make sure
> > + // that all previous all space lines don't have more space
> > + // characters than the indentation level.
>
> This is long and inhibits legibility of the code...
>
> > + if (Column > MaxAllSpaceLineCharacters) {
>
> ... so move it to here and skip the stuff that's obvious from reading
> the code.
>
> // Record the longest all-space line in case it's longer than the
> // block indent.
>
> Done.
>
>
> > + MaxAllSpaceLineCharacters = Column;
> > + LongestAllSpaceLine = Current;
> > }
> > }
> > - Current = i;
> > +
> > + // Check for EOF.
> > + if (Current == End)
> > + return true;
> > +
> > + if (!consumeLineBreakIfPresent())
> > + return true;
> > + ++LineBreaks;
> > + }
> > + return false;
> > +}
> > +
> > +bool Scanner::scanBlockScalarIndent(unsigned BlockIndent,
> > + unsigned BlockExitIndent) {
> > + // Skip the indentation.
> > + while (Column < BlockIndent) {
> > + auto I = skip_s_space(Current);
> > + if (I == Current)
> > + break;
> > + Current = I;
> > ++Column;
> > }
> >
> > - if (Start == Current) {
> > - setError("Got empty block scalar", Start);
> > - return false;
> > + // If this line isn't empty then we have to check the indentation to
> > + // see if the block scalar ends.
> > + if (skip_nb_char(Current) != Current) {
> > + if (Column <= BlockExitIndent) {
> > + // This is the end of the block literal
> > + return true;
> > + } else if (Column < BlockIndent) {
>
> Remove the `else`.
>
> Fixed.
>
>
> > + if (Current != End && *Current == '#') {
> > + // This is a trailing comment
> > + return true;
>
> What about spaces followed by a comment? Is this duplicated from your
> other comment-handling code?
>
> I’m not sure what you mean here. The leading all-space lines treat comments
> like content, this doesn’t come up there.
>
>
> > + } else {
>
> Remove the `else`.
>
> Fixed.
>
>
> > + setError("A text line is less indented than the block scalar", Current);
>
> Aren't you returning "IsDone"? Should you return `true` here?
>
> I'm returning false for an error now.
>
>
> > + }
> > + }
> > + }
> > + return false;
>
> If you invert the outer condition, you can return early, and this'll all
> be easier to read:
>
> if (skip_nb_char(Current) == Current)
> return false;
>
> if (Column <= BlockExitIndent) // End of the block literal.
> return true;
>
> if (Curent != End && *Current == '#') // Trailing comment.
> return true;
>
> setError(...);
> return false;
>
> This logic matches what you have (I think!), but it seems kind of off.
> In particular, I'm not sure the error makes sense -- it seems to
> describe the "Column <= BlockExitIndent" scenario. Once you've
> refactored this, please review the logic.
>
> Refactored.
>
>
> > +}
> > +
> > +bool Scanner::scanBlockScalar(bool IsLiteral) {
> > + auto Start = Current;
> > + // Eat '|' or '>'
> > + assert(*Current == '|' || *Current == '>');
> > + skip(1);
> > +
> > + // Block Header ([162] c-b-block-header)
>
> This comment should be part of `scanBlockScalarHeader()`, if anywhere.
>
> Removed.
>
>
> > + char ChompingIndicator;
> > + unsigned BlockIndent;
> > + bool WasScanningSuccessful;
> > + bool Done = scanBlockScalarHeader(ChompingIndicator, BlockIndent,
> > + WasScanningSuccessful);
>
> Usually we use returns from parsing functions as an error mechanism
> (see, e.g., `LLParser.cpp`). Should you really use it for Done-ness?
> If you use it for errors, you get code like this:
>
> if (!scanBlockScalarHeader(...))
> return false;
>
> Play with it a bit, and see if it looks cleaner (you'd need to change
> the rest of your functions as well). I don't feel strongly either way.
>
> That would be good for consistency, I use the error return now in the
> updated patch.
>
>
> > + if (Done) {
> > + if (WasScanningSuccessful) {
> > + Token T;
> > + T.Kind = Token::TK_BlockScalar;
> > + T.Range = StringRef(Start, Current - Start);
> > + TokenQueue.push_back(T);
>
> Would this logic fit inside `scanBlockScalarHeader()`?
>
> Yes, done.
>
>
> > + }
> > + return WasScanningSuccessful;
> > + }
> > + Start = Current;
> > +
> > + // Look for a first non-blank line to determine the indenation of the block
> > + // scalar.
>
> Not a useful comment. The name of the function you're calling explains
> this.
>
> Removed.
>
>
> > + unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> > + unsigned LineBreaks = 0;
> > + if (BlockIndent == 0)
> > + Done = findBlockScalarIndent(BlockIndent, BlockExitIndent, LineBreaks);
> > +
> > + // Scan the block's scalars body.
> > + SmallString<256> Str;
> > + while (!Done) {
> > + // Handle the block's indentation
>
> The function name below is sufficient. This comment isn't useful.
>
> Removed.
>
>
> > + if (scanBlockScalarIndent(BlockIndent, BlockExitIndent))
>
> What about errors from this?
>
> Fixed.
>
>
> > + break;
> > +
> > + // Parse the current line.
> > + auto LineStart = Current;
> > + advanceWhile(&Scanner::skip_nb_char);
> > + if (LineStart != Current) {
> > + for (unsigned I = 0; I < LineBreaks; ++I)
> > + Str.push_back('\n');
>
> This loop is the same as:
>
> Str.append(LineBreaks, '\n');
>
> Fixed.
>
>
> > + LineBreaks = 0;
> > + Str.append(StringRef(LineStart, Current - LineStart));
>
> I think it'll be easier to read if you put the two `Str`-modifying
> operations next to each other, and then reset `LineBreaks`.
>
> Done.
>
>
> > + }
> > +
> > + // Check for EOF.
> > + if (Current == End)
> > + break;
> > +
> > + if (!consumeLineBreakIfPresent())
> > + break;
> > + ++LineBreaks;
> > }
> >
> > + if (Current == End && !LineBreaks)
> > + // Ensure that there is at least one line break before the end of file.
> > + LineBreaks = 1;
>
> Is this tested somewhere? Does it interact correctly with the
> chomping style?
>
> Yes - it is tested at the 'YAMLParser/spec-09-24.test’ test.
>
>
> > + LineBreaks = applyChompingIndicator(ChompingIndicator, LineBreaks, Str);
> > + for (unsigned I = 0; I < LineBreaks; ++I)
> > + Str.push_back('\n');
>
> Another case of:
>
> Str.append(LineBreaks, '\n');
>
> With the rename I suggested for this function, I think this would read
> well with both lines merged:
>
> Str.append(getChompedLineBreaks(ChompingIndicator, LineBreaks, Str),
> '\n');
>
> Up to you though.
>
> Fixed.
>
>
> > +
> > + // New lines may start a simple key.
> > + if (!FlowLevel)
> > + IsSimpleKeyAllowed = true;
>
> Is this tested somewhere? Can it be in a separate commit?
>
> Yes - it is tested by several tests. I would really prefer this to be in this
> commit as It can’t be tested if I were to commit it before and
> after it would break most tests in this one. Those tests are derived from
> PyYAML and I think it’s important to try and keep the compatibility.
>
>
> > +
> > Token T;
> > - T.Kind = Token::TK_Scalar;
> > + T.Kind = Token::TK_BlockScalar;
> > T.Range = StringRef(Start, Current - Start);
> > + T.Value = std::move(Str.str().str());
>
> The `std::move()` isn't useful -- this is already returning a temporary.
>
> Fixed.
>
>
> > TokenQueue.push_back(T);
> > return true;
> > }
> > @@ -1607,6 +1860,7 @@
> > case NK_Null:
> > return "tag:yaml.org,2002:null";
> > case NK_Scalar:
> > + case NK_BlockScalar:
> > // TODO: Tag resolution.
> > return "tag:yaml.org,2002:str";
>
> Is this change tested somewhere?
>
> Yes - all the file check tests check the !!str tag, and yaml-bench derives this tag from this string.
>
>
> > case NK_Mapping:
> > @@ -2138,6 +2392,11 @@
> > , AnchorInfo.Range.substr(1)
> > , TagInfo.Range
> > , T.Range);
> > + case Token::TK_BlockScalar:
> > + getNext();
> > + return new (NodeAllocator)
> > + BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> > + TagInfo.Range, T.Value, T.Range);
> > case Token::TK_Key:
> > // Don't eat the TK_Key, KeyValueNode expects it.
> > return new (NodeAllocator)
More information about the llvm-commits
mailing list