[PATCH] YAML: Implement block scalar parsing

Wed May 13 11:37:27 PDT 2015

> On 2015 May 13, at 09:53, Alex Lorenz <arphaman at gmail.com> wrote:
> 
> This update uses a new helper function that consumes a line break in another part of code that I forgot to update last time.
> 
> 
> REPOSITORY
>  rL LLVM
> 
> http://reviews.llvm.org/D9503
> 
> Files:
>  include/llvm/Support/YAMLParser.h
>  lib/Support/YAMLParser.cpp
>  test/YAMLParser/spec-09-14.test
>  test/YAMLParser/spec-09-18.test
>  test/YAMLParser/spec-09-19.test
>  test/YAMLParser/spec-09-20.test
>  test/YAMLParser/spec-09-21.test
>  test/YAMLParser/spec-09-22.test
>  test/YAMLParser/spec-09-24.test
>  test/YAMLParser/spec-09-25.test
>  test/YAMLParser/spec-09-26.test
>  test/YAMLParser/spec-09-27.test
>  test/YAMLParser/spec-09-28.test
>  unittests/Support/YAMLParserTest.cpp
>  utils/yaml-bench/YAMLBench.cpp
> 
> EMAIL PREFERENCES
>  http://reviews.llvm.org/settings/panel/emailpreferences/
> <D9503.25702.patch>

It's a lot easier to see what's going on now; thanks.

I wonder whether "returning IsDone" is the right thing for your
helpers, or whether "returning IsError" would be cleaner.  I also
think in one spot you fail to return correctly on error (I pointed
it out below somewhere).

Many of the rest of the comments amount to: "skip `else` after a
`return`".

> Index: lib/Support/YAMLParser.cpp
> ===================================================================
> --- lib/Support/YAMLParser.cpp
> +++ lib/Support/YAMLParser.cpp
> @@ -101,6 +101,7 @@
>  void Node::anchor() {}
>  void NullNode::anchor() {}
>  void ScalarNode::anchor() {}
> +void BlockScalarNode::anchor() {}
>  void KeyValueNode::anchor() {}
>  void MappingNode::anchor() {}
>  void SequenceNode::anchor() {}
> @@ -128,6 +129,7 @@
>      TK_Key,
>      TK_Value,
>      TK_Scalar,
> +    TK_BlockScalar,
>      TK_Alias,
>      TK_Anchor,
>      TK_Tag
> @@ -137,6 +139,9 @@
>    /// of the token in the input.
>    StringRef Range;
>  
> +  /// The value of a block scalar node.
> +  std::string Value;
> +
>    Token() : Kind(TK_Error) {}
>  };
>  }
> @@ -348,6 +353,14 @@
>    ///          b-break.
>    StringRef::iterator skip_b_break(StringRef::iterator Position);
>  
> +  /// @brief Skip a single s-space[31] starting at Position.
> +  ///
> +  /// An s-space is 0x20
> +  ///
> +  /// @returns The code unit after the s-space, or Position if it's not a
> +  ///          s-space.
> +  StringRef::iterator skip_s_space(StringRef::iterator Position);
> +
>    /// @brief Skip a single s-white[33] starting at Position.
>    ///
>    /// A s-white is 0x20 | 0x9
> @@ -373,6 +386,13 @@
>    StringRef::iterator skip_while( SkipWhileFunc Func
>                                  , StringRef::iterator Position);
>  
> +  /// @brief Skip minimal well-formed code unit subsequences until Func
> +  ///        returns its input.
> +  ///
> +  /// @returns The code unit after the last minimal well-formed code unit
> +  ///          subsequence that Func accepted.
> +  void advanceWhile(SkipWhileFunc Func);
> +
>    /// @brief Scan ns-uri-char[39]s starting at Cur.
>    ///
>    /// This updates Cur and Column while scanning.
> @@ -393,6 +413,11 @@
>    ///        Pos is whitespace or a new line
>    bool isBlankOrBreak(StringRef::iterator Position);
>  
> +  /// @brief Consume a single b-break[28] if it's present at the current
> +  /// position. Return false if the code unit at the current position ins't a
> +  /// line break.
> +  bool consumeLineBreakIfPresent();
> +
>    /// @brief If IsSimpleKeyAllowed, create and push_back a new SimpleKey.
>    void saveSimpleKeyCandidate( TokenQueueT::iterator Tok
>                               , unsigned AtColumn
> @@ -466,6 +491,26 @@
>    /// @brief Scan a block scalar starting with | or >.
>    bool scanBlockScalar(bool IsLiteral);
>  
> +  /// @brief Scan a chomping indicator in a block scalar header.
> +  char scanBlockChompingIndicator();
> +
> +  /// @brief Scan an indentation indicator in a block scalar header.
> +  unsigned scanBlockIndentationIndicator();
> +
> +  /// @brief Scan a block scalar header.
> +  /// Return true if we've reached the end of the block scalar.
> +  bool scanBlockScalarHeader(char &ChompingIndicator, unsigned &IndentIndicator,
> +                             bool &WasScanningSuccessful);
> +
> +  /// @brief Look for the indentation level of a block scalar.
> +  /// Return true if we've reached the end of the block scalar.
> +  bool findBlockScalarIndent(unsigned &BlockIndent, unsigned BlockExitIndent,
> +                             unsigned &LineBreaks);
> +
> +  /// @brief Scan the indentation of a text line in a block scalar.
> +  /// Return true if we've reached the end of the block scalar.
> +  bool scanBlockScalarIndent(unsigned BlockIndent, unsigned BlockExitIndent);
> +
>    /// @brief Scan a tag of the form !stuff.
>    bool scanTag();
>  
> @@ -612,6 +657,9 @@
>      case Token::TK_Scalar:
>        OS << "Scalar: ";
>        break;
> +    case Token::TK_BlockScalar:
> +      OS << "Block Scalar: ";
> +      break;
>      case Token::TK_Alias:
>        OS << "Alias: ";
>        break;
> @@ -816,6 +864,13 @@
>    return Position;
>  }
>  
> +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {
> +  if (Position == End)
> +    return Position;
> +  if (*Position == ' ')
> +    return Position + 1;
> +  return Position;
> +}
>  
>  StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {
>    if (Position == End)
> @@ -844,6 +899,12 @@
>    return Position;
>  }
>  
> +void Scanner::advanceWhile(SkipWhileFunc Func) {
> +  auto Final = skip_while(Func, Current);
> +  Column += Final - Current;
> +  Current = Final;
> +}
> +
>  static bool is_ns_hex_digit(const char C) {
>    return    (C >= '0' && C <= '9')
>           || (C >= 'a' && C <= 'z')
> @@ -906,6 +967,16 @@
>    return false;
>  }
>  
> +bool Scanner::consumeLineBreakIfPresent() {
> +  auto Next = skip_b_break(Current);
> +  if (Next == Current)
> +    return false;
> +  Column = 0;
> +  ++Line;
> +  Current = Next;
> +  return true;
> +}
> +
>  void Scanner::saveSimpleKeyCandidate( TokenQueueT::iterator Tok
>                                      , unsigned AtColumn
>                                      , bool IsRequired) {
> @@ -1374,38 +1445,220 @@
>    return true;
>  }
>  
> -bool Scanner::scanBlockScalar(bool IsLiteral) {
> -  StringRef::iterator Start = Current;
> -  skip(1); // Eat | or >
> -  while(true) {
> -    StringRef::iterator i = skip_nb_char(Current);
> -    if (i == Current) {
> -      if (Column == 0)
> -        break;
> -      i = skip_b_break(Current);
> -      if (i != Current) {
> -        // We got a line break.
> -        Column = 0;
> -        ++Line;
> -        Current = i;
> -        continue;
> +char Scanner::scanBlockChompingIndicator() {
> +  char Indicator = ' ';
> +  if (Current != End && (*Current == '+' || *Current == '-')) {
> +    Indicator = *Current;
> +    skip(1);
> +  }
> +  return Indicator;
> +}
> +
> +/// \brief Return the number of trailing line breaks that have to be emitted
> +/// as determined by the chomping indicator.

Summarize the function in one line.  If that's not enough, add more
after a blank line.  You don't need the `\brief` anymore, since we've
turned on auto-brief.  Something like this:

    /// Get the number of line breaks after chomping.
    ///
    /// Return the number of trailing line breaks to emit, depending on
    /// \c ChompingIndicator.

Please adjust the rest of your doxygen comments similarly.

> +static unsigned applyChompingIndicator(char ChompingIndicator,

I think `getChompedLineBreaks()` is a better name.

> +                                       unsigned LineBreaks, StringRef Str) {

I think your comments should be more succinct.  The code would be easier
to read.

> +  // Don't output any trailing new lines if the stripping chomping behaviour is
> +  // specified.

If you move the comment inside the `if`, you don't need to describe the
condition.  This could just be a comment on the same line as the
`return` (or the line before) that said something like "// Strip all
line breaks.".

> +  if (ChompingIndicator == '-')
> +    return 0;
> +  // Keep trailing lines when the keep chomping behaviour is specified.

Similarly, this comment could be "// Keep all line breaks.".

> +  else if (ChompingIndicator == '+')

This should be just `if`.

> +    return LineBreaks;
> +  // Clip trailing lines (default chomping behaviour). The final line break
> +  // is preserved as long as the string isn't empty, but the other trailing
> +  // lines aren't kept.

And this could be "Clip trailing lines."

> +  return Str.empty() ? 0 : 1;

Resulting in something like:

    if (ChompingIndicator == '-') // Strip all line breaks.
      return 0;

    if (ChompingIndicator == '+') // Keep all line breaks.
      return LineBreaks;

    // Clip trailing lines.
    return Str.empty() ? 0 : 1;

> +}
> +
> +unsigned Scanner::scanBlockIndentationIndicator() {
> +  unsigned Indent = 0;
> +  if (Current != End && (*Current >= '1' && *Current <= '9')) {
> +    Indent = unsigned(*Current - '0');
> +    skip(1);
> +  }
> +  return Indent;
> +}
> +
> +bool Scanner::scanBlockScalarHeader(char &ChompingIndicator,
> +                                    unsigned &IndentIndicator,
> +                                    bool &WasScanningSuccessful) {
> +  // Chomping indicator ([164] c-chomping-indicator)

Remove this comment.  Add this to the docs for
`scanBlockChompingIndicator()` if it's useful.

> +  ChompingIndicator = scanBlockChompingIndicator();
> +  // Indentation indicator ([163] c-indentation-indicator)

Remove this comment.  Add this to the docs for
`scanBlockIndentationIndicator()` if it's useful.

> +  IndentIndicator = scanBlockIndentationIndicator();
> +  // Check for the chomping indicator once again.
> +  if (ChompingIndicator == ' ')
> +    ChompingIndicator = scanBlockChompingIndicator();
> +  Current = skip_while(&Scanner::skip_s_white, Current);
> +  skipComment();
> +
> +  if (Current == End) {
> +    // End of file after block header, we have an empty scalar.
> +    WasScanningSuccessful = true;
> +    return true;
> +  }
> +  auto I = skip_b_break(Current);
> +  if (I == Current) {
> +    setError("Expected a line break after block scalar header", Current);
> +    WasScanningSuccessful = false;
> +    return true;
> +  }
> +  Current = I;
> +  Column = 0;
> +  ++Line;
> +  return false;
> +}
> +
> +bool Scanner::findBlockScalarIndent(unsigned &BlockIndent,
> +                                    unsigned BlockExitIndent,
> +                                    unsigned &LineBreaks) {
> +  unsigned MaxAllSpaceLineCharacters = 0;
> +  StringRef::iterator LongestAllSpaceLine;
> +
> +  while (true) {
> +    advanceWhile(&Scanner::skip_s_space);
> +    // If there is a non breaking character next, then this is a non empty
> +    // line, thus we can use the discovered indentation as the block's
> +    // indentation.

I'd move this...

> +    if (skip_nb_char(Current) != Current) {

... to here:

     // This line isn't empty, so find the indentation.

> +      if (Column <= BlockExitIndent) {
> +        // This is the end of the block literal
> +        return true;
>        } else {

Remove the `else`.

> -        // There was an error, which should already have been printed out.
> -        return false;
> +        BlockIndent = Column;
> +        // So here we enforce the rule that the leading all space lines can't
> +        // have more characters than the block's indentation level

Is this comment useful?  It looks to me like it's obvious from the error
message.

> +        if (MaxAllSpaceLineCharacters > BlockIndent) {
> +          setError("A leading all-space line must not have too many spaces",

Do you have a test for this error?

"too many" is vague.  I suggest something more like:

    "Leading all-space lines must be smaller than the block indent"

> +                   LongestAllSpaceLine);
> +        }
> +        // We found the block's indentation.
> +        break;

This should just be `return false`.

> +      }
> +    } else if (skip_b_break(Current) != Current) {

Remove the `else`.

> +      // This is an all space line, so we have to record the amount of
> +      // space characters it has so that later when we find the first
> +      // text line that defines the indentation level we can make sure
> +      // that all previous all space lines don't have more space
> +      // characters than the indentation level.

This is long and inhibits legibility of the code...

> +      if (Column > MaxAllSpaceLineCharacters) {

... so move it to here and skip the stuff that's obvious from reading
the code.

    // Record the longest all-space line in case it's longer than the
    // block indent.

> +        MaxAllSpaceLineCharacters = Column;
> +        LongestAllSpaceLine = Current;
>        }
>      }
> -    Current = i;
> +
> +    // Check for EOF.
> +    if (Current == End)
> +      return true;
> +
> +    if (!consumeLineBreakIfPresent())
> +      return true;
> +    ++LineBreaks;
> +  }
> +  return false;
> +}
> +
> +bool Scanner::scanBlockScalarIndent(unsigned BlockIndent,
> +                                    unsigned BlockExitIndent) {
> +  // Skip the indentation.
> +  while (Column < BlockIndent) {
> +    auto I = skip_s_space(Current);
> +    if (I == Current)
> +      break;
> +    Current = I;
>      ++Column;
>    }
>  
> -  if (Start == Current) {
> -    setError("Got empty block scalar", Start);
> -    return false;
> +  // If this line isn't empty then we have to check the indentation to
> +  // see if the block scalar ends.
> +  if (skip_nb_char(Current) != Current) {
> +    if (Column <= BlockExitIndent) {
> +      // This is the end of the block literal
> +      return true;
> +    } else if (Column < BlockIndent) {

Remove the `else`.

> +      if (Current != End && *Current == '#') {
> +        // This is a trailing comment
> +        return true;

What about spaces followed by a comment?  Is this duplicated from your
other comment-handling code?

> +      } else {

Remove the `else`.

> +        setError("A text line is less indented than the block scalar", Current);

Aren't you returning "IsDone"?  Should you return `true` here?

> +      }
> +    }
> +  }
> +  return false;

If you invert the outer condition, you can return early, and this'll all
be easier to read:

    if (skip_nb_char(Current) == Current)
      return false;

    if (Column <= BlockExitIndent) // End of the block literal.
      return true;

    if (Curent != End && *Current == '#') // Trailing comment.
      return true;

    setError(...);
    return false;

This logic matches what you have (I think!), but it seems kind of off.
In particular, I'm not sure the error makes sense -- it seems to
describe the "Column <= BlockExitIndent" scenario.  Once you've
refactored this, please review the logic.

> +}
> +
> +bool Scanner::scanBlockScalar(bool IsLiteral) {
> +  auto Start = Current;
> +  // Eat '|' or '>'
> +  assert(*Current == '|' || *Current == '>');
> +  skip(1);
> +
> +  // Block Header ([162] c-b-block-header)

This comment should be part of `scanBlockScalarHeader()`, if anywhere.

> +  char ChompingIndicator;
> +  unsigned BlockIndent;
> +  bool WasScanningSuccessful;
> +  bool Done = scanBlockScalarHeader(ChompingIndicator, BlockIndent,
> +                                    WasScanningSuccessful);

Usually we use returns from parsing functions as an error mechanism
(see, e.g., `LLParser.cpp`).  Should you really use it for Done-ness?
If you use it for errors, you get code like this:

    if (!scanBlockScalarHeader(...))
      return false;

Play with it a bit, and see if it looks cleaner (you'd need to change
the rest of your functions as well).  I don't feel strongly either way.

> +  if (Done) {
> +    if (WasScanningSuccessful) {
> +      Token T;
> +      T.Kind = Token::TK_BlockScalar;
> +      T.Range = StringRef(Start, Current - Start);
> +      TokenQueue.push_back(T);

Would this logic fit inside `scanBlockScalarHeader()`?

> +    }
> +    return WasScanningSuccessful;
> +  }
> +  Start = Current;
> +
> +  // Look for a first non-blank line to determine the indenation of the block
> +  // scalar.

Not a useful comment.  The name of the function you're calling explains
this.

> +  unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> +  unsigned LineBreaks = 0;
> +  if (BlockIndent == 0)
> +    Done = findBlockScalarIndent(BlockIndent, BlockExitIndent, LineBreaks);
> +
> +  // Scan the block's scalars body.
> +  SmallString<256> Str;
> +  while (!Done) {
> +    // Handle the block's indentation

The function name below is sufficient.  This comment isn't useful.

> +    if (scanBlockScalarIndent(BlockIndent, BlockExitIndent))

What about errors from this?

> +      break;
> +
> +    // Parse the current line.
> +    auto LineStart = Current;
> +    advanceWhile(&Scanner::skip_nb_char);
> +    if (LineStart != Current) {
> +      for (unsigned I = 0; I < LineBreaks; ++I)
> +        Str.push_back('\n');

This loop is the same as:

    Str.append(LineBreaks, '\n');

> +      LineBreaks = 0;
> +      Str.append(StringRef(LineStart, Current - LineStart));

I think it'll be easier to read if you put the two `Str`-modifying
operations next to each other, and then reset `LineBreaks`.

> +    }
> +
> +    // Check for EOF.
> +    if (Current == End)
> +      break;
> +
> +    if (!consumeLineBreakIfPresent())
> +      break;
> +    ++LineBreaks;
>    }
>  
> +  if (Current == End && !LineBreaks)
> +    // Ensure that there is at least one line break before the end of file.
> +    LineBreaks = 1;

Is this tested somewhere?  Does it interact correctly with the
chomping style?

> +  LineBreaks = applyChompingIndicator(ChompingIndicator, LineBreaks, Str);
> +  for (unsigned I = 0; I < LineBreaks; ++I)
> +    Str.push_back('\n');

Another case of:

    Str.append(LineBreaks, '\n');

With the rename I suggested for this function, I think this would read
well with both lines merged:

    Str.append(getChompedLineBreaks(ChompingIndicator, LineBreaks, Str),
               '\n');

Up to you though.

> +
> +  // New lines may start a simple key.
> +  if (!FlowLevel)
> +    IsSimpleKeyAllowed = true;

Is this tested somewhere?  Can it be in a separate commit?

> +
>    Token T;
> -  T.Kind = Token::TK_Scalar;
> +  T.Kind = Token::TK_BlockScalar;
>    T.Range = StringRef(Start, Current - Start);
> +  T.Value = std::move(Str.str().str());

The `std::move()` isn't useful -- this is already returning a temporary.

>    TokenQueue.push_back(T);
>    return true;
>  }
> @@ -1607,6 +1860,7 @@
>    case NK_Null:
>      return "tag:yaml.org,2002:null";
>    case NK_Scalar:
> +  case NK_BlockScalar:
>      // TODO: Tag resolution.
>      return "tag:yaml.org,2002:str";

Is this change tested somewhere?

>    case NK_Mapping:
> @@ -2138,6 +2392,11 @@
>                  , AnchorInfo.Range.substr(1)
>                  , TagInfo.Range
>                  , T.Range);
> +  case Token::TK_BlockScalar:
> +    getNext();
> +    return new (NodeAllocator)
> +        BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> +                        TagInfo.Range, T.Value, T.Range);
>    case Token::TK_Key:
>      // Don't eat the TK_Key, KeyValueNode expects it.
>      return new (NodeAllocator)