[PATCH] YAML: Implement block scalar parsing

Tue May 12 11:57:40 PDT 2015

> On 2015-May-06, at 16:47, Alex Lorenz <arphaman at gmail.com> wrote:
> 
> As suggested by Duncan, I took out the extraction of the skipComment method from this patch, 
> which I've committed separately in r236663. I also took out the change to the KeyValueNode skip method, 
> which I've committed separately in r236669.
> 
> 
> REPOSITORY
>  rL LLVM
> 
> http://reviews.llvm.org/D9503
> 
> Files:
>  include/llvm/Support/YAMLParser.h
>  lib/Support/YAMLParser.cpp
>  test/YAMLParser/spec-09-14.test
>  test/YAMLParser/spec-09-18.test
>  test/YAMLParser/spec-09-19.test
>  test/YAMLParser/spec-09-20.test
>  test/YAMLParser/spec-09-21.test
>  test/YAMLParser/spec-09-22.test
>  test/YAMLParser/spec-09-24.test
>  test/YAMLParser/spec-09-25.test
>  test/YAMLParser/spec-09-26.test
>  test/YAMLParser/spec-09-27.test
>  test/YAMLParser/spec-09-28.test
>  test/YAMLParser/spec-09-29.test
>  test/YAMLParser/spec-09-30.test
>  test/YAMLParser/spec-09-31.test
>  test/YAMLParser/spec-09-32.test
>  test/YAMLParser/spec-09-33.test
>  unittests/Support/YAMLParserTest.cpp
>  utils/yaml-bench/YAMLBench.cpp
> 
> EMAIL PREFERENCES
>  http://reviews.llvm.org/settings/panel/emailpreferences/
> <D9503.25106.patch>

This is looking better, but I still have a few comments.  Most of them
boil down to:

 1. Can X be separated into a helper function?
 2. X looks like an encapsulated feature.  Can it be committed (and
    tested) separately?

> Index: lib/Support/YAMLParser.cpp
> ===================================================================
> --- lib/Support/YAMLParser.cpp
> +++ lib/Support/YAMLParser.cpp
> @@ -101,6 +101,7 @@
>  void Node::anchor() {}
>  void NullNode::anchor() {}
>  void ScalarNode::anchor() {}
> +void BlockScalarNode::anchor() {}
>  void KeyValueNode::anchor() {}
>  void MappingNode::anchor() {}
>  void SequenceNode::anchor() {}
> @@ -128,6 +129,7 @@
>      TK_Key,
>      TK_Value,
>      TK_Scalar,
> +    TK_BlockScalar,
>      TK_Alias,
>      TK_Anchor,
>      TK_Tag
> @@ -137,6 +139,9 @@
>    /// of the token in the input.
>    StringRef Range;
>  
> +  /// The value of a block scalar node.
> +  std::string Value;
> +
>    Token() : Kind(TK_Error) {}
>  };
>  }
> @@ -348,6 +353,14 @@
>    ///          b-break.
>    StringRef::iterator skip_b_break(StringRef::iterator Position);
>  
> +  /// @brief Skip a single s-space[31] starting at Position.
> +  ///
> +  /// An s-space is 0x20
> +  ///
> +  /// @returns The code unit after the s-space, or Position if it's not a
> +  ///          s-space.
> +  StringRef::iterator skip_s_space(StringRef::iterator Position);
> +
>    /// @brief Skip a single s-white[33] starting at Position.
>    ///
>    /// A s-white is 0x20 | 0x9
> @@ -612,6 +625,9 @@
>      case Token::TK_Scalar:
>        OS << "Scalar: ";
>        break;
> +    case Token::TK_BlockScalar:
> +      OS << "Block Scalar: ";
> +      break;
>      case Token::TK_Alias:
>        OS << "Alias: ";
>        break;
> @@ -816,6 +832,13 @@
>    return Position;
>  }
>  
> +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {
> +  if (Position == End)
> +    return Position;
> +  if (*Position == ' ')
> +    return Position + 1;
> +  return Position;
> +}
>  
>  StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {
>    if (Position == End)
> @@ -1375,37 +1398,196 @@
>  }
>  
>  bool Scanner::scanBlockScalar(bool IsLiteral) {
> -  StringRef::iterator Start = Current;

It looks like you've renamed this variable from `Start` to `BlockStart`.
Please commit that separately since it unnecessarily complicates this
patch.

> -  skip(1); // Eat | or >
> -  while(true) {
> -    StringRef::iterator i = skip_nb_char(Current);
> -    if (i == Current) {
> -      if (Column == 0)
> -        break;
> -      i = skip_b_break(Current);
> -      if (i != Current) {
> -        // We got a line break.
> -        Column = 0;
> -        ++Line;
> -        Current = i;
> -        continue;
> -      } else {
> -        // There was an error, which should already have been printed out.
> -        return false;
> -      }
> -    }
> -    Current = i;
> -    ++Column;
> +  std::string Str;
> +  raw_string_ostream OS(Str);

Is the raw_string_ostream really valuable here?  Why not a
`SmallVector<char, 256>` or some such, and just append to it directly?

I ask because it looks like you're only appending `char`s, so the
overhead of a stream (even a fast stream) seems indirect.

> +  auto BlockStart = Current;
> +  // If this is true, then the scanner has to look for a first non-blank line
> +  // to determine the indenation of the block scalar.
> +  bool LookingForIndent = true;

The name should be more obviously Boolean.  I suggest `IsIndentKnown`
(and inverting the meaning), but you may find something better.

> +  unsigned BlockIndent = 0;
> +  unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> +  unsigned LineBreaks = 0;
> +
> +  unsigned MaxAllSpaceLineCharacters = 0;
> +  StringRef::iterator LongestAllSpaceLine;
> +
> +  bool FoldNextLine = false;

Please declare all of these variables immediately before their
respective first uses.

> +
> +  skip(1); // Eat '|' or '>'

Would it be valuable to assert that Current is one of those two?

> +
> +  // Block Header ([162] c-b-block-header)
> +  // Chomping indicator ([164] c-chomping-indicator)
> +  char ChompingIndicator = ' ';
> +  if (Current != End && (*Current == '+' || *Current == '-')) {
> +    ChompingIndicator = *Current;
> +    skip(1);
>    }
> +  // Indentation indicator ([163] c-indentation-indicator)
> +  if (Current != End && (*Current >= '1' && *Current <= '9')) {
> +    BlockIndent = unsigned(*Current - '0');
> +    LookingForIndent = false;
> +    skip(1);
> +  }
> +  // Check for chomping indicator once again.
> +  if (Current != End && (*Current == '+' || *Current == '-')) {
> +    ChompingIndicator = *Current;
> +    skip(1);
> +  }

Why do you need to check this twice?  Is it valid to for it to appear
twice?  Are there any requirements for it?

If possible, these should really be thrown into helper functions.
Depending on the exact semantics of the chomping indicator, something
like this would be easier to read:

    char ChompingIndicator = scanChompingCharacter();
    unsigned BlockIndent = scanBlockIndent();
    skipChompingIndicator(ChompingIndicator);

where I'm assuming:

    void skipChompingIndicator(char ChompingIndicator) {
      switch (ChompingIndicator) {
        default:
          llvm_unreachable("invalid chomping indicator");
        case ' ':
          break;
        case '+':
        case '-':
          skip(1);
      }
    }

But exactly how to make this readible depends on what the correct
behaviour is.

> +  Current = skip_while(&Scanner::skip_s_white, Current);
> +  skipComment();
>  
> -  if (Start == Current) {
> -    setError("Got empty block scalar", Start);
> +  if (Current == End) {
> +    // End of file after block header, we have an empty scalar.
> +    Token T;
> +    T.Kind = Token::TK_BlockScalar;
> +    T.Range = StringRef(BlockStart, Current - BlockStart);
> +    TokenQueue.push_back(T);
> +    return true;
> +  }

Can this change be in a separate commit?

> +  auto I = skip_b_break(Current);
> +  if (I == Current) {
> +    setError("Expected a line break after block scalar header", Current);
>      return false;
>    }
> +  Current = I;
> +  Column = 0;
> +  ++Line;
> +  BlockStart = Current;

And that looks like the end of the header.  Please find a way to make
the whole thing read something like this:

    scanBlockHeader(/* ... */);

This probably means passing around (or otherwise storing) some state for
scanning blocks.

> +
> +  while (true) {
> +    // Handle the block's indentation
> +    if (LookingForIndent) {
> +      while (true) {
> +        auto I = skip_s_space(Current);
> +        if (I == Current)
> +          break;
> +        Current = I;
> +        ++Column;
> +      }

This little while loop is a bit hard to understand, and should be split
out.  I suggest something like:

    template <class FTy> void advanceWhile(FTy F) {
      auto Final = skip_while(F, Current);
      Column += Final - Current;
      Current = Final;
    }

I don't really know this class well, so if this API is strangely
inconsistent just come up with something that works.  Anyway, then you
could use this as:

    if (LookingForIndent) {
      advanceWhile(&Scanner::skip_s_space);

> +      // If there is a non breaking character next, then this is a non empty
> +      // line, thus we can use the discovered indentation as the block's
> +      // indentation.
> +      if (skip_nb_char(Current) != Current) {
> +        if (Column <= BlockExitIndent) {
> +          // This is the end of the block literal, so we exit the loop.
> +          break;
> +        } else {
> +          BlockIndent = Column;
> +          LookingForIndent = false;
> +          // So here we enforce the rule that the leading all space lines can't
> +          // have more characters than the block's indentation level
> +          if (MaxAllSpaceLineCharacters > BlockIndent) {
> +            setError("A leading all-space line must not have too many spaces",
> +                     LongestAllSpaceLine);
> +          }
> +        }
> +      } else if (skip_b_break(Current) != Current) {
> +        // This is an all space line, so we have to record the amount of
> +        // space characters it has so that later when we find the first
> +        // text line that defines the indentation level we can make sure
> +        // that all previous all space lines don't have more space
> +        // characters than the indentation level.
> +        if (Column > MaxAllSpaceLineCharacters) {
> +          MaxAllSpaceLineCharacters = Column;
> +          LongestAllSpaceLine = Current;
> +        }
> +      }

This whole block of logic should probably be a in a function called
something like `lookForIndent()`.

Also, please find a way to separate this from the outer `while (true)`
loop.  You only need to find the indentation once, right?

> +    } else {
> +      while (Column < BlockIndent) {
> +        auto I = skip_s_space(Current);
> +        if (I == Current)
> +          break;
> +        Current = I;
> +        ++Column;
> +      }
> +      // If this line isn't empty then we have to check the indentation to
> +      // see if the block scalar ends.
> +      if (skip_nb_char(Current) != Current) {
> +        if (Column <= BlockExitIndent) {
> +          // This is the end of the block literal, exit the loop.
> +          break;
> +        } else if (Column < BlockIndent) {
> +          if (Current != End && *Current == '#') {
> +            // This is a trailing comment, exit the loop.
> +            break;
> +          } else {
> +            setError("A text line is less indented than the block scalar",
> +                     Current);

Should this somehow return in error?

> +          }
> +        }
> +      }

It looks like you should also create a function for this block of code,
which scans the indentation at the beginning of every line.

> +    }
> +
> +    // Fold this line if necessary, by removing an additional line break.
> +    // Folding is applied to text lines which follow other text lines and don't
> +    // have any extra spaces (or tabs) after the indentation spaces.
> +    bool IsLineFolded = false;
> +    if (!IsLiteral && FoldNextLine && Column == BlockIndent &&
> +        skip_s_white(Current) == Current && skip_nb_char(Current) != Current &&
> +        LineBreaks) {
> +      IsLineFolded = true;
> +      --LineBreaks;
> +    }

What does it mean to have a folded line?  Do you need to include this
line-folding logic in the same patch, or can you add that feature after?

> +
> +    // Parse the current line.
> +    auto Start = Current;
> +    Current = skip_while(&Scanner::skip_nb_char, Current);
> +    Column += (Current - Start);
> +    if (Start != Current) {
> +      for (unsigned I = 0; I < LineBreaks; ++I)
> +        OS << "\n";
> +      if (IsLineFolded && !LineBreaks)
> +        OS << ' ';
> +      LineBreaks = 0;
> +      auto Line = StringRef(Start, Current - Start);
> +      OS << Line;
> +      // Don't fold the next line when this line ends with a whitespace
> +      // character.
> +      FoldNextLine = Line.back() != ' ' && Line.back() != '\t';
> +    } else {
> +      FoldNextLine = false;
> +    }

This is the start of the line-folding logic (and the second half is up
above), right?  Is it necessary that it's split across the loop
boundary, or is there a way to localize it here?  It'll be easier to
follow if it's localized.  (I still think this might be better as a
separate patch, since it seems like a separate feature.)

> +
> +    // Check for EOF.
> +    if (Current == End) {
> +      // Ensure that there is at least one line break before the end of file.
> +      if (!LineBreaks)
> +        LineBreaks = 1;
> +      break;
> +    }

Can this be separated into a function?

> +    auto I = skip_b_break(Current);
> +    if (I == Current) {
> +      // There was an error, which should already have been printed out.
> +      return false;

Why not return at the error site?

> +    }
> +    // We've got a line break
> +    Column = 0;
> +    ++Line;
> +    Current = I;
> +    ++LineBreaks;
> +  }
> +
> +  // Don't output any trailing new lines if the stripping chomping behaviour is
> +  // specified.
> +  if (ChompingIndicator == '-')
> +    LineBreaks = 0;
> +  // Clip trailing lines (default chomping behaviour). The final line break
> +  // is preserved as long as the string isn't empty, but the other trailing
> +  // lines aren't kept.
> +  else if (ChompingIndicator != '+')
> +    LineBreaks = OS.str().empty() ? 0 : 1;

Can you put this in some sort of helper function?

> +
> +  for (unsigned I = 0; I < LineBreaks; ++I)
> +    OS << "\n";
> +
> +  // New lines may start a simple key.
> +  if (!FlowLevel)
> +    IsSimpleKeyAllowed = true;

Is this a separable change?

>  
>    Token T;
> -  T.Kind = Token::TK_Scalar;
> -  T.Range = StringRef(Start, Current - Start);
> +  T.Kind = Token::TK_BlockScalar;
> +  T.Range = StringRef(BlockStart, Current - BlockStart);
> +  T.Value = std::move(OS.str());
>    TokenQueue.push_back(T);
>    return true;
>  }
> @@ -1607,6 +1789,7 @@
>    case NK_Null:
>      return "tag:yaml.org,2002:null";
>    case NK_Scalar:
> +  case NK_BlockScalar:
>      // TODO: Tag resolution.
>      return "tag:yaml.org,2002:str";
>    case NK_Mapping:
> @@ -2138,6 +2321,11 @@
>                  , AnchorInfo.Range.substr(1)
>                  , TagInfo.Range
>                  , T.Range);
> +  case Token::TK_BlockScalar:
> +    getNext();
> +    return new (NodeAllocator)
> +        BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> +                        TagInfo.Range, T.Value, T.Range);
>    case Token::TK_Key:
>      // Don't eat the TK_Key, KeyValueNode expects it.
>      return new (NodeAllocator)