[PATCH] YAML: Implement block scalar parsing

Tue May 12 14:53:40 PDT 2015

Thanks, I'll post the updated patch soon.

2015-05-12 11:57 GMT-07:00 Duncan P. N. Exon Smith <dexonsmith at apple.com>:

>
> > On 2015-May-06, at 16:47, Alex Lorenz <arphaman at gmail.com> wrote:
> >
> > As suggested by Duncan, I took out the extraction of the skipComment
> method from this patch,
> > which I've committed separately in r236663. I also took out the change
> to the KeyValueNode skip method,
> > which I've committed separately in r236669.
> >
> >
> > REPOSITORY
> >  rL LLVM
> >
> > http://reviews.llvm.org/D9503
> >
> > Files:
> >  include/llvm/Support/YAMLParser.h
> >  lib/Support/YAMLParser.cpp
> >  test/YAMLParser/spec-09-14.test
> >  test/YAMLParser/spec-09-18.test
> >  test/YAMLParser/spec-09-19.test
> >  test/YAMLParser/spec-09-20.test
> >  test/YAMLParser/spec-09-21.test
> >  test/YAMLParser/spec-09-22.test
> >  test/YAMLParser/spec-09-24.test
> >  test/YAMLParser/spec-09-25.test
> >  test/YAMLParser/spec-09-26.test
> >  test/YAMLParser/spec-09-27.test
> >  test/YAMLParser/spec-09-28.test
> >  test/YAMLParser/spec-09-29.test
> >  test/YAMLParser/spec-09-30.test
> >  test/YAMLParser/spec-09-31.test
> >  test/YAMLParser/spec-09-32.test
> >  test/YAMLParser/spec-09-33.test
> >  unittests/Support/YAMLParserTest.cpp
> >  utils/yaml-bench/YAMLBench.cpp
> >
> > EMAIL PREFERENCES
> >  http://reviews.llvm.org/settings/panel/emailpreferences/
> > <D9503.25106.patch>
>
> This is looking better, but I still have a few comments.  Most of them
> boil down to:
>
>  1. Can X be separated into a helper function?
>  2. X looks like an encapsulated feature.  Can it be committed (and
>     tested) separately?
>
> > Index: lib/Support/YAMLParser.cpp
> > ===================================================================
> > --- lib/Support/YAMLParser.cpp
> > +++ lib/Support/YAMLParser.cpp
> > @@ -101,6 +101,7 @@
> >  void Node::anchor() {}
> >  void NullNode::anchor() {}
> >  void ScalarNode::anchor() {}
> > +void BlockScalarNode::anchor() {}
> >  void KeyValueNode::anchor() {}
> >  void MappingNode::anchor() {}
> >  void SequenceNode::anchor() {}
> > @@ -128,6 +129,7 @@
> >      TK_Key,
> >      TK_Value,
> >      TK_Scalar,
> > +    TK_BlockScalar,
> >      TK_Alias,
> >      TK_Anchor,
> >      TK_Tag
> > @@ -137,6 +139,9 @@
> >    /// of the token in the input.
> >    StringRef Range;
> >
> > +  /// The value of a block scalar node.
> > +  std::string Value;
> > +
> >    Token() : Kind(TK_Error) {}
> >  };
> >  }
> > @@ -348,6 +353,14 @@
> >    ///          b-break.
> >    StringRef::iterator skip_b_break(StringRef::iterator Position);
> >
> > +  /// @brief Skip a single s-space[31] starting at Position.
> > +  ///
> > +  /// An s-space is 0x20
> > +  ///
> > +  /// @returns The code unit after the s-space, or Position if it's not
> a
> > +  ///          s-space.
> > +  StringRef::iterator skip_s_space(StringRef::iterator Position);
> > +
> >    /// @brief Skip a single s-white[33] starting at Position.
> >    ///
> >    /// A s-white is 0x20 | 0x9
> > @@ -612,6 +625,9 @@
> >      case Token::TK_Scalar:
> >        OS << "Scalar: ";
> >        break;
> > +    case Token::TK_BlockScalar:
> > +      OS << "Block Scalar: ";
> > +      break;
> >      case Token::TK_Alias:
> >        OS << "Alias: ";
> >        break;
> > @@ -816,6 +832,13 @@
> >    return Position;
> >  }
> >
> > +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position)
> {
> > +  if (Position == End)
> > +    return Position;
> > +  if (*Position == ' ')
> > +    return Position + 1;
> > +  return Position;
> > +}
> >
> >  StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position)
> {
> >    if (Position == End)
> > @@ -1375,37 +1398,196 @@
> >  }
> >
> >  bool Scanner::scanBlockScalar(bool IsLiteral) {
> > -  StringRef::iterator Start = Current;
>
> It looks like you've renamed this variable from `Start` to `BlockStart`.
> Please commit that separately since it unnecessarily complicates this
> patch.
>

Sure, I will do that.

>
> > -  skip(1); // Eat | or >
> > -  while(true) {
> > -    StringRef::iterator i = skip_nb_char(Current);
> > -    if (i == Current) {
> > -      if (Column == 0)
> > -        break;
> > -      i = skip_b_break(Current);
> > -      if (i != Current) {
> > -        // We got a line break.
> > -        Column = 0;
> > -        ++Line;
> > -        Current = i;
> > -        continue;
> > -      } else {
> > -        // There was an error, which should already have been printed
> out.
> > -        return false;
> > -      }
> > -    }
> > -    Current = i;
> > -    ++Column;
> > +  std::string Str;
> > +  raw_string_ostream OS(Str);
>
> Is the raw_string_ostream really valuable here?  Why not a
> `SmallVector<char, 256>` or some such, and just append to it directly?
>
> I ask because it looks like you're only appending `char`s, so the
> overhead of a stream (even a fast stream) seems indirect.
>

Yes, that would work well.

>
> > +  auto BlockStart = Current;
> > +  // If this is true, then the scanner has to look for a first
> non-blank line
> > +  // to determine the indenation of the block scalar.
> > +  bool LookingForIndent = true;
>
> The name should be more obviously Boolean.  I suggest `IsIndentKnown`
> (and inverting the meaning), but you may find something better.
>
> > +  unsigned BlockIndent = 0;
> > +  unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> > +  unsigned LineBreaks = 0;
> > +
> > +  unsigned MaxAllSpaceLineCharacters = 0;
> > +  StringRef::iterator LongestAllSpaceLine;
> > +
> > +  bool FoldNextLine = false;
>
> Please declare all of these variables immediately before their
> respective first uses.
>
> > +
> > +  skip(1); // Eat '|' or '>'
>
> Would it be valuable to assert that Current is one of those two?
>
> > +
> > +  // Block Header ([162] c-b-block-header)
> > +  // Chomping indicator ([164] c-chomping-indicator)
> > +  char ChompingIndicator = ' ';
> > +  if (Current != End && (*Current == '+' || *Current == '-')) {
> > +    ChompingIndicator = *Current;
> > +    skip(1);
> >    }
> > +  // Indentation indicator ([163] c-indentation-indicator)
> > +  if (Current != End && (*Current >= '1' && *Current <= '9')) {
> > +    BlockIndent = unsigned(*Current - '0');
> > +    LookingForIndent = false;
> > +    skip(1);
> > +  }
> > +  // Check for chomping indicator once again.
> > +  if (Current != End && (*Current == '+' || *Current == '-')) {
> > +    ChompingIndicator = *Current;
> > +    skip(1);
> > +  }
>
> Why do you need to check this twice?  Is it valid to for it to appear
> twice?  Are there any requirements for it?
>
> If possible, these should really be thrown into helper functions.
> Depending on the exact semantics of the chomping indicator, something
> like this would be easier to read:
>
>     char ChompingIndicator = scanChompingCharacter();
>     unsigned BlockIndent = scanBlockIndent();
>     skipChompingIndicator(ChompingIndicator);
>
> where I'm assuming:
>
>     void skipChompingIndicator(char ChompingIndicator) {
>       switch (ChompingIndicator) {
>         default:
>           llvm_unreachable("invalid chomping indicator");
>         case ' ':
>           break;
>         case '+':
>         case '-':
>           skip(1);
>       }
>     }
>
> But exactly how to make this readible depends on what the correct
> behaviour is.
>

It can appear before or after the indentation indicator. Actually right
now there's a subtle bug in there that allows you to have two chomping
indicators, so I'll have to fix it.
It would make sense to make it into a separate method, I'll do that.

>
> > +  Current = skip_while(&Scanner::skip_s_white, Current);
> > +  skipComment();
> >
> > -  if (Start == Current) {
> > -    setError("Got empty block scalar", Start);
> > +  if (Current == End) {
> > +    // End of file after block header, we have an empty scalar.
> > +    Token T;
> > +    T.Kind = Token::TK_BlockScalar;
> > +    T.Range = StringRef(BlockStart, Current - BlockStart);
> > +    TokenQueue.push_back(T);
> > +    return true;
> > +  }
>
> Can this change be in a separate commit?
>

I suppose it could but I would prefer it if went into this one to make the
tests consistent. If I were to commit it later it would make tests for this
patch incomplete. Just because the old code reported an error there
doesn't mean it was correct, and I think it would be better to treat it
like that
old code wasn't there so this part isn't so much of a change
but an essential part of a new feature.

>
> > +  auto I = skip_b_break(Current);
> > +  if (I == Current) {
> > +    setError("Expected a line break after block scalar header",
> Current);
> >      return false;
> >    }
> > +  Current = I;
> > +  Column = 0;
> > +  ++Line;
> > +  BlockStart = Current;
>
> And that looks like the end of the header.  Please find a way to make
> the whole thing read something like this:
>
>     scanBlockHeader(/* ... */);
>
> This probably means passing around (or otherwise storing) some state for
> scanning blocks.
>
> > +
> > +  while (true) {
> > +    // Handle the block's indentation
> > +    if (LookingForIndent) {
> > +      while (true) {
> > +        auto I = skip_s_space(Current);
> > +        if (I == Current)
> > +          break;
> > +        Current = I;
> > +        ++Column;
> > +      }
>
> This little while loop is a bit hard to understand, and should be split
> out.  I suggest something like:
>
>     template <class FTy> void advanceWhile(FTy F) {
>       auto Final = skip_while(F, Current);
>       Column += Final - Current;
>       Current = Final;
>     }
>
> I don't really know this class well, so if this API is strangely
> inconsistent just come up with something that works.  Anyway, then you
> could use this as:
>
>     if (LookingForIndent) {
>       advanceWhile(&Scanner::skip_s_space);
>
> > +      // If there is a non breaking character next, then this is a non
> empty
> > +      // line, thus we can use the discovered indentation as the block's
> > +      // indentation.
> > +      if (skip_nb_char(Current) != Current) {
> > +        if (Column <= BlockExitIndent) {
> > +          // This is the end of the block literal, so we exit the loop.
> > +          break;
> > +        } else {
> > +          BlockIndent = Column;
> > +          LookingForIndent = false;
> > +          // So here we enforce the rule that the leading all space
> lines can't
> > +          // have more characters than the block's indentation level
> > +          if (MaxAllSpaceLineCharacters > BlockIndent) {
> > +            setError("A leading all-space line must not have too many
> spaces",
> > +                     LongestAllSpaceLine);
> > +          }
> > +        }
> > +      } else if (skip_b_break(Current) != Current) {
> > +        // This is an all space line, so we have to record the amount of
> > +        // space characters it has so that later when we find the first
> > +        // text line that defines the indentation level we can make sure
> > +        // that all previous all space lines don't have more space
> > +        // characters than the indentation level.
> > +        if (Column > MaxAllSpaceLineCharacters) {
> > +          MaxAllSpaceLineCharacters = Column;
> > +          LongestAllSpaceLine = Current;
> > +        }
> > +      }
>
> This whole block of logic should probably be a in a function called
> something like `lookForIndent()`.
>
> Also, please find a way to separate this from the outer `while (true)`
> loop.  You only need to find the indentation once, right?\
>

Yes, I need to look for indentation once. I'll try to take it out of the
main
loop.

>
> > +    } else {
> > +      while (Column < BlockIndent) {
> > +        auto I = skip_s_space(Current);
> > +        if (I == Current)
> > +          break;
> > +        Current = I;
> > +        ++Column;
> > +      }
> > +      // If this line isn't empty then we have to check the indentation
> to
> > +      // see if the block scalar ends.
> > +      if (skip_nb_char(Current) != Current) {
> > +        if (Column <= BlockExitIndent) {
> > +          // This is the end of the block literal, exit the loop.
> > +          break;
> > +        } else if (Column < BlockIndent) {
> > +          if (Current != End && *Current == '#') {
> > +            // This is a trailing comment, exit the loop.
> > +            break;
> > +          } else {
> > +            setError("A text line is less indented than the block
> scalar",
> > +                     Current);
>
> Should this somehow return in error?
>

It would be better for consistency here, but I think that errors like this
are recoverable. If we return here we will loose the
strict compatibility with the current YAML parser tests in this patch,
which are derived from the PyYAML spec and assume that
the YAML parser is capable of recovering from some errors.

>
> > +          }
> > +        }
> > +      }
>
> It looks like you should also create a function for this block of code,
> which scans the indentation at the beginning of every line.
>
> > +    }
> > +
> > +    // Fold this line if necessary, by removing an additional line
> break.
> > +    // Folding is applied to text lines which follow other text lines
> and don't
> > +    // have any extra spaces (or tabs) after the indentation spaces.
> > +    bool IsLineFolded = false;
> > +    if (!IsLiteral && FoldNextLine && Column == BlockIndent &&
> > +        skip_s_white(Current) == Current && skip_nb_char(Current) !=
> Current &&
> > +        LineBreaks) {
> > +      IsLineFolded = true;
> > +      --LineBreaks;
> > +    }
>
> What does it mean to have a folded line?  Do you need to include this
> line-folding logic in the same patch, or can you add that feature after?
>

Line folding applies to block scalars that start with '>'. It basically
removes
the newlines for lines that don't have any spaces or other newlines between
them.

It would be better to make line folding into a separate feature that I can
send out a patch for
later, yeah.

>
> > +
> > +    // Parse the current line.
> > +    auto Start = Current;
> > +    Current = skip_while(&Scanner::skip_nb_char, Current);
> > +    Column += (Current - Start);
> > +    if (Start != Current) {
> > +      for (unsigned I = 0; I < LineBreaks; ++I)
> > +        OS << "\n";
> > +      if (IsLineFolded && !LineBreaks)
> > +        OS << ' ';
> > +      LineBreaks = 0;
> > +      auto Line = StringRef(Start, Current - Start);
> > +      OS << Line;
> > +      // Don't fold the next line when this line ends with a whitespace
> > +      // character.
> > +      FoldNextLine = Line.back() != ' ' && Line.back() != '\t';
> > +    } else {
> > +      FoldNextLine = false;
> > +    }
>
> This is the start of the line-folding logic (and the second half is up
> above), right?  Is it necessary that it's split across the loop
> boundary, or is there a way to localize it here?  It'll be easier to
> follow if it's localized.  (I still think this might be better as a
> separate patch, since it seems like a separate feature.)
>
> > +
> > +    // Check for EOF.
> > +    if (Current == End) {
> > +      // Ensure that there is at least one line break before the end of
> file.
> > +      if (!LineBreaks)
> > +        LineBreaks = 1;
> > +      break;
> > +    }
>
> Can this be separated into a function?
>
> > +    auto I = skip_b_break(Current);
> > +    if (I == Current) {
> > +      // There was an error, which should already have been printed out.
> > +      return false;
>
> Why not return at the error site?
>

This isn't related to the two errors that don't return above. The comment is
misleading (I copied the comment from the old code).
This is a separate error (unexpected characters instead of line break) and
I think that we should actually report and error here as the error for this
particular problem doesn't appear to be actually reported.

>
> > +    }
> > +    // We've got a line break
> > +    Column = 0;
> > +    ++Line;
> > +    Current = I;
> > +    ++LineBreaks;
> > +  }
> > +
> > +  // Don't output any trailing new lines if the stripping chomping
> behaviour is
> > +  // specified.
> > +  if (ChompingIndicator == '-')
> > +    LineBreaks = 0;
> > +  // Clip trailing lines (default chomping behaviour). The final line
> break
> > +  // is preserved as long as the string isn't empty, but the other
> trailing
> > +  // lines aren't kept.
> > +  else if (ChompingIndicator != '+')
> > +    LineBreaks = OS.str().empty() ? 0 : 1;
>
> Can you put this in some sort of helper function?
>
> > +
> > +  for (unsigned I = 0; I < LineBreaks; ++I)
> > +    OS << "\n";
> > +
> > +  // New lines may start a simple key.
> > +  if (!FlowLevel)
> > +    IsSimpleKeyAllowed = true;
>
> Is this a separable change?
>

No, I can't really take it out. if I commit it earlier it wouldn't be
possible to test it,
and if I'll leave it until later it would really break some of the yaml
parser tests in this
commit.

>
> >
> >    Token T;
> > -  T.Kind = Token::TK_Scalar;
> > -  T.Range = StringRef(Start, Current - Start);
> > +  T.Kind = Token::TK_BlockScalar;
> > +  T.Range = StringRef(BlockStart, Current - BlockStart);
> > +  T.Value = std::move(OS.str());
> >    TokenQueue.push_back(T);
> >    return true;
> >  }
> > @@ -1607,6 +1789,7 @@
> >    case NK_Null:
> >      return "tag:yaml.org,2002:null";
> >    case NK_Scalar:
> > +  case NK_BlockScalar:
> >      // TODO: Tag resolution.
> >      return "tag:yaml.org,2002:str";
> >    case NK_Mapping:
> > @@ -2138,6 +2321,11 @@
> >                  , AnchorInfo.Range.substr(1)
> >                  , TagInfo.Range
> >                  , T.Range);
> > +  case Token::TK_BlockScalar:
> > +    getNext();
> > +    return new (NodeAllocator)
> > +        BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> > +                        TagInfo.Range, T.Value, T.Range);
> >    case Token::TK_Key:
> >      // Don't eat the TK_Key, KeyValueNode expects it.
> >      return new (NodeAllocator)
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150512/34210320/attachment.html>