[PATCH] YAML: Implement block scalar parsing

Wed May 13 15:29:17 PDT 2015

New patch LGTM.  Thanks!

> On 2015 May 13, at 15:07, Alex L <arphaman at gmail.com> wrote:
> 
> 
> Thanks.
> Some comments below:
> 
> 2015-05-13 11:37 GMT-07:00 Duncan P. N. Exon Smith <dexonsmith at apple.com>:
> 
> > On 2015 May 13, at 09:53, Alex Lorenz <arphaman at gmail.com> wrote:
> >
> > This update uses a new helper function that consumes a line break in another part of code that I forgot to update last time.
> >
> >
> > REPOSITORY
> >  rL LLVM
> >
> > http://reviews.llvm.org/D9503
> >
> > Files:
> >  include/llvm/Support/YAMLParser.h
> >  lib/Support/YAMLParser.cpp
> >  test/YAMLParser/spec-09-14.test
> >  test/YAMLParser/spec-09-18.test
> >  test/YAMLParser/spec-09-19.test
> >  test/YAMLParser/spec-09-20.test
> >  test/YAMLParser/spec-09-21.test
> >  test/YAMLParser/spec-09-22.test
> >  test/YAMLParser/spec-09-24.test
> >  test/YAMLParser/spec-09-25.test
> >  test/YAMLParser/spec-09-26.test
> >  test/YAMLParser/spec-09-27.test
> >  test/YAMLParser/spec-09-28.test
> >  unittests/Support/YAMLParserTest.cpp
> >  utils/yaml-bench/YAMLBench.cpp
> >
> > EMAIL PREFERENCES
> >  http://reviews.llvm.org/settings/panel/emailpreferences/
> > <D9503.25702.patch>
> 
> It's a lot easier to see what's going on now; thanks.
> 
> I wonder whether "returning IsDone" is the right thing for your
> helpers, or whether "returning IsError" would be cleaner.  I also
> think in one spot you fail to return correctly on error (I pointed
> it out below somewhere).
> 
> Many of the rest of the comments amount to: "skip `else` after a
> `return`".
> 
> > Index: lib/Support/YAMLParser.cpp
> > ===================================================================
> > --- lib/Support/YAMLParser.cpp
> > +++ lib/Support/YAMLParser.cpp
> > @@ -101,6 +101,7 @@
> >  void Node::anchor() {}
> >  void NullNode::anchor() {}
> >  void ScalarNode::anchor() {}
> > +void BlockScalarNode::anchor() {}
> >  void KeyValueNode::anchor() {}
> >  void MappingNode::anchor() {}
> >  void SequenceNode::anchor() {}
> > @@ -128,6 +129,7 @@
> >      TK_Key,
> >      TK_Value,
> >      TK_Scalar,
> > +    TK_BlockScalar,
> >      TK_Alias,
> >      TK_Anchor,
> >      TK_Tag
> > @@ -137,6 +139,9 @@
> >    /// of the token in the input.
> >    StringRef Range;
> >
> > +  /// The value of a block scalar node.
> > +  std::string Value;
> > +
> >    Token() : Kind(TK_Error) {}
> >  };
> >  }
> > @@ -348,6 +353,14 @@
> >    ///          b-break.
> >    StringRef::iterator skip_b_break(StringRef::iterator Position);
> >
> > +  /// @brief Skip a single s-space[31] starting at Position.
> > +  ///
> > +  /// An s-space is 0x20
> > +  ///
> > +  /// @returns The code unit after the s-space, or Position if it's not a
> > +  ///          s-space.
> > +  StringRef::iterator skip_s_space(StringRef::iterator Position);
> > +
> >    /// @brief Skip a single s-white[33] starting at Position.
> >    ///
> >    /// A s-white is 0x20 | 0x9
> > @@ -373,6 +386,13 @@
> >    StringRef::iterator skip_while( SkipWhileFunc Func
> >                                  , StringRef::iterator Position);
> >
> > +  /// @brief Skip minimal well-formed code unit subsequences until Func
> > +  ///        returns its input.
> > +  ///
> > +  /// @returns The code unit after the last minimal well-formed code unit
> > +  ///          subsequence that Func accepted.
> > +  void advanceWhile(SkipWhileFunc Func);
> > +
> >    /// @brief Scan ns-uri-char[39]s starting at Cur.
> >    ///
> >    /// This updates Cur and Column while scanning.
> > @@ -393,6 +413,11 @@
> >    ///        Pos is whitespace or a new line
> >    bool isBlankOrBreak(StringRef::iterator Position);
> >
> > +  /// @brief Consume a single b-break[28] if it's present at the current
> > +  /// position. Return false if the code unit at the current position ins't a
> > +  /// line break.
> > +  bool consumeLineBreakIfPresent();
> > +
> >    /// @brief If IsSimpleKeyAllowed, create and push_back a new SimpleKey.
> >    void saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> >                               , unsigned AtColumn
> > @@ -466,6 +491,26 @@
> >    /// @brief Scan a block scalar starting with | or >.
> >    bool scanBlockScalar(bool IsLiteral);
> >
> > +  /// @brief Scan a chomping indicator in a block scalar header.
> > +  char scanBlockChompingIndicator();
> > +
> > +  /// @brief Scan an indentation indicator in a block scalar header.
> > +  unsigned scanBlockIndentationIndicator();
> > +
> > +  /// @brief Scan a block scalar header.
> > +  /// Return true if we've reached the end of the block scalar.
> > +  bool scanBlockScalarHeader(char &ChompingIndicator, unsigned &IndentIndicator,
> > +                             bool &WasScanningSuccessful);
> > +
> > +  /// @brief Look for the indentation level of a block scalar.
> > +  /// Return true if we've reached the end of the block scalar.
> > +  bool findBlockScalarIndent(unsigned &BlockIndent, unsigned BlockExitIndent,
> > +                             unsigned &LineBreaks);
> > +
> > +  /// @brief Scan the indentation of a text line in a block scalar.
> > +  /// Return true if we've reached the end of the block scalar.
> > +  bool scanBlockScalarIndent(unsigned BlockIndent, unsigned BlockExitIndent);
> > +
> >    /// @brief Scan a tag of the form !stuff.
> >    bool scanTag();
> >
> > @@ -612,6 +657,9 @@
> >      case Token::TK_Scalar:
> >        OS << "Scalar: ";
> >        break;
> > +    case Token::TK_BlockScalar:
> > +      OS << "Block Scalar: ";
> > +      break;
> >      case Token::TK_Alias:
> >        OS << "Alias: ";
> >        break;
> > @@ -816,6 +864,13 @@
> >    return Position;
> >  }
> >
> > +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {
> > +  if (Position == End)
> > +    return Position;
> > +  if (*Position == ' ')
> > +    return Position + 1;
> > +  return Position;
> > +}
> >
> >  StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {
> >    if (Position == End)
> > @@ -844,6 +899,12 @@
> >    return Position;
> >  }
> >
> > +void Scanner::advanceWhile(SkipWhileFunc Func) {
> > +  auto Final = skip_while(Func, Current);
> > +  Column += Final - Current;
> > +  Current = Final;
> > +}
> > +
> >  static bool is_ns_hex_digit(const char C) {
> >    return    (C >= '0' && C <= '9')
> >           || (C >= 'a' && C <= 'z')
> > @@ -906,6 +967,16 @@
> >    return false;
> >  }
> >
> > +bool Scanner::consumeLineBreakIfPresent() {
> > +  auto Next = skip_b_break(Current);
> > +  if (Next == Current)
> > +    return false;
> > +  Column = 0;
> > +  ++Line;
> > +  Current = Next;
> > +  return true;
> > +}
> > +
> >  void Scanner::saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> >                                      , unsigned AtColumn
> >                                      , bool IsRequired) {
> > @@ -1374,38 +1445,220 @@
> >    return true;
> >  }
> >
> > -bool Scanner::scanBlockScalar(bool IsLiteral) {
> > -  StringRef::iterator Start = Current;
> > -  skip(1); // Eat | or >
> > -  while(true) {
> > -    StringRef::iterator i = skip_nb_char(Current);
> > -    if (i == Current) {
> > -      if (Column == 0)
> > -        break;
> > -      i = skip_b_break(Current);
> > -      if (i != Current) {
> > -        // We got a line break.
> > -        Column = 0;
> > -        ++Line;
> > -        Current = i;
> > -        continue;
> > +char Scanner::scanBlockChompingIndicator() {
> > +  char Indicator = ' ';
> > +  if (Current != End && (*Current == '+' || *Current == '-')) {
> > +    Indicator = *Current;
> > +    skip(1);
> > +  }
> > +  return Indicator;
> > +}
> > +
> > +/// \brief Return the number of trailing line breaks that have to be emitted
> > +/// as determined by the chomping indicator.
> 
> Summarize the function in one line.  If that's not enough, add more
> after a blank line.  You don't need the `\brief` anymore, since we've
> turned on auto-brief.  Something like this:
> 
>     /// Get the number of line breaks after chomping.
>     ///
>     /// Return the number of trailing line breaks to emit, depending on
>     /// \c ChompingIndicator.
> 
> Please adjust the rest of your doxygen comments similarly.
> 
> Fixed.
>  
> 
> > +static unsigned applyChompingIndicator(char ChompingIndicator,
> 
> I think `getChompedLineBreaks()` is a better name.
> 
> Done.
>  
> 
> > +                                       unsigned LineBreaks, StringRef Str) {
> 
> I think your comments should be more succinct.  The code would be easier
> to read.
> 
> > +  // Don't output any trailing new lines if the stripping chomping behaviour is
> > +  // specified.
> 
> If you move the comment inside the `if`, you don't need to describe the
> condition.  This could just be a comment on the same line as the
> `return` (or the line before) that said something like "// Strip all
> line breaks.".
> 
> Fixed.
>  
> 
> > +  if (ChompingIndicator == '-')
> > +    return 0;
> > +  // Keep trailing lines when the keep chomping behaviour is specified.
> 
> Similarly, this comment could be "// Keep all line breaks.".
> 
> > +  else if (ChompingIndicator == '+')
> 
> This should be just `if`.
> 
> > +    return LineBreaks;
> > +  // Clip trailing lines (default chomping behaviour). The final line break
> > +  // is preserved as long as the string isn't empty, but the other trailing
> > +  // lines aren't kept.
> 
> And this could be "Clip trailing lines."
> 
> > +  return Str.empty() ? 0 : 1;
> 
> Resulting in something like:
> 
>     if (ChompingIndicator == '-') // Strip all line breaks.
>       return 0;
> 
>     if (ChompingIndicator == '+') // Keep all line breaks.
>       return LineBreaks;
> 
>     // Clip trailing lines.
>     return Str.empty() ? 0 : 1;
> 
> Done.
>  
> 
> > +}
> > +
> > +unsigned Scanner::scanBlockIndentationIndicator() {
> > +  unsigned Indent = 0;
> > +  if (Current != End && (*Current >= '1' && *Current <= '9')) {
> > +    Indent = unsigned(*Current - '0');
> > +    skip(1);
> > +  }
> > +  return Indent;
> > +}
> > +
> > +bool Scanner::scanBlockScalarHeader(char &ChompingIndicator,
> > +                                    unsigned &IndentIndicator,
> > +                                    bool &WasScanningSuccessful) {
> > +  // Chomping indicator ([164] c-chomping-indicator)
> 
> Remove this comment.  Add this to the docs for
> `scanBlockChompingIndicator()` if it's useful.
> 
> Fixed.
>  
> 
> > +  ChompingIndicator = scanBlockChompingIndicator();
> > +  // Indentation indicator ([163] c-indentation-indicator)
> 
> Remove this comment.  Add this to the docs for
> `scanBlockIndentationIndicator()` if it's useful.
> 
> Fixed.
>  
> 
> > +  IndentIndicator = scanBlockIndentationIndicator();
> > +  // Check for the chomping indicator once again.
> > +  if (ChompingIndicator == ' ')
> > +    ChompingIndicator = scanBlockChompingIndicator();
> > +  Current = skip_while(&Scanner::skip_s_white, Current);
> > +  skipComment();
> > +
> > +  if (Current == End) {
> > +    // End of file after block header, we have an empty scalar.
> > +    WasScanningSuccessful = true;
> > +    return true;
> > +  }
> > +  auto I = skip_b_break(Current);
> > +  if (I == Current) {
> > +    setError("Expected a line break after block scalar header", Current);
> > +    WasScanningSuccessful = false;
> > +    return true;
> > +  }
> > +  Current = I;
> > +  Column = 0;
> > +  ++Line;
> > +  return false;
> > +}
> > +
> > +bool Scanner::findBlockScalarIndent(unsigned &BlockIndent,
> > +                                    unsigned BlockExitIndent,
> > +                                    unsigned &LineBreaks) {
> > +  unsigned MaxAllSpaceLineCharacters = 0;
> > +  StringRef::iterator LongestAllSpaceLine;
> > +
> > +  while (true) {
> > +    advanceWhile(&Scanner::skip_s_space);
> > +    // If there is a non breaking character next, then this is a non empty
> > +    // line, thus we can use the discovered indentation as the block's
> > +    // indentation.
> 
> I'd move this...
> 
> > +    if (skip_nb_char(Current) != Current) {
> 
> ... to here:
> 
>      // This line isn't empty, so find the indentation.
> 
> Done.
>  
> 
> > +      if (Column <= BlockExitIndent) {
> > +        // This is the end of the block literal
> > +        return true;
> >        } else {
> 
> Remove the `else`.
> 
> Fixed.
>  
> 
> > -        // There was an error, which should already have been printed out.
> > -        return false;
> > +        BlockIndent = Column;
> > +        // So here we enforce the rule that the leading all space lines can't
> > +        // have more characters than the block's indentation level
> 
> Is this comment useful?  It looks to me like it's obvious from the error
> message.
> 
> Removed.
>  
> 
> > +        if (MaxAllSpaceLineCharacters > BlockIndent) {
> > +          setError("A leading all-space line must not have too many spaces",
> 
> Do you have a test for this error?
> 
> Yes, it’s tested in the YAML parser unit tests.
>  
> 
> "too many" is vague.  I suggest something more like:
> 
>     "Leading all-space lines must be smaller than the block indent"
> 
> Fixed.
>  
> 
> > +                   LongestAllSpaceLine);
> > +        }
> > +        // We found the block's indentation.
> > +        break;
> 
> This should just be `return false`.
> 
> > +      }
> > +    } else if (skip_b_break(Current) != Current) {
> 
> Remove the `else`.
>  
> Fixed.
>  
> 
> > +      // This is an all space line, so we have to record the amount of
> > +      // space characters it has so that later when we find the first
> > +      // text line that defines the indentation level we can make sure
> > +      // that all previous all space lines don't have more space
> > +      // characters than the indentation level.
> 
> This is long and inhibits legibility of the code...
> 
> > +      if (Column > MaxAllSpaceLineCharacters) {
> 
> ... so move it to here and skip the stuff that's obvious from reading
> the code.
> 
>     // Record the longest all-space line in case it's longer than the
>     // block indent.
> 
> Done.
>  
> 
> > +        MaxAllSpaceLineCharacters = Column;
> > +        LongestAllSpaceLine = Current;
> >        }
> >      }
> > -    Current = i;
> > +
> > +    // Check for EOF.
> > +    if (Current == End)
> > +      return true;
> > +
> > +    if (!consumeLineBreakIfPresent())
> > +      return true;
> > +    ++LineBreaks;
> > +  }
> > +  return false;
> > +}
> > +
> > +bool Scanner::scanBlockScalarIndent(unsigned BlockIndent,
> > +                                    unsigned BlockExitIndent) {
> > +  // Skip the indentation.
> > +  while (Column < BlockIndent) {
> > +    auto I = skip_s_space(Current);
> > +    if (I == Current)
> > +      break;
> > +    Current = I;
> >      ++Column;
> >    }
> >
> > -  if (Start == Current) {
> > -    setError("Got empty block scalar", Start);
> > -    return false;
> > +  // If this line isn't empty then we have to check the indentation to
> > +  // see if the block scalar ends.
> > +  if (skip_nb_char(Current) != Current) {
> > +    if (Column <= BlockExitIndent) {
> > +      // This is the end of the block literal
> > +      return true;
> > +    } else if (Column < BlockIndent) {
> 
> Remove the `else`.
> 
> Fixed.
>  
> 
> > +      if (Current != End && *Current == '#') {
> > +        // This is a trailing comment
> > +        return true;
> 
> What about spaces followed by a comment?  Is this duplicated from your
> other comment-handling code?
> 
> I’m not sure what you mean here. The leading all-space lines treat comments
> like content, this doesn’t come up there.
>  
> 
> > +      } else {
> 
> Remove the `else`.
> 
> Fixed.
>  
> 
> > +        setError("A text line is less indented than the block scalar", Current);
> 
> Aren't you returning "IsDone"?  Should you return `true` here?
> 
> I'm returning false for an error now.
>  
> 
> > +      }
> > +    }
> > +  }
> > +  return false;
> 
> If you invert the outer condition, you can return early, and this'll all
> be easier to read:
> 
>     if (skip_nb_char(Current) == Current)
>       return false;
> 
>     if (Column <= BlockExitIndent) // End of the block literal.
>       return true;
> 
>     if (Curent != End && *Current == '#') // Trailing comment.
>       return true;
> 
>     setError(...);
>     return false;
> 
> This logic matches what you have (I think!), but it seems kind of off.
> In particular, I'm not sure the error makes sense -- it seems to
> describe the "Column <= BlockExitIndent" scenario.  Once you've
> refactored this, please review the logic.
> 
> Refactored.
>  
> 
> > +}
> > +
> > +bool Scanner::scanBlockScalar(bool IsLiteral) {
> > +  auto Start = Current;
> > +  // Eat '|' or '>'
> > +  assert(*Current == '|' || *Current == '>');
> > +  skip(1);
> > +
> > +  // Block Header ([162] c-b-block-header)
> 
> This comment should be part of `scanBlockScalarHeader()`, if anywhere.
> 
> Removed.
>  
> 
> > +  char ChompingIndicator;
> > +  unsigned BlockIndent;
> > +  bool WasScanningSuccessful;
> > +  bool Done = scanBlockScalarHeader(ChompingIndicator, BlockIndent,
> > +                                    WasScanningSuccessful);
> 
> Usually we use returns from parsing functions as an error mechanism
> (see, e.g., `LLParser.cpp`).  Should you really use it for Done-ness?
> If you use it for errors, you get code like this:
> 
>     if (!scanBlockScalarHeader(...))
>       return false;
> 
> Play with it a bit, and see if it looks cleaner (you'd need to change
> the rest of your functions as well).  I don't feel strongly either way.
> 
> That would be good for consistency, I use the error return now in the
> updated patch.
>  
> 
> > +  if (Done) {
> > +    if (WasScanningSuccessful) {
> > +      Token T;
> > +      T.Kind = Token::TK_BlockScalar;
> > +      T.Range = StringRef(Start, Current - Start);
> > +      TokenQueue.push_back(T);
> 
> Would this logic fit inside `scanBlockScalarHeader()`?
> 
> Yes, done.
>  
> 
> > +    }
> > +    return WasScanningSuccessful;
> > +  }
> > +  Start = Current;
> > +
> > +  // Look for a first non-blank line to determine the indenation of the block
> > +  // scalar.
> 
> Not a useful comment.  The name of the function you're calling explains
> this.
> 
> Removed.
>  
> 
> > +  unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> > +  unsigned LineBreaks = 0;
> > +  if (BlockIndent == 0)
> > +    Done = findBlockScalarIndent(BlockIndent, BlockExitIndent, LineBreaks);
> > +
> > +  // Scan the block's scalars body.
> > +  SmallString<256> Str;
> > +  while (!Done) {
> > +    // Handle the block's indentation
> 
> The function name below is sufficient.  This comment isn't useful.
> 
> Removed.
>  
> 
> > +    if (scanBlockScalarIndent(BlockIndent, BlockExitIndent))
> 
> What about errors from this?
> 
> Fixed.
>  
> 
> > +      break;
> > +
> > +    // Parse the current line.
> > +    auto LineStart = Current;
> > +    advanceWhile(&Scanner::skip_nb_char);
> > +    if (LineStart != Current) {
> > +      for (unsigned I = 0; I < LineBreaks; ++I)
> > +        Str.push_back('\n');
> 
> This loop is the same as:
> 
>     Str.append(LineBreaks, '\n');
> 
> Fixed.
>  
> 
> > +      LineBreaks = 0;
> > +      Str.append(StringRef(LineStart, Current - LineStart));
> 
> I think it'll be easier to read if you put the two `Str`-modifying
> operations next to each other, and then reset `LineBreaks`.
> 
> Done.
>  
> 
> > +    }
> > +
> > +    // Check for EOF.
> > +    if (Current == End)
> > +      break;
> > +
> > +    if (!consumeLineBreakIfPresent())
> > +      break;
> > +    ++LineBreaks;
> >    }
> >
> > +  if (Current == End && !LineBreaks)
> > +    // Ensure that there is at least one line break before the end of file.
> > +    LineBreaks = 1;
> 
> Is this tested somewhere?  Does it interact correctly with the
> chomping style?
> 
> Yes - it is tested at the 'YAMLParser/spec-09-24.test’ test.
>  
> 
> > +  LineBreaks = applyChompingIndicator(ChompingIndicator, LineBreaks, Str);
> > +  for (unsigned I = 0; I < LineBreaks; ++I)
> > +    Str.push_back('\n');
> 
> Another case of:
> 
>     Str.append(LineBreaks, '\n');
> 
> With the rename I suggested for this function, I think this would read
> well with both lines merged:
> 
>     Str.append(getChompedLineBreaks(ChompingIndicator, LineBreaks, Str),
>                '\n');
> 
> Up to you though.
> 
> Fixed.
>  
> 
> > +
> > +  // New lines may start a simple key.
> > +  if (!FlowLevel)
> > +    IsSimpleKeyAllowed = true;
> 
> Is this tested somewhere?  Can it be in a separate commit?
> 
> Yes - it is tested by several tests. I would really prefer this to be in this
> commit as It can’t be tested if I were to commit it before and
> after it would break most tests in this one. Those tests are derived from
> PyYAML and I think it’s important to try and keep the compatibility.
>  
> 
> > +
> >    Token T;
> > -  T.Kind = Token::TK_Scalar;
> > +  T.Kind = Token::TK_BlockScalar;
> >    T.Range = StringRef(Start, Current - Start);
> > +  T.Value = std::move(Str.str().str());
> 
> The `std::move()` isn't useful -- this is already returning a temporary.
> 
> Fixed.
>  
> 
> >    TokenQueue.push_back(T);
> >    return true;
> >  }
> > @@ -1607,6 +1860,7 @@
> >    case NK_Null:
> >      return "tag:yaml.org,2002:null";
> >    case NK_Scalar:
> > +  case NK_BlockScalar:
> >      // TODO: Tag resolution.
> >      return "tag:yaml.org,2002:str";
> 
> Is this change tested somewhere?
>  
> Yes - all the file check tests check the !!str tag, and yaml-bench derives this tag from this string.
>  
> 
> >    case NK_Mapping:
> > @@ -2138,6 +2392,11 @@
> >                  , AnchorInfo.Range.substr(1)
> >                  , TagInfo.Range
> >                  , T.Range);
> > +  case Token::TK_BlockScalar:
> > +    getNext();
> > +    return new (NodeAllocator)
> > +        BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> > +                        TagInfo.Range, T.Value, T.Range);
> >    case Token::TK_Key:
> >      // Don't eat the TK_Key, KeyValueNode expects it.
> >      return new (NodeAllocator)