<div dir="ltr">Thanks, I'll post the updated patch soon.<div class="gmail_extra"><br><div class="gmail_quote">2015-05-12 11:57 GMT-07:00 Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

> On 2015-May-06, at 16:47, Alex Lorenz <<a href="mailto:arphaman@gmail.com">arphaman@gmail.com</a>> wrote:<br>

><br>

> As suggested by Duncan, I took out the extraction of the skipComment method from this patch,<br>

> which I've committed separately in r236663. I also took out the change to the KeyValueNode skip method,<br>

> which I've committed separately in r236669.<br>

><br>

><br>

> REPOSITORY<br>

>  rL LLVM<br>

><br>

> <a href="http://reviews.llvm.org/D9503" target="_blank">http://reviews.llvm.org/D9503</a><br>

><br>

> Files:<br>

>  include/llvm/Support/YAMLParser.h<br>

>  lib/Support/YAMLParser.cpp<br>

>  test/YAMLParser/spec-09-14.test<br>

>  test/YAMLParser/spec-09-18.test<br>

>  test/YAMLParser/spec-09-19.test<br>

>  test/YAMLParser/spec-09-20.test<br>

>  test/YAMLParser/spec-09-21.test<br>

>  test/YAMLParser/spec-09-22.test<br>

>  test/YAMLParser/spec-09-24.test<br>

>  test/YAMLParser/spec-09-25.test<br>

>  test/YAMLParser/spec-09-26.test<br>

>  test/YAMLParser/spec-09-27.test<br>

>  test/YAMLParser/spec-09-28.test<br>

>  test/YAMLParser/spec-09-29.test<br>

>  test/YAMLParser/spec-09-30.test<br>

>  test/YAMLParser/spec-09-31.test<br>

>  test/YAMLParser/spec-09-32.test<br>

>  test/YAMLParser/spec-09-33.test<br>

>  unittests/Support/YAMLParserTest.cpp<br>

>  utils/yaml-bench/YAMLBench.cpp<br>

><br>

> EMAIL PREFERENCES<br>

>  <a href="http://reviews.llvm.org/settings/panel/emailpreferences/" target="_blank">http://reviews.llvm.org/settings/panel/emailpreferences/</a><br>

</div></div>> <D9503.25106.patch><br>

<br>

This is looking better, but I still have a few comments.  Most of them<br>

boil down to:<br>

<br>

 1. Can X be separated into a helper function?<br>

 2. X looks like an encapsulated feature.  Can it be committed (and<br>

    tested) separately?<br>

<div><div class="h5"><br>

> Index: lib/Support/YAMLParser.cpp<br>

> ===================================================================<br>

> --- lib/Support/YAMLParser.cpp<br>

> +++ lib/Support/YAMLParser.cpp<br>

> @@ -101,6 +101,7 @@<br>

>  void Node::anchor() {}<br>

>  void NullNode::anchor() {}<br>

>  void ScalarNode::anchor() {}<br>

> +void BlockScalarNode::anchor() {}<br>

>  void KeyValueNode::anchor() {}<br>

>  void MappingNode::anchor() {}<br>

>  void SequenceNode::anchor() {}<br>

> @@ -128,6 +129,7 @@<br>

>      TK_Key,<br>

>      TK_Value,<br>

>      TK_Scalar,<br>

> +    TK_BlockScalar,<br>

>      TK_Alias,<br>

>      TK_Anchor,<br>

>      TK_Tag<br>

> @@ -137,6 +139,9 @@<br>

>    /// of the token in the input.<br>

>    StringRef Range;<br>

><br>

> +  /// The value of a block scalar node.<br>

> +  std::string Value;<br>

> +<br>

>    Token() : Kind(TK_Error) {}<br>

>  };<br>

>  }<br>

> @@ -348,6 +353,14 @@<br>

>    ///          b-break.<br>

>    StringRef::iterator skip_b_break(StringRef::iterator Position);<br>

><br>

> +  /// @brief Skip a single s-space[31] starting at Position.<br>

> +  ///<br>

> +  /// An s-space is 0x20<br>

> +  ///<br>

> +  /// @returns The code unit after the s-space, or Position if it's not a<br>

> +  ///          s-space.<br>

> +  StringRef::iterator skip_s_space(StringRef::iterator Position);<br>

> +<br>

>    /// @brief Skip a single s-white[33] starting at Position.<br>

>    ///<br>

>    /// A s-white is 0x20 | 0x9<br>

</div></div>> @@ -612,6 +625,9 @@<br>

<span class="">>      case Token::TK_Scalar:<br>

>        OS << "Scalar: ";<br>

>        break;<br>

> +    case Token::TK_BlockScalar:<br>

> +      OS << "Block Scalar: ";<br>

> +      break;<br>

>      case Token::TK_Alias:<br>

>        OS << "Alias: ";<br>

>        break;<br>

</span>> @@ -816,6 +832,13 @@<br>

<span class="">>    return Position;<br>

>  }<br>

><br>

> +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {<br>

> +  if (Position == End)<br>

> +    return Position;<br>

> +  if (*Position == ' ')<br>

> +    return Position + 1;<br>

> +  return Position;<br>

> +}<br>

><br>

>  StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {<br>

>    if (Position == End)<br>

</span>> @@ -1375,37 +1398,196 @@<br>

<span class="">>  }<br>

><br>

>  bool Scanner::scanBlockScalar(bool IsLiteral) {<br>

> -  StringRef::iterator Start = Current;<br>

<br>

</span>It looks like you've renamed this variable from `Start` to `BlockStart`.<br>

Please commit that separately since it unnecessarily complicates this<br>

patch.<br></blockquote><div><br></div><div><br></div><div>Sure, I will do that.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> -  skip(1); // Eat | or ><br>

> -  while(true) {<br>

> -    StringRef::iterator i = skip_nb_char(Current);<br>

> -    if (i == Current) {<br>

> -      if (Column == 0)<br>

> -        break;<br>

> -      i = skip_b_break(Current);<br>

> -      if (i != Current) {<br>

> -        // We got a line break.<br>

> -        Column = 0;<br>

> -        ++Line;<br>

> -        Current = i;<br>

> -        continue;<br>

> -      } else {<br>

> -        // There was an error, which should already have been printed out.<br>

> -        return false;<br>

> -      }<br>

> -    }<br>

> -    Current = i;<br>

> -    ++Column;<br>

> +  std::string Str;<br>

> +  raw_string_ostream OS(Str);<br>

<br>

</span>Is the raw_string_ostream really valuable here?  Why not a<br>

`SmallVector<char, 256>` or some such, and just append to it directly?<br>

<br>

I ask because it looks like you're only appending `char`s, so the<br>

overhead of a stream (even a fast stream) seems indirect.<br></blockquote><div><br></div><div>Yes, that would work well.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +  auto BlockStart = Current;<br>

> +  // If this is true, then the scanner has to look for a first non-blank line<br>

> +  // to determine the indenation of the block scalar.<br>

> +  bool LookingForIndent = true;<br>

<br>

</span>The name should be more obviously Boolean.  I suggest `IsIndentKnown`<br>

(and inverting the meaning), but you may find something better.<br>

<span class=""><br>

> +  unsigned BlockIndent = 0;<br>

> +  unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;<br>

> +  unsigned LineBreaks = 0;<br>

> +<br>

> +  unsigned MaxAllSpaceLineCharacters = 0;<br>

> +  StringRef::iterator LongestAllSpaceLine;<br>

> +<br>

> +  bool FoldNextLine = false;<br>

<br>

</span>Please declare all of these variables immediately before their<br>

respective first uses.<br>

<span class=""><br>

> +<br>

> +  skip(1); // Eat '|' or '>'<br>

<br>

</span>Would it be valuable to assert that Current is one of those two?<br>

<span class=""><br>

> +<br>

> +  // Block Header ([162] c-b-block-header)<br>

> +  // Chomping indicator ([164] c-chomping-indicator)<br>

> +  char ChompingIndicator = ' ';<br>

> +  if (Current != End && (*Current == '+' || *Current == '-')) {<br>

> +    ChompingIndicator = *Current;<br>

> +    skip(1);<br>

>    }<br>

</span>> +  // Indentation indicator ([163] c-indentation-indicator)<br>

<span class="">> +  if (Current != End && (*Current >= '1' && *Current <= '9')) {<br>

> +    BlockIndent = unsigned(*Current - '0');<br>

> +    LookingForIndent = false;<br>

> +    skip(1);<br>

> +  }<br>

> +  // Check for chomping indicator once again.<br>

> +  if (Current != End && (*Current == '+' || *Current == '-')) {<br>

> +    ChompingIndicator = *Current;<br>

> +    skip(1);<br>

> +  }<br>

<br>

</span>Why do you need to check this twice?  Is it valid to for it to appear<br>

twice?  Are there any requirements for it?<br>

<br>

If possible, these should really be thrown into helper functions.<br>

Depending on the exact semantics of the chomping indicator, something<br>

like this would be easier to read:<br>

<br>

    char ChompingIndicator = scanChompingCharacter();<br>

    unsigned BlockIndent = scanBlockIndent();<br>

    skipChompingIndicator(ChompingIndicator);<br>

<br>

where I'm assuming:<br>

<br>

    void skipChompingIndicator(char ChompingIndicator) {<br>

      switch (ChompingIndicator) {<br>

        default:<br>

          llvm_unreachable("invalid chomping indicator");<br>

        case ' ':<br>

          break;<br>

        case '+':<br>

        case '-':<br>

          skip(1);<br>

      }<br>

    }<br>

<br>

But exactly how to make this readible depends on what the correct<br>

behaviour is.<br></blockquote><div><br></div><div>It can appear before or after the indentation indicator. Actually right</div><div>now there's a subtle bug in there that allows you to have two chomping</div><div>indicators, so I'll have to fix it.</div><div>It would make sense to make it into a separate method, I'll do that.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +  Current = skip_while(&Scanner::skip_s_white, Current);<br>

> +  skipComment();<br>

><br>

</span><span class="">> -  if (Start == Current) {<br>

> -    setError("Got empty block scalar", Start);<br>

> +  if (Current == End) {<br>

> +    // End of file after block header, we have an empty scalar.<br>

> +    Token T;<br>

> +    T.Kind = Token::TK_BlockScalar;<br>

> +    T.Range = StringRef(BlockStart, Current - BlockStart);<br>

> +    TokenQueue.push_back(T);<br>

> +    return true;<br>

> +  }<br>

<br>

</span>Can this change be in a separate commit?<br></blockquote><div><br></div><div>I suppose it could but I would prefer it if went into this one to make the</div><div>tests consistent. If I were to commit it later it would make tests for this</div><div>patch incomplete. Just because the old code reported an error there </div><div>doesn't mean it was correct, and I think it would be better to treat it like that </div><div>old code wasn't there so this part isn't so much of a change</div><div>but an essential part of a new feature.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +  auto I = skip_b_break(Current);<br>

> +  if (I == Current) {<br>

> +    setError("Expected a line break after block scalar header", Current);<br>

>      return false;<br>

>    }<br>

</span><span class="">> +  Current = I;<br>

> +  Column = 0;<br>

> +  ++Line;<br>

> +  BlockStart = Current;<br>

<br>

</span>And that looks like the end of the header.  Please find a way to make<br>

the whole thing read something like this:<br>

<br>

    scanBlockHeader(/* ... */);<br>

<br>

This probably means passing around (or otherwise storing) some state for<br>

scanning blocks.<br>

<br>

> +<br>

> +  while (true) {<br>

<span class="">> +    // Handle the block's indentation<br>

> +    if (LookingForIndent) {<br>

> +      while (true) {<br>

> +        auto I = skip_s_space(Current);<br>

</span><span class="">> +        if (I == Current)<br>

</span><span class="">> +          break;<br>

> +        Current = I;<br>

> +        ++Column;<br>

> +      }<br>

<br>

</span>This little while loop is a bit hard to understand, and should be split<br>

out.  I suggest something like:<br>

<br>

    template <class FTy> void advanceWhile(FTy F) {<br>

      auto Final = skip_while(F, Current);<br>

      Column += Final - Current;<br>

      Current = Final;<br>

    }<br>

<br>

I don't really know this class well, so if this API is strangely<br>

inconsistent just come up with something that works.  Anyway, then you<br>

could use this as:<br>

<br>

    if (LookingForIndent) {<br>

      advanceWhile(&Scanner::skip_s_space);<br>

<span class=""><br>

> +      // If there is a non breaking character next, then this is a non empty<br>

> +      // line, thus we can use the discovered indentation as the block's<br>

> +      // indentation.<br>

> +      if (skip_nb_char(Current) != Current) {<br>

> +        if (Column <= BlockExitIndent) {<br>

> +          // This is the end of the block literal, so we exit the loop.<br>

> +          break;<br>

> +        } else {<br>

> +          BlockIndent = Column;<br>

</span>> +          LookingForIndent = false;<br>

<span class="">> +          // So here we enforce the rule that the leading all space lines can't<br>

> +          // have more characters than the block's indentation level<br>

> +          if (MaxAllSpaceLineCharacters > BlockIndent) {<br>

> +            setError("A leading all-space line must not have too many spaces",<br>

> +                     LongestAllSpaceLine);<br>

> +          }<br>

> +        }<br>

> +      } else if (skip_b_break(Current) != Current) {<br>

> +        // This is an all space line, so we have to record the amount of<br>

> +        // space characters it has so that later when we find the first<br>

> +        // text line that defines the indentation level we can make sure<br>

> +        // that all previous all space lines don't have more space<br>

> +        // characters than the indentation level.<br>

> +        if (Column > MaxAllSpaceLineCharacters) {<br>

> +          MaxAllSpaceLineCharacters = Column;<br>

> +          LongestAllSpaceLine = Current;<br>

> +        }<br>

> +      }<br>

<br>

</span>This whole block of logic should probably be a in a function called<br>

something like `lookForIndent()`.<br>

<br>

Also, please find a way to separate this from the outer `while (true)`<br>

loop.  You only need to find the indentation once, right?\<br></blockquote><div><br></div><div>Yes, I need to look for indentation once. I'll try to take it out of the main</div><div>loop.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +    } else {<br>

> +      while (Column < BlockIndent) {<br>

> +        auto I = skip_s_space(Current);<br>

</span><span class="">> +        if (I == Current)<br>

</span><span class="">> +          break;<br>

> +        Current = I;<br>

> +        ++Column;<br>

> +      }<br>

> +      // If this line isn't empty then we have to check the indentation to<br>

> +      // see if the block scalar ends.<br>

> +      if (skip_nb_char(Current) != Current) {<br>

> +        if (Column <= BlockExitIndent) {<br>

> +          // This is the end of the block literal, exit the loop.<br>

> +          break;<br>

> +        } else if (Column < BlockIndent) {<br>

> +          if (Current != End && *Current == '#') {<br>

> +            // This is a trailing comment, exit the loop.<br>

> +            break;<br>

> +          } else {<br>

> +            setError("A text line is less indented than the block scalar",<br>

> +                     Current);<br>

<br>

</span>Should this somehow return in error?<br></blockquote><div><br></div><div>It would be better for consistency here, but I think that errors like this are recoverable. If we return here we will loose the</div><div>strict compatibility with the current YAML parser tests in this patch, which are derived from the PyYAML spec and assume that</div><div>the YAML parser is capable of recovering from some errors.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

> +          }<br>

> +        }<br>

> +      }<br>

<br>

It looks like you should also create a function for this block of code,<br>

which scans the indentation at the beginning of every line.<br>

<span class=""><br>

> +    }<br>

> +<br>

> +    // Fold this line if necessary, by removing an additional line break.<br>

> +    // Folding is applied to text lines which follow other text lines and don't<br>

> +    // have any extra spaces (or tabs) after the indentation spaces.<br>

> +    bool IsLineFolded = false;<br>

> +    if (!IsLiteral && FoldNextLine && Column == BlockIndent &&<br>

> +        skip_s_white(Current) == Current && skip_nb_char(Current) != Current &&<br>

> +        LineBreaks) {<br>

> +      IsLineFolded = true;<br>

> +      --LineBreaks;<br>

> +    }<br>

<br>

</span>What does it mean to have a folded line?  Do you need to include this<br>

line-folding logic in the same patch, or can you add that feature after?<br></blockquote><div><br></div><div><br></div><div>Line folding applies to block scalars that start with '>'. It basically removes</div><div>the newlines for lines that don't have any spaces or other newlines between them.</div><div><br></div><div>It would be better to make line folding into a separate feature that I can send out a patch for</div><div>later, yeah.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +<br>

> +    // Parse the current line.<br>

> +    auto Start = Current;<br>

> +    Current = skip_while(&Scanner::skip_nb_char, Current);<br>

> +    Column += (Current - Start);<br>

> +    if (Start != Current) {<br>

> +      for (unsigned I = 0; I < LineBreaks; ++I)<br>

> +        OS << "\n";<br>

> +      if (IsLineFolded && !LineBreaks)<br>

> +        OS << ' ';<br>

> +      LineBreaks = 0;<br>

> +      auto Line = StringRef(Start, Current - Start);<br>

> +      OS << Line;<br>

> +      // Don't fold the next line when this line ends with a whitespace<br>

> +      // character.<br>

> +      FoldNextLine = Line.back() != ' ' && Line.back() != '\t';<br>

> +    } else {<br>

> +      FoldNextLine = false;<br>

> +    }<br>

<br>

</span>This is the start of the line-folding logic (and the second half is up<br>

above), right?  Is it necessary that it's split across the loop<br>

boundary, or is there a way to localize it here?  It'll be easier to<br>

follow if it's localized.  (I still think this might be better as a<br>

separate patch, since it seems like a separate feature.)<br>

<span class=""><br>

> +<br>

> +    // Check for EOF.<br>

</span><span class="">> +    if (Current == End) {<br>

</span><span class="">> +      // Ensure that there is at least one line break before the end of file.<br>

> +      if (!LineBreaks)<br>

> +        LineBreaks = 1;<br>

> +      break;<br>

> +    }<br>

<br>

</span>Can this be separated into a function?<br>

<span class=""><br>

> +    auto I = skip_b_break(Current);<br>

> +    if (I == Current) {<br>

</span>> +      // There was an error, which should already have been printed out.<br>

> +      return false;<br>

<br>

Why not return at the error site?<br></blockquote><div><br></div><div>This isn't related to the two errors that don't return above. The comment is</div><div>misleading (I copied the comment from the old code).</div><div>This is a separate error (unexpected characters instead of line break) and </div><div>I think that we should actually report and error here as the error for this </div><div>particular problem doesn't appear to be actually reported.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> +    }<br>

> +    // We've got a line break<br>

> +    Column = 0;<br>

> +    ++Line;<br>

> +    Current = I;<br>

> +    ++LineBreaks;<br>

> +  }<br>

> +<br>

> +  // Don't output any trailing new lines if the stripping chomping behaviour is<br>

> +  // specified.<br>

> +  if (ChompingIndicator == '-')<br>

> +    LineBreaks = 0;<br>

> +  // Clip trailing lines (default chomping behaviour). The final line break<br>

> +  // is preserved as long as the string isn't empty, but the other trailing<br>

> +  // lines aren't kept.<br>

> +  else if (ChompingIndicator != '+')<br>

> +    LineBreaks = OS.str().empty() ? 0 : 1;<br>

<br>

</span>Can you put this in some sort of helper function?<br>

<span class=""><br>

> +<br>

> +  for (unsigned I = 0; I < LineBreaks; ++I)<br>

> +    OS << "\n";<br>

> +<br>

> +  // New lines may start a simple key.<br>

> +  if (!FlowLevel)<br>

> +    IsSimpleKeyAllowed = true;<br>

<br>

</span>Is this a separable change?<br></blockquote><div><br></div><div>No, I can't really take it out. if I commit it earlier it wouldn't be possible to test it,</div><div>and if I'll leave it until later it would really break some of the yaml parser tests in this</div><div>commit.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

><br>

>    Token T;<br>

> -  T.Kind = Token::TK_Scalar;<br>

> -  T.Range = StringRef(Start, Current - Start);<br>

</span><span class="">> +  T.Kind = Token::TK_BlockScalar;<br>

> +  T.Range = StringRef(BlockStart, Current - BlockStart);<br>

</span><span class="">> +  T.Value = std::move(OS.str());<br>

>    TokenQueue.push_back(T);<br>

>    return true;<br>

>  }<br>

</span>> @@ -1607,6 +1789,7 @@<br>

<span class="">>    case NK_Null:<br>

>      return "tag:<a href="http://yaml.org" target="_blank">yaml.org</a>,2002:null";<br>

>    case NK_Scalar:<br>

> +  case NK_BlockScalar:<br>

>      // TODO: Tag resolution.<br>

>      return "tag:<a href="http://yaml.org" target="_blank">yaml.org</a>,2002:str";<br>

>    case NK_Mapping:<br>

</span>> @@ -2138,6 +2321,11 @@<br>

<span class="">>                  , AnchorInfo.Range.substr(1)<br>

>                  , TagInfo.Range<br>

>                  , T.Range);<br>

</span>> +  case Token::TK_BlockScalar:<br>

<div class="HOEnZb"><div class="h5">> +    getNext();<br>

> +    return new (NodeAllocator)<br>

> +        BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),<br>

> +                        TagInfo.Range, T.Value, T.Range);<br>

>    case Token::TK_Key:<br>

>      // Don't eat the TK_Key, KeyValueNode expects it.<br>

>      return new (NodeAllocator)<br>

<br>

</div></div></blockquote></div><br></div></div>