<div dir="ltr">Thanks, I'll post the updated patch soon.<div class="gmail_extra"><br><div class="gmail_quote">2015-05-12 11:57 GMT-07:00 Duncan P. N. Exon Smith <span dir="ltr"><<a href="mailto:dexonsmith@apple.com" target="_blank">dexonsmith@apple.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>
> On 2015-May-06, at 16:47, Alex Lorenz <<a href="mailto:arphaman@gmail.com">arphaman@gmail.com</a>> wrote:<br>
><br>
> As suggested by Duncan, I took out the extraction of the skipComment method from this patch,<br>
> which I've committed separately in r236663. I also took out the change to the KeyValueNode skip method,<br>
> which I've committed separately in r236669.<br>
><br>
><br>
> REPOSITORY<br>
> rL LLVM<br>
><br>
> <a href="http://reviews.llvm.org/D9503" target="_blank">http://reviews.llvm.org/D9503</a><br>
><br>
> Files:<br>
> include/llvm/Support/YAMLParser.h<br>
> lib/Support/YAMLParser.cpp<br>
> test/YAMLParser/spec-09-14.test<br>
> test/YAMLParser/spec-09-18.test<br>
> test/YAMLParser/spec-09-19.test<br>
> test/YAMLParser/spec-09-20.test<br>
> test/YAMLParser/spec-09-21.test<br>
> test/YAMLParser/spec-09-22.test<br>
> test/YAMLParser/spec-09-24.test<br>
> test/YAMLParser/spec-09-25.test<br>
> test/YAMLParser/spec-09-26.test<br>
> test/YAMLParser/spec-09-27.test<br>
> test/YAMLParser/spec-09-28.test<br>
> test/YAMLParser/spec-09-29.test<br>
> test/YAMLParser/spec-09-30.test<br>
> test/YAMLParser/spec-09-31.test<br>
> test/YAMLParser/spec-09-32.test<br>
> test/YAMLParser/spec-09-33.test<br>
> unittests/Support/YAMLParserTest.cpp<br>
> utils/yaml-bench/YAMLBench.cpp<br>
><br>
> EMAIL PREFERENCES<br>
> <a href="http://reviews.llvm.org/settings/panel/emailpreferences/" target="_blank">http://reviews.llvm.org/settings/panel/emailpreferences/</a><br>
</div></div>> <D9503.25106.patch><br>
<br>
This is looking better, but I still have a few comments. Most of them<br>
boil down to:<br>
<br>
1. Can X be separated into a helper function?<br>
2. X looks like an encapsulated feature. Can it be committed (and<br>
tested) separately?<br>
<div><div class="h5"><br>
> Index: lib/Support/YAMLParser.cpp<br>
> ===================================================================<br>
> --- lib/Support/YAMLParser.cpp<br>
> +++ lib/Support/YAMLParser.cpp<br>
> @@ -101,6 +101,7 @@<br>
> void Node::anchor() {}<br>
> void NullNode::anchor() {}<br>
> void ScalarNode::anchor() {}<br>
> +void BlockScalarNode::anchor() {}<br>
> void KeyValueNode::anchor() {}<br>
> void MappingNode::anchor() {}<br>
> void SequenceNode::anchor() {}<br>
> @@ -128,6 +129,7 @@<br>
> TK_Key,<br>
> TK_Value,<br>
> TK_Scalar,<br>
> + TK_BlockScalar,<br>
> TK_Alias,<br>
> TK_Anchor,<br>
> TK_Tag<br>
> @@ -137,6 +139,9 @@<br>
> /// of the token in the input.<br>
> StringRef Range;<br>
><br>
> + /// The value of a block scalar node.<br>
> + std::string Value;<br>
> +<br>
> Token() : Kind(TK_Error) {}<br>
> };<br>
> }<br>
> @@ -348,6 +353,14 @@<br>
> /// b-break.<br>
> StringRef::iterator skip_b_break(StringRef::iterator Position);<br>
><br>
> + /// @brief Skip a single s-space[31] starting at Position.<br>
> + ///<br>
> + /// An s-space is 0x20<br>
> + ///<br>
> + /// @returns The code unit after the s-space, or Position if it's not a<br>
> + /// s-space.<br>
> + StringRef::iterator skip_s_space(StringRef::iterator Position);<br>
> +<br>
> /// @brief Skip a single s-white[33] starting at Position.<br>
> ///<br>
> /// A s-white is 0x20 | 0x9<br>
</div></div>> @@ -612,6 +625,9 @@<br>
<span class="">> case Token::TK_Scalar:<br>
> OS << "Scalar: ";<br>
> break;<br>
> + case Token::TK_BlockScalar:<br>
> + OS << "Block Scalar: ";<br>
> + break;<br>
> case Token::TK_Alias:<br>
> OS << "Alias: ";<br>
> break;<br>
</span>> @@ -816,6 +832,13 @@<br>
<span class="">> return Position;<br>
> }<br>
><br>
> +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {<br>
> + if (Position == End)<br>
> + return Position;<br>
> + if (*Position == ' ')<br>
> + return Position + 1;<br>
> + return Position;<br>
> +}<br>
><br>
> StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {<br>
> if (Position == End)<br>
</span>> @@ -1375,37 +1398,196 @@<br>
<span class="">> }<br>
><br>
> bool Scanner::scanBlockScalar(bool IsLiteral) {<br>
> - StringRef::iterator Start = Current;<br>
<br>
</span>It looks like you've renamed this variable from `Start` to `BlockStart`.<br>
Please commit that separately since it unnecessarily complicates this<br>
patch.<br></blockquote><div><br></div><div><br></div><div>Sure, I will do that.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> - skip(1); // Eat | or ><br>
> - while(true) {<br>
> - StringRef::iterator i = skip_nb_char(Current);<br>
> - if (i == Current) {<br>
> - if (Column == 0)<br>
> - break;<br>
> - i = skip_b_break(Current);<br>
> - if (i != Current) {<br>
> - // We got a line break.<br>
> - Column = 0;<br>
> - ++Line;<br>
> - Current = i;<br>
> - continue;<br>
> - } else {<br>
> - // There was an error, which should already have been printed out.<br>
> - return false;<br>
> - }<br>
> - }<br>
> - Current = i;<br>
> - ++Column;<br>
> + std::string Str;<br>
> + raw_string_ostream OS(Str);<br>
<br>
</span>Is the raw_string_ostream really valuable here? Why not a<br>
`SmallVector<char, 256>` or some such, and just append to it directly?<br>
<br>
I ask because it looks like you're only appending `char`s, so the<br>
overhead of a stream (even a fast stream) seems indirect.<br></blockquote><div><br></div><div>Yes, that would work well.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> + auto BlockStart = Current;<br>
> + // If this is true, then the scanner has to look for a first non-blank line<br>
> + // to determine the indenation of the block scalar.<br>
> + bool LookingForIndent = true;<br>
<br>
</span>The name should be more obviously Boolean. I suggest `IsIndentKnown`<br>
(and inverting the meaning), but you may find something better.<br>
<span class=""><br>
> + unsigned BlockIndent = 0;<br>
> + unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;<br>
> + unsigned LineBreaks = 0;<br>
> +<br>
> + unsigned MaxAllSpaceLineCharacters = 0;<br>
> + StringRef::iterator LongestAllSpaceLine;<br>
> +<br>
> + bool FoldNextLine = false;<br>
<br>
</span>Please declare all of these variables immediately before their<br>
respective first uses.<br>
<span class=""><br>
> +<br>
> + skip(1); // Eat '|' or '>'<br>
<br>
</span>Would it be valuable to assert that Current is one of those two?<br>
<span class=""><br>
> +<br>
> + // Block Header ([162] c-b-block-header)<br>
> + // Chomping indicator ([164] c-chomping-indicator)<br>
> + char ChompingIndicator = ' ';<br>
> + if (Current != End && (*Current == '+' || *Current == '-')) {<br>
> + ChompingIndicator = *Current;<br>
> + skip(1);<br>
> }<br>
</span>> + // Indentation indicator ([163] c-indentation-indicator)<br>
<span class="">> + if (Current != End && (*Current >= '1' && *Current <= '9')) {<br>
> + BlockIndent = unsigned(*Current - '0');<br>
> + LookingForIndent = false;<br>
> + skip(1);<br>
> + }<br>
> + // Check for chomping indicator once again.<br>
> + if (Current != End && (*Current == '+' || *Current == '-')) {<br>
> + ChompingIndicator = *Current;<br>
> + skip(1);<br>
> + }<br>
<br>
</span>Why do you need to check this twice? Is it valid to for it to appear<br>
twice? Are there any requirements for it?<br>
<br>
If possible, these should really be thrown into helper functions.<br>
Depending on the exact semantics of the chomping indicator, something<br>
like this would be easier to read:<br>
<br>
char ChompingIndicator = scanChompingCharacter();<br>
unsigned BlockIndent = scanBlockIndent();<br>
skipChompingIndicator(ChompingIndicator);<br>
<br>
where I'm assuming:<br>
<br>
void skipChompingIndicator(char ChompingIndicator) {<br>
switch (ChompingIndicator) {<br>
default:<br>
llvm_unreachable("invalid chomping indicator");<br>
case ' ':<br>
break;<br>
case '+':<br>
case '-':<br>
skip(1);<br>
}<br>
}<br>
<br>
But exactly how to make this readible depends on what the correct<br>
behaviour is.<br></blockquote><div><br></div><div>It can appear before or after the indentation indicator. Actually right</div><div>now there's a subtle bug in there that allows you to have two chomping</div><div>indicators, so I'll have to fix it.</div><div>It would make sense to make it into a separate method, I'll do that.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> + Current = skip_while(&Scanner::skip_s_white, Current);<br>
> + skipComment();<br>
><br>
</span><span class="">> - if (Start == Current) {<br>
> - setError("Got empty block scalar", Start);<br>
> + if (Current == End) {<br>
> + // End of file after block header, we have an empty scalar.<br>
> + Token T;<br>
> + T.Kind = Token::TK_BlockScalar;<br>
> + T.Range = StringRef(BlockStart, Current - BlockStart);<br>
> + TokenQueue.push_back(T);<br>
> + return true;<br>
> + }<br>
<br>
</span>Can this change be in a separate commit?<br></blockquote><div><br></div><div>I suppose it could but I would prefer it if went into this one to make the</div><div>tests consistent. If I were to commit it later it would make tests for this</div><div>patch incomplete. Just because the old code reported an error there </div><div>doesn't mean it was correct, and I think it would be better to treat it like that </div><div>old code wasn't there so this part isn't so much of a change</div><div>but an essential part of a new feature.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> + auto I = skip_b_break(Current);<br>
> + if (I == Current) {<br>
> + setError("Expected a line break after block scalar header", Current);<br>
> return false;<br>
> }<br>
</span><span class="">> + Current = I;<br>
> + Column = 0;<br>
> + ++Line;<br>
> + BlockStart = Current;<br>
<br>
</span>And that looks like the end of the header. Please find a way to make<br>
the whole thing read something like this:<br>
<br>
scanBlockHeader(/* ... */);<br>
<br>
This probably means passing around (or otherwise storing) some state for<br>
scanning blocks.<br>
<br>
> +<br>
> + while (true) {<br>
<span class="">> + // Handle the block's indentation<br>
> + if (LookingForIndent) {<br>
> + while (true) {<br>
> + auto I = skip_s_space(Current);<br>
</span><span class="">> + if (I == Current)<br>
</span><span class="">> + break;<br>
> + Current = I;<br>
> + ++Column;<br>
> + }<br>
<br>
</span>This little while loop is a bit hard to understand, and should be split<br>
out. I suggest something like:<br>
<br>
template <class FTy> void advanceWhile(FTy F) {<br>
auto Final = skip_while(F, Current);<br>
Column += Final - Current;<br>
Current = Final;<br>
}<br>
<br>
I don't really know this class well, so if this API is strangely<br>
inconsistent just come up with something that works. Anyway, then you<br>
could use this as:<br>
<br>
if (LookingForIndent) {<br>
advanceWhile(&Scanner::skip_s_space);<br>
<span class=""><br>
> + // If there is a non breaking character next, then this is a non empty<br>
> + // line, thus we can use the discovered indentation as the block's<br>
> + // indentation.<br>
> + if (skip_nb_char(Current) != Current) {<br>
> + if (Column <= BlockExitIndent) {<br>
> + // This is the end of the block literal, so we exit the loop.<br>
> + break;<br>
> + } else {<br>
> + BlockIndent = Column;<br>
</span>> + LookingForIndent = false;<br>
<span class="">> + // So here we enforce the rule that the leading all space lines can't<br>
> + // have more characters than the block's indentation level<br>
> + if (MaxAllSpaceLineCharacters > BlockIndent) {<br>
> + setError("A leading all-space line must not have too many spaces",<br>
> + LongestAllSpaceLine);<br>
> + }<br>
> + }<br>
> + } else if (skip_b_break(Current) != Current) {<br>
> + // This is an all space line, so we have to record the amount of<br>
> + // space characters it has so that later when we find the first<br>
> + // text line that defines the indentation level we can make sure<br>
> + // that all previous all space lines don't have more space<br>
> + // characters than the indentation level.<br>
> + if (Column > MaxAllSpaceLineCharacters) {<br>
> + MaxAllSpaceLineCharacters = Column;<br>
> + LongestAllSpaceLine = Current;<br>
> + }<br>
> + }<br>
<br>
</span>This whole block of logic should probably be a in a function called<br>
something like `lookForIndent()`.<br>
<br>
Also, please find a way to separate this from the outer `while (true)`<br>
loop. You only need to find the indentation once, right?\<br></blockquote><div><br></div><div>Yes, I need to look for indentation once. I'll try to take it out of the main</div><div>loop.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> + } else {<br>
> + while (Column < BlockIndent) {<br>
> + auto I = skip_s_space(Current);<br>
</span><span class="">> + if (I == Current)<br>
</span><span class="">> + break;<br>
> + Current = I;<br>
> + ++Column;<br>
> + }<br>
> + // If this line isn't empty then we have to check the indentation to<br>
> + // see if the block scalar ends.<br>
> + if (skip_nb_char(Current) != Current) {<br>
> + if (Column <= BlockExitIndent) {<br>
> + // This is the end of the block literal, exit the loop.<br>
> + break;<br>
> + } else if (Column < BlockIndent) {<br>
> + if (Current != End && *Current == '#') {<br>
> + // This is a trailing comment, exit the loop.<br>
> + break;<br>
> + } else {<br>
> + setError("A text line is less indented than the block scalar",<br>
> + Current);<br>
<br>
</span>Should this somehow return in error?<br></blockquote><div><br></div><div>It would be better for consistency here, but I think that errors like this are recoverable. If we return here we will loose the</div><div>strict compatibility with the current YAML parser tests in this patch, which are derived from the PyYAML spec and assume that</div><div>the YAML parser is capable of recovering from some errors.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
> + }<br>
> + }<br>
> + }<br>
<br>
It looks like you should also create a function for this block of code,<br>
which scans the indentation at the beginning of every line.<br>
<span class=""><br>
> + }<br>
> +<br>
> + // Fold this line if necessary, by removing an additional line break.<br>
> + // Folding is applied to text lines which follow other text lines and don't<br>
> + // have any extra spaces (or tabs) after the indentation spaces.<br>
> + bool IsLineFolded = false;<br>
> + if (!IsLiteral && FoldNextLine && Column == BlockIndent &&<br>
> + skip_s_white(Current) == Current && skip_nb_char(Current) != Current &&<br>
> + LineBreaks) {<br>
> + IsLineFolded = true;<br>
> + --LineBreaks;<br>
> + }<br>
<br>
</span>What does it mean to have a folded line? Do you need to include this<br>
line-folding logic in the same patch, or can you add that feature after?<br></blockquote><div><br></div><div><br></div><div>Line folding applies to block scalars that start with '>'. It basically removes</div><div>the newlines for lines that don't have any spaces or other newlines between them.</div><div><br></div><div>It would be better to make line folding into a separate feature that I can send out a patch for</div><div>later, yeah.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> +<br>
> + // Parse the current line.<br>
> + auto Start = Current;<br>
> + Current = skip_while(&Scanner::skip_nb_char, Current);<br>
> + Column += (Current - Start);<br>
> + if (Start != Current) {<br>
> + for (unsigned I = 0; I < LineBreaks; ++I)<br>
> + OS << "\n";<br>
> + if (IsLineFolded && !LineBreaks)<br>
> + OS << ' ';<br>
> + LineBreaks = 0;<br>
> + auto Line = StringRef(Start, Current - Start);<br>
> + OS << Line;<br>
> + // Don't fold the next line when this line ends with a whitespace<br>
> + // character.<br>
> + FoldNextLine = Line.back() != ' ' && Line.back() != '\t';<br>
> + } else {<br>
> + FoldNextLine = false;<br>
> + }<br>
<br>
</span>This is the start of the line-folding logic (and the second half is up<br>
above), right? Is it necessary that it's split across the loop<br>
boundary, or is there a way to localize it here? It'll be easier to<br>
follow if it's localized. (I still think this might be better as a<br>
separate patch, since it seems like a separate feature.)<br>
<span class=""><br>
> +<br>
> + // Check for EOF.<br>
</span><span class="">> + if (Current == End) {<br>
</span><span class="">> + // Ensure that there is at least one line break before the end of file.<br>
> + if (!LineBreaks)<br>
> + LineBreaks = 1;<br>
> + break;<br>
> + }<br>
<br>
</span>Can this be separated into a function?<br>
<span class=""><br>
> + auto I = skip_b_break(Current);<br>
> + if (I == Current) {<br>
</span>> + // There was an error, which should already have been printed out.<br>
> + return false;<br>
<br>
Why not return at the error site?<br></blockquote><div><br></div><div>This isn't related to the two errors that don't return above. The comment is</div><div>misleading (I copied the comment from the old code).</div><div>This is a separate error (unexpected characters instead of line break) and </div><div>I think that we should actually report and error here as the error for this </div><div>particular problem doesn't appear to be actually reported.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
> + }<br>
> + // We've got a line break<br>
> + Column = 0;<br>
> + ++Line;<br>
> + Current = I;<br>
> + ++LineBreaks;<br>
> + }<br>
> +<br>
> + // Don't output any trailing new lines if the stripping chomping behaviour is<br>
> + // specified.<br>
> + if (ChompingIndicator == '-')<br>
> + LineBreaks = 0;<br>
> + // Clip trailing lines (default chomping behaviour). The final line break<br>
> + // is preserved as long as the string isn't empty, but the other trailing<br>
> + // lines aren't kept.<br>
> + else if (ChompingIndicator != '+')<br>
> + LineBreaks = OS.str().empty() ? 0 : 1;<br>
<br>
</span>Can you put this in some sort of helper function?<br>
<span class=""><br>
> +<br>
> + for (unsigned I = 0; I < LineBreaks; ++I)<br>
> + OS << "\n";<br>
> +<br>
> + // New lines may start a simple key.<br>
> + if (!FlowLevel)<br>
> + IsSimpleKeyAllowed = true;<br>
<br>
</span>Is this a separable change?<br></blockquote><div><br></div><div>No, I can't really take it out. if I commit it earlier it wouldn't be possible to test it,</div><div>and if I'll leave it until later it would really break some of the yaml parser tests in this</div><div>commit.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<span class=""><br>
><br>
> Token T;<br>
> - T.Kind = Token::TK_Scalar;<br>
> - T.Range = StringRef(Start, Current - Start);<br>
</span><span class="">> + T.Kind = Token::TK_BlockScalar;<br>
> + T.Range = StringRef(BlockStart, Current - BlockStart);<br>
</span><span class="">> + T.Value = std::move(OS.str());<br>
> TokenQueue.push_back(T);<br>
> return true;<br>
> }<br>
</span>> @@ -1607,6 +1789,7 @@<br>
<span class="">> case NK_Null:<br>
> return "tag:<a href="http://yaml.org" target="_blank">yaml.org</a>,2002:null";<br>
> case NK_Scalar:<br>
> + case NK_BlockScalar:<br>
> // TODO: Tag resolution.<br>
> return "tag:<a href="http://yaml.org" target="_blank">yaml.org</a>,2002:str";<br>
> case NK_Mapping:<br>
</span>> @@ -2138,6 +2321,11 @@<br>
<span class="">> , AnchorInfo.Range.substr(1)<br>
> , TagInfo.Range<br>
> , T.Range);<br>
</span>> + case Token::TK_BlockScalar:<br>
<div class="HOEnZb"><div class="h5">> + getNext();<br>
> + return new (NodeAllocator)<br>
> + BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),<br>
> + TagInfo.Range, T.Value, T.Range);<br>
> case Token::TK_Key:<br>
> // Don't eat the TK_Key, KeyValueNode expects it.<br>
> return new (NodeAllocator)<br>
<br>
</div></div></blockquote></div><br></div></div>