[PATCH] YAML: Implement block scalar parsing
Duncan P. N. Exon Smith
dexonsmith at apple.com
Wed May 13 11:37:27 PDT 2015
> On 2015 May 13, at 09:53, Alex Lorenz <arphaman at gmail.com> wrote:
>
> This update uses a new helper function that consumes a line break in another part of code that I forgot to update last time.
>
>
> REPOSITORY
> rL LLVM
>
> http://reviews.llvm.org/D9503
>
> Files:
> include/llvm/Support/YAMLParser.h
> lib/Support/YAMLParser.cpp
> test/YAMLParser/spec-09-14.test
> test/YAMLParser/spec-09-18.test
> test/YAMLParser/spec-09-19.test
> test/YAMLParser/spec-09-20.test
> test/YAMLParser/spec-09-21.test
> test/YAMLParser/spec-09-22.test
> test/YAMLParser/spec-09-24.test
> test/YAMLParser/spec-09-25.test
> test/YAMLParser/spec-09-26.test
> test/YAMLParser/spec-09-27.test
> test/YAMLParser/spec-09-28.test
> unittests/Support/YAMLParserTest.cpp
> utils/yaml-bench/YAMLBench.cpp
>
> EMAIL PREFERENCES
> http://reviews.llvm.org/settings/panel/emailpreferences/
> <D9503.25702.patch>
It's a lot easier to see what's going on now; thanks.
I wonder whether "returning IsDone" is the right thing for your
helpers, or whether "returning IsError" would be cleaner. I also
think in one spot you fail to return correctly on error (I pointed
it out below somewhere).
Many of the rest of the comments amount to: "skip `else` after a
`return`".
> Index: lib/Support/YAMLParser.cpp
> ===================================================================
> --- lib/Support/YAMLParser.cpp
> +++ lib/Support/YAMLParser.cpp
> @@ -101,6 +101,7 @@
> void Node::anchor() {}
> void NullNode::anchor() {}
> void ScalarNode::anchor() {}
> +void BlockScalarNode::anchor() {}
> void KeyValueNode::anchor() {}
> void MappingNode::anchor() {}
> void SequenceNode::anchor() {}
> @@ -128,6 +129,7 @@
> TK_Key,
> TK_Value,
> TK_Scalar,
> + TK_BlockScalar,
> TK_Alias,
> TK_Anchor,
> TK_Tag
> @@ -137,6 +139,9 @@
> /// of the token in the input.
> StringRef Range;
>
> + /// The value of a block scalar node.
> + std::string Value;
> +
> Token() : Kind(TK_Error) {}
> };
> }
> @@ -348,6 +353,14 @@
> /// b-break.
> StringRef::iterator skip_b_break(StringRef::iterator Position);
>
> + /// @brief Skip a single s-space[31] starting at Position.
> + ///
> + /// An s-space is 0x20
> + ///
> + /// @returns The code unit after the s-space, or Position if it's not a
> + /// s-space.
> + StringRef::iterator skip_s_space(StringRef::iterator Position);
> +
> /// @brief Skip a single s-white[33] starting at Position.
> ///
> /// A s-white is 0x20 | 0x9
> @@ -373,6 +386,13 @@
> StringRef::iterator skip_while( SkipWhileFunc Func
> , StringRef::iterator Position);
>
> + /// @brief Skip minimal well-formed code unit subsequences until Func
> + /// returns its input.
> + ///
> + /// @returns The code unit after the last minimal well-formed code unit
> + /// subsequence that Func accepted.
> + void advanceWhile(SkipWhileFunc Func);
> +
> /// @brief Scan ns-uri-char[39]s starting at Cur.
> ///
> /// This updates Cur and Column while scanning.
> @@ -393,6 +413,11 @@
> /// Pos is whitespace or a new line
> bool isBlankOrBreak(StringRef::iterator Position);
>
> + /// @brief Consume a single b-break[28] if it's present at the current
> + /// position. Return false if the code unit at the current position ins't a
> + /// line break.
> + bool consumeLineBreakIfPresent();
> +
> /// @brief If IsSimpleKeyAllowed, create and push_back a new SimpleKey.
> void saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> , unsigned AtColumn
> @@ -466,6 +491,26 @@
> /// @brief Scan a block scalar starting with | or >.
> bool scanBlockScalar(bool IsLiteral);
>
> + /// @brief Scan a chomping indicator in a block scalar header.
> + char scanBlockChompingIndicator();
> +
> + /// @brief Scan an indentation indicator in a block scalar header.
> + unsigned scanBlockIndentationIndicator();
> +
> + /// @brief Scan a block scalar header.
> + /// Return true if we've reached the end of the block scalar.
> + bool scanBlockScalarHeader(char &ChompingIndicator, unsigned &IndentIndicator,
> + bool &WasScanningSuccessful);
> +
> + /// @brief Look for the indentation level of a block scalar.
> + /// Return true if we've reached the end of the block scalar.
> + bool findBlockScalarIndent(unsigned &BlockIndent, unsigned BlockExitIndent,
> + unsigned &LineBreaks);
> +
> + /// @brief Scan the indentation of a text line in a block scalar.
> + /// Return true if we've reached the end of the block scalar.
> + bool scanBlockScalarIndent(unsigned BlockIndent, unsigned BlockExitIndent);
> +
> /// @brief Scan a tag of the form !stuff.
> bool scanTag();
>
> @@ -612,6 +657,9 @@
> case Token::TK_Scalar:
> OS << "Scalar: ";
> break;
> + case Token::TK_BlockScalar:
> + OS << "Block Scalar: ";
> + break;
> case Token::TK_Alias:
> OS << "Alias: ";
> break;
> @@ -816,6 +864,13 @@
> return Position;
> }
>
> +StringRef::iterator Scanner::skip_s_space(StringRef::iterator Position) {
> + if (Position == End)
> + return Position;
> + if (*Position == ' ')
> + return Position + 1;
> + return Position;
> +}
>
> StringRef::iterator Scanner::skip_s_white(StringRef::iterator Position) {
> if (Position == End)
> @@ -844,6 +899,12 @@
> return Position;
> }
>
> +void Scanner::advanceWhile(SkipWhileFunc Func) {
> + auto Final = skip_while(Func, Current);
> + Column += Final - Current;
> + Current = Final;
> +}
> +
> static bool is_ns_hex_digit(const char C) {
> return (C >= '0' && C <= '9')
> || (C >= 'a' && C <= 'z')
> @@ -906,6 +967,16 @@
> return false;
> }
>
> +bool Scanner::consumeLineBreakIfPresent() {
> + auto Next = skip_b_break(Current);
> + if (Next == Current)
> + return false;
> + Column = 0;
> + ++Line;
> + Current = Next;
> + return true;
> +}
> +
> void Scanner::saveSimpleKeyCandidate( TokenQueueT::iterator Tok
> , unsigned AtColumn
> , bool IsRequired) {
> @@ -1374,38 +1445,220 @@
> return true;
> }
>
> -bool Scanner::scanBlockScalar(bool IsLiteral) {
> - StringRef::iterator Start = Current;
> - skip(1); // Eat | or >
> - while(true) {
> - StringRef::iterator i = skip_nb_char(Current);
> - if (i == Current) {
> - if (Column == 0)
> - break;
> - i = skip_b_break(Current);
> - if (i != Current) {
> - // We got a line break.
> - Column = 0;
> - ++Line;
> - Current = i;
> - continue;
> +char Scanner::scanBlockChompingIndicator() {
> + char Indicator = ' ';
> + if (Current != End && (*Current == '+' || *Current == '-')) {
> + Indicator = *Current;
> + skip(1);
> + }
> + return Indicator;
> +}
> +
> +/// \brief Return the number of trailing line breaks that have to be emitted
> +/// as determined by the chomping indicator.
Summarize the function in one line. If that's not enough, add more
after a blank line. You don't need the `\brief` anymore, since we've
turned on auto-brief. Something like this:
/// Get the number of line breaks after chomping.
///
/// Return the number of trailing line breaks to emit, depending on
/// \c ChompingIndicator.
Please adjust the rest of your doxygen comments similarly.
> +static unsigned applyChompingIndicator(char ChompingIndicator,
I think `getChompedLineBreaks()` is a better name.
> + unsigned LineBreaks, StringRef Str) {
I think your comments should be more succinct. The code would be easier
to read.
> + // Don't output any trailing new lines if the stripping chomping behaviour is
> + // specified.
If you move the comment inside the `if`, you don't need to describe the
condition. This could just be a comment on the same line as the
`return` (or the line before) that said something like "// Strip all
line breaks.".
> + if (ChompingIndicator == '-')
> + return 0;
> + // Keep trailing lines when the keep chomping behaviour is specified.
Similarly, this comment could be "// Keep all line breaks.".
> + else if (ChompingIndicator == '+')
This should be just `if`.
> + return LineBreaks;
> + // Clip trailing lines (default chomping behaviour). The final line break
> + // is preserved as long as the string isn't empty, but the other trailing
> + // lines aren't kept.
And this could be "Clip trailing lines."
> + return Str.empty() ? 0 : 1;
Resulting in something like:
if (ChompingIndicator == '-') // Strip all line breaks.
return 0;
if (ChompingIndicator == '+') // Keep all line breaks.
return LineBreaks;
// Clip trailing lines.
return Str.empty() ? 0 : 1;
> +}
> +
> +unsigned Scanner::scanBlockIndentationIndicator() {
> + unsigned Indent = 0;
> + if (Current != End && (*Current >= '1' && *Current <= '9')) {
> + Indent = unsigned(*Current - '0');
> + skip(1);
> + }
> + return Indent;
> +}
> +
> +bool Scanner::scanBlockScalarHeader(char &ChompingIndicator,
> + unsigned &IndentIndicator,
> + bool &WasScanningSuccessful) {
> + // Chomping indicator ([164] c-chomping-indicator)
Remove this comment. Add this to the docs for
`scanBlockChompingIndicator()` if it's useful.
> + ChompingIndicator = scanBlockChompingIndicator();
> + // Indentation indicator ([163] c-indentation-indicator)
Remove this comment. Add this to the docs for
`scanBlockIndentationIndicator()` if it's useful.
> + IndentIndicator = scanBlockIndentationIndicator();
> + // Check for the chomping indicator once again.
> + if (ChompingIndicator == ' ')
> + ChompingIndicator = scanBlockChompingIndicator();
> + Current = skip_while(&Scanner::skip_s_white, Current);
> + skipComment();
> +
> + if (Current == End) {
> + // End of file after block header, we have an empty scalar.
> + WasScanningSuccessful = true;
> + return true;
> + }
> + auto I = skip_b_break(Current);
> + if (I == Current) {
> + setError("Expected a line break after block scalar header", Current);
> + WasScanningSuccessful = false;
> + return true;
> + }
> + Current = I;
> + Column = 0;
> + ++Line;
> + return false;
> +}
> +
> +bool Scanner::findBlockScalarIndent(unsigned &BlockIndent,
> + unsigned BlockExitIndent,
> + unsigned &LineBreaks) {
> + unsigned MaxAllSpaceLineCharacters = 0;
> + StringRef::iterator LongestAllSpaceLine;
> +
> + while (true) {
> + advanceWhile(&Scanner::skip_s_space);
> + // If there is a non breaking character next, then this is a non empty
> + // line, thus we can use the discovered indentation as the block's
> + // indentation.
I'd move this...
> + if (skip_nb_char(Current) != Current) {
... to here:
// This line isn't empty, so find the indentation.
> + if (Column <= BlockExitIndent) {
> + // This is the end of the block literal
> + return true;
> } else {
Remove the `else`.
> - // There was an error, which should already have been printed out.
> - return false;
> + BlockIndent = Column;
> + // So here we enforce the rule that the leading all space lines can't
> + // have more characters than the block's indentation level
Is this comment useful? It looks to me like it's obvious from the error
message.
> + if (MaxAllSpaceLineCharacters > BlockIndent) {
> + setError("A leading all-space line must not have too many spaces",
Do you have a test for this error?
"too many" is vague. I suggest something more like:
"Leading all-space lines must be smaller than the block indent"
> + LongestAllSpaceLine);
> + }
> + // We found the block's indentation.
> + break;
This should just be `return false`.
> + }
> + } else if (skip_b_break(Current) != Current) {
Remove the `else`.
> + // This is an all space line, so we have to record the amount of
> + // space characters it has so that later when we find the first
> + // text line that defines the indentation level we can make sure
> + // that all previous all space lines don't have more space
> + // characters than the indentation level.
This is long and inhibits legibility of the code...
> + if (Column > MaxAllSpaceLineCharacters) {
... so move it to here and skip the stuff that's obvious from reading
the code.
// Record the longest all-space line in case it's longer than the
// block indent.
> + MaxAllSpaceLineCharacters = Column;
> + LongestAllSpaceLine = Current;
> }
> }
> - Current = i;
> +
> + // Check for EOF.
> + if (Current == End)
> + return true;
> +
> + if (!consumeLineBreakIfPresent())
> + return true;
> + ++LineBreaks;
> + }
> + return false;
> +}
> +
> +bool Scanner::scanBlockScalarIndent(unsigned BlockIndent,
> + unsigned BlockExitIndent) {
> + // Skip the indentation.
> + while (Column < BlockIndent) {
> + auto I = skip_s_space(Current);
> + if (I == Current)
> + break;
> + Current = I;
> ++Column;
> }
>
> - if (Start == Current) {
> - setError("Got empty block scalar", Start);
> - return false;
> + // If this line isn't empty then we have to check the indentation to
> + // see if the block scalar ends.
> + if (skip_nb_char(Current) != Current) {
> + if (Column <= BlockExitIndent) {
> + // This is the end of the block literal
> + return true;
> + } else if (Column < BlockIndent) {
Remove the `else`.
> + if (Current != End && *Current == '#') {
> + // This is a trailing comment
> + return true;
What about spaces followed by a comment? Is this duplicated from your
other comment-handling code?
> + } else {
Remove the `else`.
> + setError("A text line is less indented than the block scalar", Current);
Aren't you returning "IsDone"? Should you return `true` here?
> + }
> + }
> + }
> + return false;
If you invert the outer condition, you can return early, and this'll all
be easier to read:
if (skip_nb_char(Current) == Current)
return false;
if (Column <= BlockExitIndent) // End of the block literal.
return true;
if (Curent != End && *Current == '#') // Trailing comment.
return true;
setError(...);
return false;
This logic matches what you have (I think!), but it seems kind of off.
In particular, I'm not sure the error makes sense -- it seems to
describe the "Column <= BlockExitIndent" scenario. Once you've
refactored this, please review the logic.
> +}
> +
> +bool Scanner::scanBlockScalar(bool IsLiteral) {
> + auto Start = Current;
> + // Eat '|' or '>'
> + assert(*Current == '|' || *Current == '>');
> + skip(1);
> +
> + // Block Header ([162] c-b-block-header)
This comment should be part of `scanBlockScalarHeader()`, if anywhere.
> + char ChompingIndicator;
> + unsigned BlockIndent;
> + bool WasScanningSuccessful;
> + bool Done = scanBlockScalarHeader(ChompingIndicator, BlockIndent,
> + WasScanningSuccessful);
Usually we use returns from parsing functions as an error mechanism
(see, e.g., `LLParser.cpp`). Should you really use it for Done-ness?
If you use it for errors, you get code like this:
if (!scanBlockScalarHeader(...))
return false;
Play with it a bit, and see if it looks cleaner (you'd need to change
the rest of your functions as well). I don't feel strongly either way.
> + if (Done) {
> + if (WasScanningSuccessful) {
> + Token T;
> + T.Kind = Token::TK_BlockScalar;
> + T.Range = StringRef(Start, Current - Start);
> + TokenQueue.push_back(T);
Would this logic fit inside `scanBlockScalarHeader()`?
> + }
> + return WasScanningSuccessful;
> + }
> + Start = Current;
> +
> + // Look for a first non-blank line to determine the indenation of the block
> + // scalar.
Not a useful comment. The name of the function you're calling explains
this.
> + unsigned BlockExitIndent = Indent < 0 ? 0 : (unsigned)Indent;
> + unsigned LineBreaks = 0;
> + if (BlockIndent == 0)
> + Done = findBlockScalarIndent(BlockIndent, BlockExitIndent, LineBreaks);
> +
> + // Scan the block's scalars body.
> + SmallString<256> Str;
> + while (!Done) {
> + // Handle the block's indentation
The function name below is sufficient. This comment isn't useful.
> + if (scanBlockScalarIndent(BlockIndent, BlockExitIndent))
What about errors from this?
> + break;
> +
> + // Parse the current line.
> + auto LineStart = Current;
> + advanceWhile(&Scanner::skip_nb_char);
> + if (LineStart != Current) {
> + for (unsigned I = 0; I < LineBreaks; ++I)
> + Str.push_back('\n');
This loop is the same as:
Str.append(LineBreaks, '\n');
> + LineBreaks = 0;
> + Str.append(StringRef(LineStart, Current - LineStart));
I think it'll be easier to read if you put the two `Str`-modifying
operations next to each other, and then reset `LineBreaks`.
> + }
> +
> + // Check for EOF.
> + if (Current == End)
> + break;
> +
> + if (!consumeLineBreakIfPresent())
> + break;
> + ++LineBreaks;
> }
>
> + if (Current == End && !LineBreaks)
> + // Ensure that there is at least one line break before the end of file.
> + LineBreaks = 1;
Is this tested somewhere? Does it interact correctly with the
chomping style?
> + LineBreaks = applyChompingIndicator(ChompingIndicator, LineBreaks, Str);
> + for (unsigned I = 0; I < LineBreaks; ++I)
> + Str.push_back('\n');
Another case of:
Str.append(LineBreaks, '\n');
With the rename I suggested for this function, I think this would read
well with both lines merged:
Str.append(getChompedLineBreaks(ChompingIndicator, LineBreaks, Str),
'\n');
Up to you though.
> +
> + // New lines may start a simple key.
> + if (!FlowLevel)
> + IsSimpleKeyAllowed = true;
Is this tested somewhere? Can it be in a separate commit?
> +
> Token T;
> - T.Kind = Token::TK_Scalar;
> + T.Kind = Token::TK_BlockScalar;
> T.Range = StringRef(Start, Current - Start);
> + T.Value = std::move(Str.str().str());
The `std::move()` isn't useful -- this is already returning a temporary.
> TokenQueue.push_back(T);
> return true;
> }
> @@ -1607,6 +1860,7 @@
> case NK_Null:
> return "tag:yaml.org,2002:null";
> case NK_Scalar:
> + case NK_BlockScalar:
> // TODO: Tag resolution.
> return "tag:yaml.org,2002:str";
Is this change tested somewhere?
> case NK_Mapping:
> @@ -2138,6 +2392,11 @@
> , AnchorInfo.Range.substr(1)
> , TagInfo.Range
> , T.Range);
> + case Token::TK_BlockScalar:
> + getNext();
> + return new (NodeAllocator)
> + BlockScalarNode(stream.CurrentDoc, AnchorInfo.Range.substr(1),
> + TagInfo.Range, T.Value, T.Range);
> case Token::TK_Key:
> // Don't eat the TK_Key, KeyValueNode expects it.
> return new (NodeAllocator)
More information about the llvm-commits
mailing list