[PATCH] D102590: [YAMLParser] Add multi-line literal folding support

Fri Aug 20 12:40:47 PDT 2021

scott.linder added a comment.

I'm not sure I'm the best person for the job, but I tried to give my thoughts.

The only correctness issue I'm not certain about is the discussion around what constitutes an "empty" line. My understanding is that lines containing only whitespace can be non-empty, and that only the indent is stripped before checking if there is content in the line. @fodinabor do you have any links that describe your definition of empty? The docs I'm looking at are https://yaml.org/spec/1.2/spec.html#l-empty(n,c)

Other than that I believe your code as-is faithfully implements the YAML spec, but the approach you took doesn't match how I model it in my head, which makes it harder to review. I suggested a different approach that I'm more confident is correct (or at the very least, if it is incorrect it might help me identify the part of the spec I don't understand!)

Let me know what you think!

================
Comment at: llvm/lib/Support/YAMLParser.cpp:1538
+bool Scanner::isFoldedBlock(StringRef::iterator Position, char &NewlineOrSpace,
+                            char ChompingIndicator) {
+  bool IsFolded = false;
----------------
This appears to be dead, was it used previously?

================
Comment at: llvm/lib/Support/YAMLParser.cpp:1687
   assert(*Current == '|' || *Current == '>');
+  const auto *Start = Current;
   skip(1);
----------------
I'd rather hoist the call to `isFoldedBlock` up instead of juggling `Start` like this.

I think even better would be to replace `isFoldedBlock` with something like:

```
  /// Scan a block scalar style indicator and header.
  ///
  /// Note: This is distinct from scanBlockScalarHeader to mirror the fact that
  /// YAML does not consider the style indicator to be a part of the header.
  ///
  /// Return false if an error occurred.
  bool scanBlockScalarIndicators(char &StyleIndicator, char &ChompingIndicator, unsigned &IndentIndicator, bool &IsDone);
```

Then the code here is:

```
bool Scanner::scanBlockScalar(bool IsLiteral) {
  char StyleIndicator;
  char ChompingIndicator;
  unsigned BlockIndent;
  bool IsDone;
  if (!scanBlockScalarIndicators(StyleIndicator, ChompingIndicator, BlockIndent, IsDone))
    return false;
  ...
  // Here you could do `bool IsFolded = StyleIndicator == '>';`
```

It would be even better to have this return actual enums instead of chars or bools:

```
enum BlockStyle { Literal; Folded; };
enum BlockChomping { ... };
bool scanBlockScalarIndicators(BlockStyle &Style, BlockChomping &Chomping, unsigned &Indent, bool &IsDone);

// Now code can check `if (Style == BlockStyle::Folded)`, which I think reads the best out of the options, with the bonus of type safety to catch typos.
```

I don't know if this case is enough reason to change the style of surrounding code to use `enum`s though, so I'd be fine with the new `scan` method just accepting a `bool&` or `char&` for the style indicator and calling it a day. Do others have any thoughts?

================
Comment at: llvm/lib/Support/YAMLParser.cpp:1720-1744
     if (LineStart != Current) {
-      Str.append(LineBreaks, '\n');
+      if (Str.empty() && IsFolded) {
+        Str.append(LineBreaks, '\n');
+      } else if (Str.empty() || Str.back() != '\n') {
+        Str.append(LineBreaks, NewlineOrSpace);
+      }
       Str.append(StringRef(LineStart, Current - LineStart));
----------------
I think this can be simplified if you just think of the folding as a special case at the point this loop encounters a line where `LineStart != Current` (i.e. a line with content). At that point, for literal blocks we just insert `LineBreaks` count of newlines as we did previously, whereas for folded blocks we do one of three things:

* If `LineBreaks == 0` we don't append anything.
* If `LineBreaks == 1` we append a single space (this is the "folding").
* Otherwise, we append `LineBreaks - 1` newlines (this is the clause of YAML which allows you to get newlines into the final scalar string by using multiple adjacent newlines; the general case is that the first two newlines together get you into this "no folding" mode, then you can chain as many newlines as you want before encountering content again and going back to "folding" mode).

This avoids maintaining more state in checking the last character in `Str`, and it shows the common aspects of the two styles more clearly.

I'm not a YAML expert, though, so if I'm just wrong please correct me! This is just based on my understanding after skimming the relevant parts of the YAML spec.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D102590/new/

https://reviews.llvm.org/D102590