[llvm] [YAMLParser] Improve plain scalar spec compliance (PR #68946)

via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 16 19:33:30 PDT 2023


================
@@ -1848,13 +1869,15 @@ bool Scanner::fetchMoreTokens() {
   if (*Current == ',')
     return scanFlowEntry();
 
-  if (*Current == '-' && isBlankOrBreak(Current + 1))
+  if (*Current == '-' && (isBlankOrBreak(Current + 1) || Current + 1 == End))
     return scanBlockEntry();
 
-  if (*Current == '?' && (FlowLevel || isBlankOrBreak(Current + 1)))
+  if (*Current == '?' && (Current + 1 == End || isBlankOrBreak(Current + 1)))
     return scanKey();
 
-  if (*Current == ':' && (FlowLevel || isBlankOrBreak(Current + 1)))
+  if (*Current == ':' && ((FlowLevel && (IsAdjacentValueAllowed ||
----------------
akirchhoff-modular wrote:

> Isn't checking `FlowLevel` here redundant assuming `IsAdjacentValueAllowed` is set correctly?

That's tricky. Consider the input `"a":"b"`. This is not valid YAML -- while the key is JSON-style (half of what is needed to allow adjacent values), it is not in a flow context, so we are not allowed to have an adjacent value. I suppose `IsAdjacentValueAllowed` is misleadingly named -- I have renamed it to `IsAdjacentValueAllowedInFlow`.

That said, it may not matter that much, since we mis-parse it whether or not we consider `FlowLevel`:

* If we do not check `FlowLevel`, then we allow `"a":"b"` to parse as `{"a": "b"}`, which is not allowed per the YAML spec (bug in scanner).
* If we do check `FlowLevel`, then the `:` in `"a":"b"` does not get parsed as a key… but it does get parsed as a plain literal, so we end up parsing the document sequence `["a", ":\"b\""]`. The scanner is actually doing the right thing here, but the parser is not -- two `TK_Scalar`s in a row are not valid, but the parser accepts it anyway.

Neither of these are great outcomes (both violate the YAML spec), but I guess the first one is less surprising, so I think I will remove the `FlowLevel` check as per your suggestion. If the parser were to be fixed to disallow consecutive bare documents as the spec says it should, then we would want to reintroduce the `FlowLevel` check here in the scanner.

> I'm also confused about the condition `!isPlainSafeNonBlank`. Shouldn't we always parse a (possibly empty) value following the `:` when an adjacent value is allowed?

The condition is `||`, not `&&`. We already have an example in the test suite, `llvm/test/YAMLParser/spec-08-13.test`:

```yaml
{
  ? foo :,
  ? : bar,
}
```

Consider the line `? foo :,`, when `Current` points at the `:`. If we had only the condition `isBlankOrBreak(Current + 1) || (FlowLevel && IsAdjacentValueAllowed)`, we would incorrectly return `false` here, since `*(Current + 1) == ','` (`isBlankOrBreak(Current + 1)` false), we are in flow, and `IsAdjacentValueAllowed == false` since `foo` is YAML-style.

Hence, we do need this condition. That said… I think `!isPlainSafeNonBlank(Current + 1)` subsumes `isBlankOrBreak(Current + 1)`, so we could remove the latter.

> I also think the order should make the "normal" case of a blank/break following the indicator come first

Good point, done.

> all together is the following still correct, or am I missing something?
> 
> ```cpp
> if (*Current == ':' && (isBlankOrBreak(Current + 1) || IsAdjacentValueAllowed))
> ```

This doesn't work due to the `llvm/test/YAMLParser/spec-08-13.test` failure, but after your feedback, I think it can all be simplified down to

```cpp
if (*Current == ':' &&
    (!isPlainSafeNonBlank(Current + 1) || IsAdjacentValueAllowedInFlow))
```

https://github.com/llvm/llvm-project/pull/68946


More information about the llvm-commits mailing list