[PATCH] D128486: [pseudo] Add error-recovery framework & brace-based recovery

Wed Jun 29 13:41:08 PDT 2022

sammccall added inline comments.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/GLR.h:150
+// OldHeads is the parse state at TokenIndex.
+// This function consumes consumes zero or more tokens (advancing TokenIndex),
+// and places any recovery states created in NewHeads.
----------------
hokein wrote:
> If I read it correctly, consuming zero token is consider failure of the function, right?
Consuming zero tokens + producing heads is success, consuming zero tokens + not producing heads is failure.
Tweaked this comment a bit.

(Also fixed a bit of logic that ignored recovery that didn't advance TokenIndex!)

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/Grammar.h:134
+  uint8_t RecoveryIndex = -1;
+  RecoveryStrategy Recovery = RecoveryStrategy::None;
+
----------------
hokein wrote:
> So each rule will have at most 1 recovery-strategy (it is fine for the initial version), but I think in the future we want more (probably we need to change the Sequence to an array of `{SymbolID, RevoeryStrategy}`).
> 
> ```
> selection-statement := IF CONSTEXPR_opt ( init-statement_opt condition ) statement ELSE statement
> ```
> We might want different recoveries in `( . init-statement_opt condition )` `(init-statement_opt condition) . statement`, `ELSE . statement`.
> 
> 
> 
Added a comment to call out this limitation.
There are several ways to deal with this, e.g. we could split up the grammar into multiple rules, each with one recovery. (But the approach you describe makes sense too)

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:248
+  // A given state has [RecoveryOffset[S], RecoveryOffset[S+1]).
+  std::vector<uint32_t> RecoveryOffset;
+  std::vector<Recovery> Recoveries;
----------------
hokein wrote:
> I see the motivation of the `OffsetTable` structure now, this would come as a follow-up to simplify the `ReduceOffset` and `RecoveryOffset`, right?
Yes. Though I'm on the fence about whether it's worth it with one case (it's a bit awkward to generalize the building IIRC).

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:44
+
+void glrRecover(llvm::ArrayRef<const GSS::Node *> OldHeads,
+                unsigned &TokenIndex, const TokenStream &Tokens,
----------------
hokein wrote:
> The `GLR.cpp` file is growing significantly, I think the recovery part is large enough to be lived in a separate file `GLRRecovery.cpp`, but the declaration can still be in the `GLR.h`.
This is interesting, recover/shift/reduce/parse are (vertically) self-contained enough that it didn't seem like a big problem yet...

If the concern is file length, maybe we'd thather start with `reduce`; if it's relatedness, `GSS`?

My line count is:
 - recover: 156
 - shift: 47
 - reduce: 319
 - parse: 87
 - GSS: 66

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:68
+    // one arbitrarily.
+    std::vector<const ForestNode *> Path;
+  };
----------------
hokein wrote:
> nit: maybe name it `Parses` or `PartialParses`. Path make me think this is a patch of GSS nodes.
Renamed to DiscardedParse, does that work for you?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:83
+  //    expr := type ( . expr )   - (up) we should recover this outer expr
+  //    expr := . type ( expr )   - (up+left) we should not recover this type!
+  //
----------------
hokein wrote:
> I think you're right -- I thought the first GSS node with a recovery state we encounter during the Walkup  state is the node we want. 
> 
> This example is a little confusing (it still matches our previous mental model), I didn't get it until we discussed offline. I think the following example is clearer 
> 
> ```
> parsing the text `if (true) else ?`
> 
> IfStmt := if (stmt) else . stmt - which we're currently parsing
> IfStmt := if (.stmt) else stmt  - (left) the first recovery GSS node, should not recover from this
> IfStmt := . if (stmt) else stmt - (up), we should recover from this
> ```
> 
> I also think it is worth to add it into the test.
Fair enough, a couple of problems with that example though:
 - dropping the "then" clause from the grammar of an if statement is confusing (but adding it back in makes the example complicated)
 - using a non-kernel item for the last fails to show the "up" edge clearly (but there's not enough context to show a kernel item)

Came up with another one that hopefully works.

I didn't manage to write a reasonable test showing it - it basically can't happen with bracket recovery. In order to demonstrate it we need `{` in the "wrong" recovery construct to be paired with a `}` after the cursor... which makes it look *very* much like a correct recovery.

Added a FIXME to add a testcase once we can define new recovery strategies instead.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:88
+  // For now, we have to take this into account when defining recovery rules.
+  // FIXME: find a more satisfying way to avoid such false recovery.
+  std::vector<const ForestNode *> Path;
----------------
hokein wrote:
> I can't think of a better solution other than a search-based one (would like to think more about it).
> 
> Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing), the DFS will give us the best recovery options, and then we build the GSS node, and forest node.  But up to you.
I'm not sure what you mean by a search-based one, can you elaborate?

> Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing)

This replaces a relatively simple traversal of GSS nodes (which there aren't that many of) with running the recovery strategy more times - this is (usually) a more complicated algorithm running over tokens, which is (usually) a larger data set. It seems likely to be a bad trade performance-wise.

In any case, performance isn't terribly critical here, and mixing the discovery & evaluation of options  seems harder to read & debug (we lose the nice documented data structures we can debug, predictable -debug dumping of not-taken recovery options, etc).

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:111
+  };
+  for (auto *N : llvm::reverse(OldHeads))
+    DFS(N, TokenIndex, DFS);
----------------
hokein wrote:
> any particular reason why we iterate the OldHeads in a reversed order?
Not that I can remember, and the unit tests don't seem to care, so removed

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:130
+
+  assert(NewHeads.empty()); // We may repeatedly populate and clear it.
+  llvm::Optional<Token::Range> RecoveryRange;
----------------
hokein wrote:
> nit: might be clearer to move the assertion to the beginning of the function.
The purpose of the assertion is to make it obvious that NewHeads.clear() only discards items added during the loop below.
An assertion at the top of the function would be less useful for this, both because it's not on the screen when reading NewHeads.clear() and because it's much less locally obvious that nothing has been pushed onto NewHeads at some prior point in the function.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:174
+                            << " strategies\n");
+    ++TokenIndex;
+  }
----------------
hokein wrote:
> Advancing the TokenIndex here seems a little surprising and doesn't match what the comment says (` On failure, NewHeads is empty and TokenIndex is unchanged.`). I think this should be done in caller side.
Oops, this shouldn't be done at all - originally I had a different contract for glrRecover().

(In fact, if NewHeads is empty after glrRecover() we never look at TokenIndex again, but let's not rely on that)

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:150
+        NewHeads.clear();
+    }
+    // Create recovery nodes and heads for them in the GSS. These may be
----------------
hokein wrote:
> The Line135-Line150 code looks like a good candidate for  an `evaluateOption` function. 
What would the signature of such a function look like?
There are many different possible outcomes:
 - replace the set with this option (BestOptions.clear)
 - add this option to the set (BestOptions.push_back)
 - discard this option (continue)
 - discard this option and all future options (break)
 - update recovery range or not

Access to control flow (break/continue) and read/write access to RecoveryRange and BestOptions seem like the natural way to express this, so it's not clear what a function could abstract away.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:590
+    if (NextHeads.empty()) {
+      glrRecover(Heads, I, Tokens, Params, NextHeads);
+      if (NextHeads.empty())
----------------
hokein wrote:
> currently, it is fine for bracket-based recovery. In the future, we will need to run a force reduction for `Heads` regardless of the lookahead token before running the glrRecover, add a FIXME?
Done. This impacts the internal structure of the recovered node much more than it does which recoveries are available, so I think we can defer this for a fair while.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:591
+      glrRecover(Heads, I, Tokens, Params, NextHeads);
+      if (NextHeads.empty())
+        // FIXME: Ensure the `_ := start-symbol` rules have a fallback
----------------
hokein wrote:
> If the glrRecover returns a bool indicating whether we're able to recover, then the code is simpler:
> 
> ```
> if (NextHeads.empty() && !glrRecover(...))
>   return Params.Forest.createOpaque(...);
> ```
> 
I had this initially, but:
a) it's redundant, and makes it less locally obvious that we've repopulated NewHeads
b) soon recovery will always succeed, so !NewHeads.empty() becomes an assertion

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:382
+  auto *Question1 = &Arena.createTerminal(tok::question, 1);
+  const auto *Root = GSStack.addNode(0, nullptr, {});
+  const auto *OpenedBraces = GSStack.addNode(1, LBrace, {Root});
----------------
hokein wrote:
> nit: for GSS nodes, I will name them like `GSSNode<StateID>`, it is less describable, but it can be easily distinguished from other forest nodes, and match the GSS described in the comment, which I think it is clearer.
Yeah, I found those testcases difficult to read :-(
I did work out that the numbers might be state numbers rather than node numbers, but I found it hard to a) remember that, since it's often GSSNode0, GSSNode1 etc and b) remember which state is which.
It doesn't generalize, e.g. the RecoverRightmost case has 3 nodes with the same state.

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:425
+  // Innermost brace is unmatched, to test fallback to next brace.
+  TokenStream Tokens = cook(lex("{ { { ? ? } }", LOptions), LOptions);
+  Tokens.tokens()[0].Pair = 5;
----------------
hokein wrote:
> I suppose there is an extra `?` in the string literal, the index of the brackets (`4`, `5`) used below doesn't match the string literal here.
Oops, yes!
Originally this testcase was meant to test more things, and apparently I didn't finish simplifying it...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128486/new/

https://reviews.llvm.org/D128486