[PATCH] D128486: [pseudo] Add error-recovery framework & brace-based recovery

Tue Jul 5 07:08:41 PDT 2022

hokein accepted this revision.
hokein added a comment.
This revision is now accepted and ready to land.

Looks great, let's ship it! feel free to land it in any form you think it is suitable.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/GLR.h:150
+// OldHeads is the parse state at TokenIndex.
+// This function consumes consumes zero or more tokens by advancing TokenIndex,
+// and places any recovery states created in NewHeads.
----------------
nit: remove the duplicated `consumes`.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRGraph.h:148
+    RecoveryStrategy Strategy; // Heuristic choosing the tokens to match.
+    SymbolID Result;           // The symbol that is produced.
+  };
----------------
nit: mention the `Result` must be a nonterminal.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:131
+  // Expected to be called by LR parsers.
+  llvm::ArrayRef<Action> getActions(StateID State, SymbolID Terminal) const;
   // Returns the state after we reduce a nonterminal.
----------------
nit: unrelated method?

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:248
+  // A given state has [RecoveryOffset[S], RecoveryOffset[S+1]).
+  std::vector<uint32_t> RecoveryOffset;
+  std::vector<Recovery> Recoveries;
----------------
sammccall wrote:
> hokein wrote:
> > I see the motivation of the `OffsetTable` structure now, this would come as a follow-up to simplify the `ReduceOffset` and `RecoveryOffset`, right?
> Yes. Though I'm on the fence about whether it's worth it with one case (it's a bit awkward to generalize the building IIRC).
A motivating bit is that it is tricky to implement a correct `get` method (we both made the same out-of-bound issue). I'm fine with the current form, we can revisit it afterwards.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:178
+    const ForestNode &Placeholder =
+        Params.Forest.createOpaque(Option->Symbol, Option->Position);
+    const GSS::Node *NewHead = Params.GSStack.addNode(
----------------
should we worry about the case where we create a duplicated forest node? e.g. we have two best options and both recover to the same nonterminal.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:44
+
+void glrRecover(llvm::ArrayRef<const GSS::Node *> OldHeads,
+                unsigned &TokenIndex, const TokenStream &Tokens,
----------------
sammccall wrote:
> hokein wrote:
> > The `GLR.cpp` file is growing significantly, I think the recovery part is large enough to be lived in a separate file `GLRRecovery.cpp`, but the declaration can still be in the `GLR.h`.
> This is interesting, recover/shift/reduce/parse are (vertically) self-contained enough that it didn't seem like a big problem yet...
> 
> If the concern is file length, maybe we'd thather start with `reduce`; if it's relatedness, `GSS`?
> 
> My line count is:
>  - recover: 156
>  - shift: 47
>  - reduce: 319
>  - parse: 87
>  - GSS: 66
Yeah, indeed these pieces can be split, my main concern is not file length -- I would prefer keeping shift/reduce/parse into a single file, as they form a complete GLR algorithm (splitting them would make it harder to read and follow).

Recovery seems like a different thing, I would image this part of code will grow more in the future
- the GLR algorithm has the core recovery mechanism framework with some fallback recovery strategies (eof, brackets)
- we have different recovery strategies implemented (some of them might be cxx-specific, probably be part of pseudoCXX library);

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:68
+    // one arbitrarily.
+    std::vector<const ForestNode *> Path;
+  };
----------------
sammccall wrote:
> hokein wrote:
> > nit: maybe name it `Parses` or `PartialParses`. Path make me think this is a patch of GSS nodes.
> Renamed to DiscardedParse, does that work for you?
It is better than the original name. The `DiscardedParse` is a bit weird when we start to put it under the opaque node, in that sense, they are not discarded, IMO

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:83
+  //    expr := type ( . expr )   - (up) we should recover this outer expr
+  //    expr := . type ( expr )   - (up+left) we should not recover this type!
+  //
----------------
sammccall wrote:
> hokein wrote:
> > I think you're right -- I thought the first GSS node with a recovery state we encounter during the Walkup  state is the node we want. 
> > 
> > This example is a little confusing (it still matches our previous mental model), I didn't get it until we discussed offline. I think the following example is clearer 
> > 
> > ```
> > parsing the text `if (true) else ?`
> > 
> > IfStmt := if (stmt) else . stmt - which we're currently parsing
> > IfStmt := if (.stmt) else stmt  - (left) the first recovery GSS node, should not recover from this
> > IfStmt := . if (stmt) else stmt - (up), we should recover from this
> > ```
> > 
> > I also think it is worth to add it into the test.
> Fair enough, a couple of problems with that example though:
>  - dropping the "then" clause from the grammar of an if statement is confusing (but adding it back in makes the example complicated)
>  - using a non-kernel item for the last fails to show the "up" edge clearly (but there's not enough context to show a kernel item)
> 
> Came up with another one that hopefully works.
> 
> I didn't manage to write a reasonable test showing it - it basically can't happen with bracket recovery. In order to demonstrate it we need `{` in the "wrong" recovery construct to be paired with a `}` after the cursor... which makes it look *very* much like a correct recovery.
> 
> Added a FIXME to add a testcase once we can define new recovery strategies instead.
oops, your example makes more sense -- I didn't notice that I missed the if-body stmt.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:88
+  // For now, we have to take this into account when defining recovery rules.
+  // FIXME: find a more satisfying way to avoid such false recovery.
+  std::vector<const ForestNode *> Path;
----------------
sammccall wrote:
> hokein wrote:
> > I can't think of a better solution other than a search-based one (would like to think more about it).
> > 
> > Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing), the DFS will give us the best recovery options, and then we build the GSS node, and forest node.  But up to you.
> I'm not sure what you mean by a search-based one, can you elaborate?
> 
> > Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing)
> 
> This replaces a relatively simple traversal of GSS nodes (which there aren't that many of) with running the recovery strategy more times - this is (usually) a more complicated algorithm running over tokens, which is (usually) a larger data set. It seems likely to be a bad trade performance-wise.
> 
> In any case, performance isn't terribly critical here, and mixing the discovery & evaluation of options  seems harder to read & debug (we lose the nice documented data structures we can debug, predictable -debug dumping of not-taken recovery options, etc).
> I'm not sure what you mean by a search-based one, can you elaborate?

The search-based one refers to the current one -- basically we perform a brute-force search for all available recoveries and get the best one. 

> In any case, performance isn't terribly critical here, and mixing the discovery & evaluation of options seems harder to read & debug (we lose the nice documented data structures we can debug, predictable -debug dumping of not-taken recovery options, etc).

That's fair enough. 

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:130
+
+  assert(NewHeads.empty()); // We may repeatedly populate and clear it.
+  llvm::Optional<Token::Range> RecoveryRange;
----------------
sammccall wrote:
> hokein wrote:
> > nit: might be clearer to move the assertion to the beginning of the function.
> The purpose of the assertion is to make it obvious that NewHeads.clear() only discards items added during the loop below.
> An assertion at the top of the function would be less useful for this, both because it's not on the screen when reading NewHeads.clear() and because it's much less locally obvious that nothing has been pushed onto NewHeads at some prior point in the function.
Yeah, but I'd treat it as a contract of the API (the `NewHeads` argument passed to `glrRecover` must be empty).

btw, the empty assertion is missing in the latest version.

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:487
+
+// FIXME: Add a test for the spurious recovery mentioned in glrRecovery()
+// once we can extend the recovery strategies to do so.
----------------
nit: I'd probably move this to the comment mentioned in glrRecovery(), which is more discoverable.

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:558

+TEST_F(GLRTest, RecoveryEndToEnd) {
+  // Simple example of brace-based recovery showing:
----------------
nit: not sure the intention having the `RecoveryEndToEnd` separated from the above recover-related tests, why not grouping them together?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128486/new/

https://reviews.llvm.org/D128486