[PATCH] D128486: [pseudo] Add error-recovery framework & brace-based recovery

Wed Jun 29 03:43:42 PDT 2022

hokein added a comment.

the patch looks a good start to me, some initial comments (mostly around the recovery part).

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/GLR.h:150
+// OldHeads is the parse state at TokenIndex.
+// This function consumes consumes zero or more tokens (advancing TokenIndex),
+// and places any recovery states created in NewHeads.
----------------
If I read it correctly, consuming zero token is consider failure of the function, right?

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/Grammar.h:134
+  uint8_t RecoveryIndex = -1;
+  RecoveryStrategy Recovery = RecoveryStrategy::None;
+
----------------
So each rule will have at most 1 recovery-strategy (it is fine for the initial version), but I think in the future we want more (probably we need to change the Sequence to an array of `{SymbolID, RevoeryStrategy}`).

```
selection-statement := IF CONSTEXPR_opt ( init-statement_opt condition ) statement ELSE statement
```
We might want different recoveries in `( . init-statement_opt condition )` `(init-statement_opt condition) . statement`, `ELSE . statement`.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:248
+  // A given state has [RecoveryOffset[S], RecoveryOffset[S+1]).
+  std::vector<uint32_t> RecoveryOffset;
+  std::vector<Recovery> Recoveries;
----------------
I see the motivation of the `OffsetTable` structure now, this would come as a follow-up to simplify the `ReduceOffset` and `RecoveryOffset`, right?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:29
+
+llvm::Optional<unsigned>
+findRecoveryEndpoint(RecoveryStrategy Strategy,
----------------
nit: unsigned => `Token::Index`

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:44
+
+void glrRecover(llvm::ArrayRef<const GSS::Node *> OldHeads,
+                unsigned &TokenIndex, const TokenStream &Tokens,
----------------
The `GLR.cpp` file is growing significantly, I think the recovery part is large enough to be lived in a separate file `GLRRecovery.cpp`, but the declaration can still be in the `GLR.h`.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:68
+    // one arbitrarily.
+    std::vector<const ForestNode *> Path;
+  };
----------------
nit: maybe name it `Parses` or `PartialParses`. Path make me think this is a patch of GSS nodes.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:83
+  //    expr := type ( . expr )   - (up) we should recover this outer expr
+  //    expr := . type ( expr )   - (up+left) we should not recover this type!
+  //
----------------
I think you're right -- I thought the first GSS node with a recovery state we encounter during the Walkup  state is the node we want. 

This example is a little confusing (it still matches our previous mental model), I didn't get it until we discussed offline. I think the following example is clearer 

```
parsing the text `if (true) else ?`

IfStmt := if (stmt) else . stmt - which we're currently parsing
IfStmt := if (.stmt) else stmt  - (left) the first recovery GSS node, should not recover from this
IfStmt := . if (stmt) else stmt - (up), we should recover from this
```

I also think it is worth to add it into the test.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:88
+  // For now, we have to take this into account when defining recovery rules.
+  // FIXME: find a more satisfying way to avoid such false recovery.
+  std::vector<const ForestNode *> Path;
----------------
I can't think of a better solution other than a search-based one (would like to think more about it).

Currently, we find all recovery options by traversing the whole GSS, and do a post-filter (based on the Start, and End). I might do both in the DFS (which will save some cost of traversing), the DFS will give us the best recovery options, and then we build the GSS node, and forest node.  But up to you.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:111
+  };
+  for (auto *N : llvm::reverse(OldHeads))
+    DFS(N, TokenIndex, DFS);
----------------
any particular reason why we iterate the OldHeads in a reversed order?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:130
+
+  assert(NewHeads.empty()); // We may repeatedly populate and clear it.
+  llvm::Optional<Token::Range> RecoveryRange;
----------------
nit: might be clearer to move the assertion to the beginning of the function.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:174
+                            << " strategies\n");
+    ++TokenIndex;
+  }
----------------
Advancing the TokenIndex here seems a little surprising and doesn't match what the comment says (` On failure, NewHeads is empty and TokenIndex is unchanged.`). I think this should be done in caller side.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:64
+    // These represent the partial parse that is being discarded.
+    // They should become the children of the opaque recovery node.
+    //
----------------
this is not implemented, right? Add a FIXME?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:91
+  llvm::DenseSet<const GSS::Node *> Seen;
+  auto DFS = [&](const GSS::Node *N, Token::Index NextTok, auto &DFS) {
+    if (!Seen.insert(N).second)
----------------
nit: Walkup seems a bit clearer than DFS. 

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:133
+  for (const PlaceholderRecovery &Option : Options) {
+    // If this starts further right than options we've already found, then
+    // we'll never find anything better. Skip computing End for the rest.
----------------
`further right` should be `further left`, right?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:149
+      if (RecoveryRange->End > *End)
+        NewHeads.clear();
+    }
----------------
If we find a better recovery option, we're throwing all the newly-built heads and forest nodes, this seems wasteful. I think we can avoid it by first finding the best solutions and creating new heads and gss nodes for them.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:150
+        NewHeads.clear();
+    }
+    // Create recovery nodes and heads for them in the GSS. These may be
----------------
The Line135-Line150 code looks like a good candidate for  an `evaluateOption` function. 

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:590
+    if (NextHeads.empty()) {
+      glrRecover(Heads, I, Tokens, Params, NextHeads);
+      if (NextHeads.empty())
----------------
currently, it is fine for bracket-based recovery. In the future, we will need to run a force reduction for `Heads` regardless of the lookahead token before running the glrRecover, add a FIXME?

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:591
+      glrRecover(Heads, I, Tokens, Params, NextHeads);
+      if (NextHeads.empty())
+        // FIXME: Ensure the `_ := start-symbol` rules have a fallback
----------------
If the glrRecover returns a bool indicating whether we're able to recover, then the code is simpler:

```
if (NextHeads.empty() && !glrRecover(...))
  return Params.Forest.createOpaque(...);
```

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:382
+  auto *Question1 = &Arena.createTerminal(tok::question, 1);
+  const auto *Root = GSStack.addNode(0, nullptr, {});
+  const auto *OpenedBraces = GSStack.addNode(1, LBrace, {Root});
----------------
nit: for GSS nodes, I will name them like `GSSNode<StateID>`, it is less describable, but it can be easily distinguished from other forest nodes, and match the GSS described in the comment, which I think it is clearer.

================
Comment at: clang-tools-extra/pseudo/unittests/GLRTest.cpp:425
+  // Innermost brace is unmatched, to test fallback to next brace.
+  TokenStream Tokens = cook(lex("{ { { ? ? } }", LOptions), LOptions);
+  Tokens.tokens()[0].Pair = 5;
----------------
I suppose there is an extra `?` in the string literal, the index of the brackets (`4`, `5`) used below doesn't match the string literal here.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128486/new/

https://reviews.llvm.org/D128486