[PATCH] D128472: [pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens.

Mon Jun 27 06:11:00 PDT 2022

hokein accepted this revision.
hokein added a comment.
This revision is now accepted and ready to land.

This looks a great improvement. A few nits, it looks good to me overall.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:136
+  // check whether a reduction should apply in the current context.
+  llvm::ArrayRef<RuleID> getReduceRules(StateID State) const {
+    return llvm::makeArrayRef(&Reduces[ReduceOffset[State]],
----------------
I think the API is motivated by the LR(0) experimentation, we're now decoupling the reduction look up from the lookahead token, this seems nice (LR(0) only needs `getReduceRules`, other LR(1) variant need `getReduceRules` + `canFollow`).

The API usage for SLR(1) is not quite straight forward at the first glance,  would be better to add a small snippet in the comment.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:143
+    assert(isToken(Terminal));
+    assert(!isToken(Nonterminal));
+    return FollowSets.test(tok::NUM_TOKENS * Nonterminal +
----------------
nit:  use `isNonterminal(Nonterminal)`? IMO it is clearer.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:209
+
+  // Reduces from state S: Reduces[ [ReduceOffset[S], ReduceOffset[S+1]) ]
+  std::vector<uint32_t> ReduceOffset;
----------------
nit: I think rephasing it as: `a half-open range of Reduces [ReduceOffset[S], ReduceOffset[S+1]).` is clearer -- the current syntax `Reduces[ [ReduceOffset[S], ReduceOffset[S+1]) ]` is a bit weird (My first attempt was to read it as an unmatch-brackets code snippet)

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:212
+  std::vector<RuleID> Reduces;
+  // T in Follow(NT) == FollowSets[NT * NUM_TOKENS + tok::Kind(T)]
+  llvm::BitVector FollowSets;
----------------
The NT is not quite obvious, what is NT? I think it is the symbol ID, then I don't understand the `NT * NUM_TOKENS + tok::Kind(T)` part.

================
Comment at: clang-tools-extra/pseudo/include/clang-pseudo/grammar/LRTable.h:213
+  // T in Follow(NT) == FollowSets[NT * NUM_TOKENS + tok::Kind(T)]
+  llvm::BitVector FollowSets;
 };
----------------
Oh, I understand it now after reading the build code...

NT here is the non-terminal (symbol to be reduced), T is the lookahead. And FollowSets is conceptually a flat array of a 2d bool array [Nonterminal, Tokens].

Probably add some more comments.

================
Comment at: clang-tools-extra/pseudo/lib/GLR.cpp:369
   // (Roughly this means there's no local ambiguity, so the LR algorithm works).
   bool popAndPushTrivial() {
     if (!Sequences.empty() || Heads->size() != NextPopHead + 1)
----------------
nit: add a comment about what does return value indicate.

================
Comment at: clang-tools-extra/pseudo/lib/grammar/LRTableBuild.cpp:116
+  llvm::DenseMap<StateID, llvm::SmallSet<RuleID, 4>> Reduces;
+  std::vector<llvm::DenseSet<SymbolID>> FollowSets;
+
----------------
 make these members as private -- having two public members seems to violate the pattern. (FollowSets can be initialized in the constructor, we might need a separate method for `Reduces`. or either change the Builder to a struct).

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:28
+  auto G = Grammar::parseBNF(R"bnf(
+    _ := expr
+    expr := term
----------------
nit: we use the hard-coded ruleID in the following test, I'd suggest add a trailing comment for the rule ID for each rule here, for better code readability.

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:38
+  SymbolID Eof = tokenSymbol(tok::eof);
+  SymbolID Semi = tokenSymbol(tok::semi);
+  SymbolID Int = tokenSymbol(tok::kw_int);
----------------
It would be better to change it to an `IDENTIFIER` to reflect the grammar above. 

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:39
+  SymbolID Semi = tokenSymbol(tok::semi);
+  SymbolID Int = tokenSymbol(tok::kw_int);
+  SymbolID Plus = tokenSymbol(tok::plus);
----------------
nit: I'd move this on Line67, as this token serves as an invalid token according to the grammar. 

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:42
+
+  //           eof   semi   term   reduce
+  // +-------+----+-------+------+-------
----------------
the row is usual for terminals,  and the reduce is a separate table, which seems a bit confusing -- I would probably split the reduce table out (adding an empty col between term and reduce)

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:45
+  // |state0 |    | s0    |      | r0
+  // |state1 |    |       | g3   | acc
+  // |state2 |    |       |      | r1
----------------
nit: `acc` => `R2 (acc)` (sorry, I didn't update the comment in the removing-accept patch)

================
Comment at: clang-tools-extra/pseudo/unittests/LRTableTest.cpp:53
+  std::vector<LRTable::ReduceEntry> ReduceEntries = {
+      {/* State=*/0, 0},
+      {/* State=*/1, 2},
----------------
nit: add a `/*RuleID=*/` comment annotation for the second argument. 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128472/new/

https://reviews.llvm.org/D128472