[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Wed Feb 2 09:34:31 PST 2022

sammccall added a comment.

A few initial comments just on the "easy parts" - changes to grammar and the first/follow set computation.

I need to study the rest. Generally I have a feeling some of the impl is very inefficient but efficiency may not matter much in this context, and it's unclear how much better it can be made without adding lots of complexity. Going to look more closely :-)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:128
+  // Returns all symbols.
+  std::vector<SymbolID> symbols() const;
+
----------------
This function seems like a bit of an attractive nuisance: it could be a cheap accessor, but isn't. There are just two callers - both are calling it in a loop and both seem dubious to be iterating over all symbols at all. I wonder if we can avoid it entirely.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:130
+
+  // Returns the SymbolID of the augmented symbol '_'
+  SymbolID augmentedSymbol() const { return 0; }
----------------
this "augmented symbol" name doesn't seem widely used. (search for `lr "augmented symbol"` produces few results, mostly unrelated). It's also a confusing name, because it's IIUC it's not the symbol that's augmented, but rather the grammar has been augmented with the symbol...

I'd suggest just calling it the start symbol, unless the distinction is critical (I don't yet understand why it's important - I added it in the prototype just to avoid having the language-specific name "translation-unit" hardcoded).

Otherwise it's hard to suggest a good name without understanding what it's for, but it should be descriptive...

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector<llvm::DenseSet<SymbolID>> firstSets() const;
----------------
In about the same number of words you could explain what this actually is :-)

// Computes the set of terminals that could appear at the start of each non-terminal.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:133
+
+  // Computes the FIRST(X) for all non-terminal symbol X.
+  std::vector<llvm::DenseSet<SymbolID>> firstSets() const;
----------------
sammccall wrote:
> In about the same number of words you could explain what this actually is :-)
> 
> // Computes the set of terminals that could appear at the start of each non-terminal.
is there some reason these functions need to be part of the grammar class?

They seem like part of the LR building to me, it could be local to LRBuilder.cpp, or exposed from LRTable.h or so for testing

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:66
+
+using ItemSet = std::set<Item>;
+// A state of the LR automaton is an item set, which contains all possible rules
----------------
why is std::set used here? (generally a data structure to avoid)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:75
+// precompute all the transitions between these states.
+using State = ItemSet;
+llvm::hash_code hash_code(const State &S);
----------------
I have a feeling we've discussed this before but...

I find this alias/abstraction unhelpful when reading the code, because operations on states are e.g. `state.insert` or `state.contains`, which don't match the abstraction.

I'd probably prefer just using `ItemSet`, but alternatively `struct State { ItemSet Items; }` seems like it would work well: `State.Items.contains(...)` reads very naturally.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:39

+std::vector<llvm::DenseSet<SymbolID>> Grammar::firstSets() const {
+  std::vector<llvm::DenseSet<SymbolID>> FirstSets(T->Nonterminals.size());
----------------
this function could use some comments.

```
A rule S := T ... implies elements in first(S):
 - if T is a terminal, first(S) contains T
 - if T is a nonterminal, first(S) contains first(T)
Since first(T) may not have been fully computed yet,
first(S) itself may end up being incomplete.
We iterate until a fixed point.
(This isn't particularly efficient, but table building performance isn't critical).
```

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:51
+
+  bool Changed = false;
+  do {
----------------
`bool Changed = true; while(Changed) {`? or maybe `for (bool Changed = true; Changed;) {`?

(do-while is rare enough that it always takes me a bit to grasp the control flow)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:56
+      assert(R.size() > 0);
+      Changed |= ExpandFirstSet(R.target(), R.seq().front());
+    }
----------------
comment that we only need to consider the first element because symbols are non-nullable

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:88
+    for (const auto &R : T->Rules) {
+      // Rule 2: for a rule X := ...Yz, we add all symbols from FIRST(z) to
+      // FOLLOW(Y).
----------------
nit: `... Y Z`, with consistent spaces/capitalization
Unless case is meant to imply something here?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:95
+        // First set for a non-terminal is itself.
+        auto First =
+            isToken(Next) ? llvm::DenseSet<SymbolID>{Next} : FirstSets[Next];
----------------
copying the set here seems gratuitous, especially since we're doing this in a loop where !changed is the usual thing.

I think inlining ExpandFollowSet for each case would make sense here. It turns into 1 line for the NT case plus two lines for the T case, so it's actually shorter overall

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Grammar.cpp:101
+      if (!isNonterminal(Z))
+        continue;
+      // Rule 3: for a rule X := ....Z, we add all symbols from FOLLOW(X) to
----------------
I think `if (isNonterminal(Z)) ...expand...` would be clearer here, since the condition is really part of rule 3 (but in a not-totally obvious way, it's nice to see them side-by-side)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:80
+}
+// For all items A := a.Xb in S, returns a closure itemset with a dot
+// advance of X.
----------------
nit: please be consistent about spacing & capitalization for rules

I think trying to use a convention of uppercase/lowercase for term/nonterm is difficult because there's no clear way to indicate that a symbol could be either, which is common.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:80
+}
+// For all items A := a.Xb in S, returns a closure itemset with a dot
+// advance of X.
----------------
sammccall wrote:
> nit: please be consistent about spacing & capitalization for rules
> 
> I think trying to use a convention of uppercase/lowercase for term/nonterm is difficult because there's no clear way to indicate that a symbol could be either, which is common.
I can't really understand this description, it'd be helpful if it didn't reuse the name "advance" (particularly as a noun)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:153
+    PendingStates.pop_back();
+    for (SymbolID Move : G.symbols()) {
+      State NextState = advance(Builder.find(CurrentStateID), Move, G);
----------------
this loop looks like it's iterating over every possible symbol that might appear next, rather than just the ones that do (and creating a temporary array of 1000 items each outer iteration to do it).

Seems like we could do better by instead advancing every rule, and spitting out a list of {symbol, ItemSet} pairs?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:157
+        continue;
+      auto Insert = Builder.insert(NextState);
+      if (Insert.second) // new state, insert to the processing.
----------------
Hmm, and advance() is computing closures from scratch every time, probably just to find "oh, the state already existed". Seems like we could save work by using the kernel set to represent the full set until we actually need to iterate over its items.

I'd like to think about this some more.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196