[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Feb 3 07:31:48 PST 2022


sammccall added a comment.

OK, I think I understand it now! Sorry for so many comments, happy to chat offline about which ones to address/where to start.



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+    enum class Kind : uint8_t {
----------------
API polish: this class is a mix of accessors and direct member state - it can be hard to know how to best interact with such a thing.

I'd suggest:
 - public factories `static Action Action::reduce(RuleID)` etc, private data & constructor
 - expose `Kind kind()`, but not the individual is() methods: they don't buy much and this pattern encourages switch where appropriate
 - make `Kind` enum rather than `enum class` - `Action::Shift` is sufficiently scoped already
 - shift() & goTo() -> `targetState()` which asserts either condition. Apart from anything else, this makes naming easier: "shift" is a verb so `shift()` is a confusing name.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:39
+public:
+  struct Action {
+    enum class Kind : uint8_t {
----------------
sammccall wrote:
> API polish: this class is a mix of accessors and direct member state - it can be hard to know how to best interact with such a thing.
> 
> I'd suggest:
>  - public factories `static Action Action::reduce(RuleID)` etc, private data & constructor
>  - expose `Kind kind()`, but not the individual is() methods: they don't buy much and this pattern encourages switch where appropriate
>  - make `Kind` enum rather than `enum class` - `Action::Shift` is sufficiently scoped already
>  - shift() & goTo() -> `targetState()` which asserts either condition. Apart from anything else, this makes naming easier: "shift" is a verb so `shift()` is a confusing name.
mention that this combines the Action and Go-to tables from LR literature?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:41
+    enum class Kind : uint8_t {
+      // Action table
+      Error = 0,
----------------
I think it's worth documenting what each of these actions means.

Referring to enum values as "action table" and "goto table" seems confusing - they're part of the same table! The main reference point when reading the code needs to be the code, rather than the textbook.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+
+    bool operator==(const Action &L) const {
+      if (K != L.K)
----------------
if you make the data members private the union could just be a uint16 and save yourself some trouble here :-) Up to you


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:84
+
+    Kind K = Kind::Error;
+    // Value
----------------
This action struct can be squeezed to 16 bits.

Kind only needs 3 bits (I suspect it can be 2: Error and Accept aren't heavily used).
RuleID is 12 bits, StateID would fit within 11 and 12 should be safe.

Combined with the optimization suggested below, this should get the whole table down to  335KB.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:102
+  // Returns the transisted state for the state From on a nonterminal.
+  StateID goTo(StateID From, SymbolID NonTerminal) const {
+    assert(pseudo::isNonterminal(NonTerminal));
----------------
nit: please be consistent with `Nonterminal` (and "nonterminal") vs `NonTerminal` (and "non-terminal"). I think `Nonterminal` is used elsewhere in the code.


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:105
+    auto Goto = find({From, NonTerminal});
+    assert(Goto.size() == 1 && Goto.front().isGoTo());
+    return Goto.front().goTo();
----------------
Why is this assertion valid? Is it a precondition of the function that some item in From can advance over NonTerminal?


================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:121
+private:
+  struct IndexKey {
+    StateID From;
----------------
I think there's an even more compact representation:
Basically sort the index keys as symbol-major, but then don't store the actual symbol there but rather store the range for the symbol in an external array:

```
// index is lookahead tok::TokenKind. Values are indices into Index: the range for tok is StateIdx[TokenIdx[tok]...TokenIdx[tok+1]]
vector<unsigned> TokenIdx;

// index is nonterminal SymbolID. Values are  indices into Index: the range for sym is StateIdx[NontermIdx[sym]...NontermIdx[sym+1]].
vector<unsigned> NontermIdx;

// parallel to actions, value is the from state. Only subranges are sorted.
vector<StateID> StateIdx;

vector<Action> Actions;
```

This should be 4 * (392 + 245) + (2 + 4) * 83k ~= 500k. And the one extra indirection should save 5-10 steps of binary search.

This is equivalent to 2 * `vector<unsigned>` + `vector<pair<StateID, Action>>` but keeping the searched index compact is probably a good thing.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:74
+      Item NewItem(RID, G.lookupRule(RID).size(), /*dot*/ 0);
+      if (IS.insert(NewItem).second) // new
+        ProcessingItems.push_back(std::move(NewItem));
----------------
As far as I can tell, this is the only place where we need to deduplicate/check for membership in the set.

WDYT about using a local DenseSet<Item> in `closure()` itself for this, and making the representation of ItemSet a (small)vector?

(IIUC the reason for std::set over DenseMap is determinism. Using vector should preserve that: if you really want sorting you can sort after computing closure, but if we just want a well-defined order to ensure stable state numbers then insertion order should work fine)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:109
+      auto Hash = hash_code(S);
+      auto It = StatesIndex.find(Hash);
+      if (It != StatesIndex.end())
----------------
using hashcode instead of the key itself should be explained.
(If it's just an issue with hash of std::set, I think using std::vector fixes it)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:157
+        continue;
+      auto Insert = Builder.insert(NextState);
+      if (Insert.second) // new state, insert to the processing.
----------------
sammccall wrote:
> Hmm, and advance() is computing closures from scratch every time, probably just to find "oh, the state already existed". Seems like we could save work by using the kernel set to represent the full set until we actually need to iterate over its items.
> 
> I'd like to think about this some more.
Yeah, I think this should work. Builder would maintain a map from (sorted kernel items) => stateID, and Builder.insert() would call closure() only if actually inserting.


================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:46
     llvm::errs() << llvm::formatv("grammar file {0} is parsed successfully\n", CheckGrammar);
+    llvm::errs()
+        << clang::syntax::pseudo::LRTable::build(*RSpecs).dumpStatistics();
----------------
I think that we should also have flags to:
 - dump the grammar
 - dump the LR table


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196



More information about the cfe-commits mailing list