[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Mon Feb 7 11:39:58 PST 2022

hokein marked 11 inline comments as done.
hokein added a comment.

comments around LRAutomaton should be addressed in the separate patch, https://reviews.llvm.org/D119172.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:32
+public:
+  Item(RuleID ID, uint8_t RuleLength, uint8_t DotPos)
+      : RID(ID), RuleLength(RuleLength), DotPos(DotPos) {}
----------------
sammccall wrote:
> accepting the grammar as a parameter rather than the rule length would let us ensure the length is correct locally rather than relying on the caller.
> Essentially all the callers need to do a lookup anyway.
right, but we still need this one, for constructing an empty/Tombstone items for DenseMapInfo.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:66
+
+using ItemSet = std::set<Item>;
+// A state of the LR automaton is an item set, which contains all possible rules
----------------
sammccall wrote:
> why is std::set used here? (generally a data structure to avoid)
the main reason to we want a deterministic order for computing the hash value. Changed to a sorted vector.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRAutomaton.h:79
+
+// A deterministic finite state automaton for LR parsing.
+//
----------------
sammccall wrote:
> Hmm, isn't the point of GLR that it's a nondeterministic automaton?
> 
> We've done the LR-style NFA->DFA conversion, but it's not complete (when we hit a conflict we continue rather than giving up)
yes and no. There are a few conceptually difference here.

GLR is indeed a nondeterministic parser, but I think the term deterministic/nondeterministic in parser is different than the ones used in automaton context. 
A deterministic parser (typical LR) has a property that there will always be only one possible action to choose from, thus it is fast (linear), and it produces only one syntax tree;
A deterministic automaton is described [here](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton).

> We've done the LR-style NFA->DFA conversion, but it's not complete (when we hit a conflict we continue rather than giving up)

Not exactly, there is no NFA->DFA conversion here. What this patch does:
1. we constructs a deterministic LR(0) automaton (DFA) directly, this is what LRAutomaton represents;
2. based on the DFA, we construct an LR table (called Action and GoTo Table in standard LR). In particular, we use the FollowSet to determine reduce actions, it is called SLR(1) table which is powerful the the LR(0) table.

A conflict in the LRTable doesn't imply an incomplete DFA. For an arbitrary grammar (no matter it is ambiguous or not), we can always construct a DFA. For example, based on the same LR(0) automaton, we could build a SLR(1) table (without conflicts) or LR(0) table (with conflicts). 
The real problem lies in the interpretation of states of the automaton, considering a node in the automaton with state 0
```
E := T. + E
E := T.
```
and the node as a "+" outer edge, if the next symbol is `+`, we have two options "reduce E := T" or shift `+`. SLR can tell reduction is not available `+` is not in the FOLLOW(E) (thus no conflict) while LR(0) will accept both (thus conflicts). 

I added some bits in the file comment of the LRGraph.h, hope it can clarify these concepts.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:109
+      auto Hash = hash_code(S);
+      auto It = StatesIndex.find(Hash);
+      if (It != StatesIndex.end())
----------------
sammccall wrote:
> using hashcode instead of the key itself should be explained.
> (If it's just an issue with hash of std::set, I think using std::vector fixes it)
using the ItemSet as key seems heavy (for cxx.grammar, 65KB for `llvm::DenseMap<itemset, ..>` vs 32KB for `llvm::DenseMap<hash_code>`) and unnecessary (we don't need to access the Itemset).

Added comments.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRBuilder.cpp:144
+           "augmented symbol rule must be _ := start_symbol.");
+    auto StartState = closure({{{RID, Rule.size(), 0}}}, G);
+    auto Result = Builder.insert(std::move(StartState));
----------------
sammccall wrote:
> nit: too many braces - please spell out some of the types
> (and Auto should probably be State here?)
> too many braces - please spell out some of the types
this is one of cons of using inlay-hints :(

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196