[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Fri Feb 11 09:33:13 PST 2022

sammccall added a comment.

This is really nice, mostly nits.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:177
 };
+void initTerminals(std::vector<std::string> &Terminals);

----------------
why is this exposed/required rather than being initialized by the GrammarTable constructor?
Since this is essentially static (must always correspond to tok::TokenKind) it seems that GrammarTable could just have an ArrayRef and it could be initialized by a lazy-init singleton:

```
// in Grammar.cpp
static ArrayRef<std::string> getTerminalNames() {
  static std::vector<std::string> *TerminalNames = []{

  };
  return *TerminalNames;
}
```

(I know eventually we'd like GrammarTable to be generated and have very minimal dynamic init, but there are lots of other details needed e.g. we can't statically initialize `vector<Rule>` either, so we should cross that bridge when we come to it)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:28
+//  LRTable is *performance-critial* as it is consulted frequently during a
+//  parse. In general, LRTable is very sprase (most of the entries are empty).
+//  For example, for the C++ language, the SLR table has ~1500 states and 650
----------------
space -> sparse

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:58
+// Unlike A typical LR parsing table allows at most one available action per
+// entry, conflicted actions are allowed in LRTable.
+class LRTable {
----------------
Maybe mention relation to GLR: GLR can execute these in parallel?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:67
+    enum Kind : uint8_t {
+      // Terminal actions, corresponding to entries of ACTION table.
+      Error = 0,
----------------
nit: I think you want this to be a comment on the section rather than on Error, leave a blank line?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:68
+      // Terminal actions, corresponding to entries of ACTION table.
+      Error = 0,
+      // Shift to state n: move forward with the lookahead, and push state n
----------------
Error seems like an action we'd dynamically handle, rather than an unreachable sentinel.

I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant places. Even if we do plan dynamic error actions, we have enough spare bits to keep these two cases separate.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+      // Nonterminal actions, corresponding to entry of GOTO table.
+      // Go to state n: pust state n onto the state stack.
+      // Similar to Shift, but it is a nonterminal forward transition.
----------------
again blank line between "nonterminal actions" and "go to state n"

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+      // Nonterminal actions, corresponding to entry of GOTO table.
+      // Go to state n: pust state n onto the state stack.
+      // Similar to Shift, but it is a nonterminal forward transition.
----------------
sammccall wrote:
> again blank line between "nonterminal actions" and "go to state n"
pust -> push?

I thought goto replaces the top of the stack rather than being pushed onto it.

e.g.
stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := IDENT ."
stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is reduce
stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := expr . ;"
stack is { stmt := expr . ;}, lookahead is semi...

Line 3=>4 doesn't seem like a push

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:106
+    bool operator==(const Action &L) const { return value() == L.value(); }
+    uint16_t value() const { return K << ValueBits | Value; };
+
----------------
maybe value()=>opaque() or asInteger()?

Partly to give a hint a bit more at the purpose, partly to avoid confusion of `Value != value()`

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+    // Either StateID or RuleID, depending on the Kind.
+    uint16_t Value : ValueBits;
+  };
----------------
static assert RuleBits <= ValueBits, and I think we want a StateBits above plus some assertions on the number of states when building

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:114
+    // Either StateID or RuleID, depending on the Kind.
+    uint16_t Value : ValueBits;
+  };
----------------
sammccall wrote:
> static assert RuleBits <= ValueBits, and I think we want a StateBits above plus some assertions on the number of states when building
FWIW I've mostly seen just `unsigned` as the field type for bitfields. It seems a little confusing to explicitly specify the size both as 16 and 13 bits.

Up to you, but if you change this one also consider the Kind enum.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:136
+
+  class Builder {
+  public:
----------------
This is quite a lot of surface (together with the DenseMapInfo stuff) to expose.
Currently it looks like it needs to be in the header so that the LRTableTest can construct a table directly rather than going through LRGraph.

I think you could expose a narrower interface:
```
struct Entry { ;... };
// Specify exact table contents for testing.
static LRTable buildForTests(ArrayRef<Entry>);
```

Then the builder/densemapinfo can be moved to the cpp file, and both `buildForTests` and `buildSLR` can use the builder internally. WDYT?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:154
+private:
+  // Index is nonterminal SymbolID, values are indices of StateIdx.
+  // Give a non-terminal id, the corresponding half-open range of StateIdx is
----------------
maybe "value is the offset into States/Actions where the entries for this symbol begin".

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:163
+  // Grouped by the SymbolID, and only subranges are sorted.
+  std::vector<StateID> StateIdx;
+  // A flat list of available (non-error) actions.
----------------
Reading the building/lookup code, "index" and "idx" are used so often they're hard to follow.

I think calling this "States", parallel to "Actions", would be clearer.

Similarly maybe the NontermIdx -> NontermOffset (Offset into States/Actions arrays where we find entries for this nonterminal).
Then we can just use the word "index" for actual subscripts.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:165
+  // A flat list of available (non-error) actions.
+  // Sorted by (SymbolID, State), concepetually, an action is identified by a
+  // row (State) and a column (SymbolID) in a matrix.
----------------
concepetually -> conceptually.

I think mixing this comment in with the concrete data is confusing.
Maybe lift it to the top of the private section like:

```
// Conceptually this is a multimap from (State, SymbolID) => action.
// In the literature this is often a table (and multiple entries, i.e. conflicts are forbidden).
// Our physical representation is quite different for compactness.
```

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:25
+  case LRTable::Action::Error:
+    return OS << "err";
+  case LRTable::Action::Shift:
----------------
as things currently stand this should be unreachable - these values should never escape the DenseMap. assert or llvm_unreachable?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:97
+  auto Result = find(State, Nonterminal);
+  assert(Result.size() == 1 && Result.front().kind() == Action::GoTo);
+  return Result.front().getGoToState();
----------------
(moving the comment thread to the new code location)

The DFA input guarantees Result.size() <= 1, but why can't it be zero?
If this is a requirement on the caller, mention it in the header?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:162
+  // default.
+  Table.TerminalIdx.resize(GT.Terminals.size() + 1, 0);
+  Table.NontermIdx.resize(GT.Nonterminals.size() + 1, 0);
----------------
nit: "assign" seems clearer than "resize"

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:165
+  llvm::ArrayRef<Entry> SortedEntries = Sorted;
+  while (!SortedEntries.empty()) {
+    // We iterate each symbol per time (conceptually a column in the matrix per
----------------
We end up with holes because we're looping over the values rather than the keys. This also feels quite indirect.

Seems clear enough to reverse this, something like:

```
unsigned Pos = 0;
for (unsigned NT = 0; TK < GT.Nonterminals.size(); ++I) {
  NontermIdx[NT] = Pos;
  while (Pos < Sorted.size() && Sorted[Pos].Action == NT)
    ++Pos;
}
NontermIdx.back() = Pos;
// and the same for terminals
```

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:19
+LRTable LRTable::buildSLR(const Grammar &G) {
+  Builder Build;
+  auto Graph = LRGraph::buildLR0(G);
----------------
(I think it would be worth moving this into LRTable in order to hide the Builder type from the header...)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:31
+      if (G.lookupRule(I.rule()).Target == G.startSymbol() && !I.hasNext()) {
+        Build.insert(SID, tokenSymbol(tok::eof), Action::accept());
+        continue;
----------------
(rephrasing a comment that got lost earlier in the review)

I'm not totally clear on the scope/purpose of the accept action:
 - if it's meant to be sufficient to know whether the parse succeeded, I think it needs the symbol we accepted as a payload. The grammar will accept them all, but the caller will place restrictions.
 - if we're OK inspecting the stack and seeing our last reduction was a Decl or so, why doens't the parser just do that at the end of the stream and do away with `Accept` altogether? (This would avoid the need to splice `eof` tokens into token streams to parse subranges of them).

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp:20

 TEST(LRGraph, Build) {
   struct TestCase {
----------------
Seems sensible to combine the table-building tests in here, but it does make the organization of the tests hard.
(Particularly since LRTableTests.cpp exists but the most important tests are in this file instead).

Since these cross module boundaries, and they are exactly "bundle of text in, bundle of text out"...
how do you feel about making them lit tests instead?

```
# RUN: clang-pseudo -grammar %s -print-graph | FileCheck --check-prefix=GRAPH
# RUN: clang-pseudo -grammar %s -print-table | FileCheck --check-prefix=TABLE
_ := expr
...
#      GRAPH: States:
# GRAPH-NEXT: ...
#      TABLE: LRTable:
# TABLE-NEXT: ...
```

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196