[PATCH] D118196: [syntax][pseudo] Implement LR parsing table.

Fri Feb 18 03:35:50 PST 2022

hokein added inline comments.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:177
 };
+void initTerminals(std::vector<std::string> &Terminals);

----------------
sammccall wrote:
> why is this exposed/required rather than being initialized by the GrammarTable constructor?
> Since this is essentially static (must always correspond to tok::TokenKind) it seems that GrammarTable could just have an ArrayRef and it could be initialized by a lazy-init singleton:
> 
> ```
> // in Grammar.cpp
> static ArrayRef<std::string> getTerminalNames() {
>   static std::vector<std::string> *TerminalNames = []{
> 
>   };
>   return *TerminalNames;
> }
> ```
> 
> (I know eventually we'd like GrammarTable to be generated and have very minimal dynamic init, but there are lots of other details needed e.g. we can't statically initialize `vector<Rule>` either, so we should cross that bridge when we come to it)
fair enough, let's hide it in a GrammarTable contructor.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:68
+      // Terminal actions, corresponding to entries of ACTION table.
+      Error = 0,
+      // Shift to state n: move forward with the lookahead, and push state n
----------------
sammccall wrote:
> Error seems like an action we'd dynamically handle, rather than an unreachable sentinel.
> 
> I'd prefer to call this sentinel and llvm_unreachable() on it in the relevant places. Even if we do plan dynamic error actions, we have enough spare bits to keep these two cases separate.
changed to Sentinel, and remove Error action for now.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:80
+      // Nonterminal actions, corresponding to entry of GOTO table.
+      // Go to state n: pust state n onto the state stack.
+      // Similar to Shift, but it is a nonterminal forward transition.
----------------
sammccall wrote:
> sammccall wrote:
> > again blank line between "nonterminal actions" and "go to state n"
> pust -> push?
> 
> I thought goto replaces the top of the stack rather than being pushed onto it.
> 
> e.g.
> stack is { stmt := . expr ; }, lookahead is IDENT, action is shift "expr := IDENT ."
> stack is { stmt := . expr ; | expr := IDENT . }, lookahead is semi, action is reduce
> stack is { stmt := . expr ; }, reduced symbol is expr, goto is state "stmt := expr . ;"
> stack is { stmt := expr . ;}, lookahead is semi...
> 
> Line 3=>4 doesn't seem like a push
>  
> 
yeah, this is mostly right -- stack in step 2 has two frames, rather than a combined one.

The truth is:

1. stack is [{ _ := . stmt | stmt := . expr ; }], lookahead is IDENT, action is shift "expr := IDENT ." (push a new state to the stack)
2. stack is [{ _ := . stmt | stmt := . expr ; }, { expr := IDENT .  }], lookahead is semi, action is reduce
3. stack is [{ _ := . stmt | stmt := . expr ; }], reduced symbol is expr, goto is state "stmt := expr . ;"
4. stack is [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}], lookahead is semi, action is shift "stmt := expr ; ."
5. stack is  [{ _ := . stmt | stmt := . expr ;}, { stmt := expr . ;}, { stmt := expr ; .}], lookahead is eof, action is reduce "stmt := expr ;"
6. stack is [{ _ := . stmt  | stmt := . expr ;}], reduced symbol is stmt, goto state is "_ := stmt ."
7. stack is [{ _ := . stmt | stmt := . expr ;}, { _ := stmt .}], lookahead is eof, action is accept, the parsing is done

Step 3 => 4 is a push. 

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:106
+    bool operator==(const Action &L) const { return value() == L.value(); }
+    uint16_t value() const { return K << ValueBits | Value; };
+
----------------
sammccall wrote:
> maybe value()=>opaque() or asInteger()?
> 
> Partly to give a hint a bit more at the purpose, partly to avoid confusion of `Value != value()`
renamed to `opaque`.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/LRTable.h:136
+
+  class Builder {
+  public:
----------------
sammccall wrote:
> This is quite a lot of surface (together with the DenseMapInfo stuff) to expose.
> Currently it looks like it needs to be in the header so that the LRTableTest can construct a table directly rather than going through LRGraph.
> 
> I think you could expose a narrower interface:
> ```
> struct Entry { ;... };
> // Specify exact table contents for testing.
> static LRTable buildForTests(ArrayRef<Entry>);
> ```
> 
> Then the builder/densemapinfo can be moved to the cpp file, and both `buildForTests` and `buildSLR` can use the builder internally. WDYT?
Yeah, I exposed the Builder mainly for testing purposes. The new interface looks good to me. 

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:97
+  auto Result = find(State, Nonterminal);
+  assert(Result.size() == 1 && Result.front().kind() == Action::GoTo);
+  return Result.front().getGoToState();
----------------
sammccall wrote:
> (moving the comment thread to the new code location)
> 
> The DFA input guarantees Result.size() <= 1, but why can't it be zero?
> If this is a requirement on the caller, mention it in the header?
> 
yeah, getGoToState and getActions are expected to be called by the GLR parser, the parser itself should guarantee it during parsing. Added comments.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTable.cpp:165
+  llvm::ArrayRef<Entry> SortedEntries = Sorted;
+  while (!SortedEntries.empty()) {
+    // We iterate each symbol per time (conceptually a column in the matrix per
----------------
sammccall wrote:
> We end up with holes because we're looping over the values rather than the keys. This also feels quite indirect.
> 
> Seems clear enough to reverse this, something like:
> 
> ```
> unsigned Pos = 0;
> for (unsigned NT = 0; TK < GT.Nonterminals.size(); ++I) {
>   NontermIdx[NT] = Pos;
>   while (Pos < Sorted.size() && Sorted[Pos].Action == NT)
>     ++Pos;
> }
> NontermIdx.back() = Pos;
> // and the same for terminals
> ```
yeah, this is much better!

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:19
+LRTable LRTable::buildSLR(const Grammar &G) {
+  Builder Build;
+  auto Graph = LRGraph::buildLR0(G);
----------------
sammccall wrote:
> (I think it would be worth moving this into LRTable in order to hide the Builder type from the header...)
I tend to move all build bits (Builder, BuildSLRTable, BuildFor tests) to this file rather than `LRTable.cpp`. 

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/LRTableBuild.cpp:31
+      if (G.lookupRule(I.rule()).Target == G.startSymbol() && !I.hasNext()) {
+        Build.insert(SID, tokenSymbol(tok::eof), Action::accept());
+        continue;
----------------
sammccall wrote:
> (rephrasing a comment that got lost earlier in the review)
> 
> I'm not totally clear on the scope/purpose of the accept action:
>  - if it's meant to be sufficient to know whether the parse succeeded, I think it needs the symbol we accepted as a payload. The grammar will accept them all, but the caller will place restrictions.
>  - if we're OK inspecting the stack and seeing our last reduction was a Decl or so, why doens't the parser just do that at the end of the stream and do away with `Accept` altogether? (This would avoid the need to splice `eof` tokens into token streams to parse subranges of them).
The accept action is following what is described in the standard LR literature.

Yeah, the main purpose of the accept action is to tell us whether a parse succeeded.
We could add a ruleID as a payload for accept action, but I'm not sure whether it would be useful, the associated rule is about `_` (e.g. `_ := translation_unit`) --  we are interested in `translation_unit`, this can be retrieved from the forest.

I think the both options are mostly equivalent (the 1st option is a standard LR implementation, the 2rd one seems more add-hoc)-- we can treat accept action as a "special" reduce action of the rule "_ := translation_unit" except that we don't do the reduce, we just stop parsing.

I think we probably will revisit later -- things become tricker, if we start supporting snippets (the start symbol `_` will have multiple rules), e.g. for the follow grammar,

```
_ := stmt
_ := expr
stmt := expr
expr := ID
```
the input "ID" can be parsed as a "stmt" or an "expr", if we use a unified state `{ _ := . stmt | _ := . expr }` to start, we will have a new-type conflict (accept/reduce). I don't think we want to handle this kind of new conflicts. There are some options:
- careful manage the `_` rules, to make sure no such conflicts happened;
- introduce a new augmented grammar `__ := _` (the accept/reduce would become reduce/reduce conflict);
- use separate start state per `_` rule, and callers need to pass a targeting non-terminal to the parser;

Since there are some uncertain bits here, I'd prefer not doing further changes in this patch (but happy to add a ruleID payload for accept action). WDYT?

================
Comment at: clang/unittests/Tooling/Syntax/Pseudo/LRGraphTest.cpp:20

 TEST(LRGraph, Build) {
   struct TestCase {
----------------
sammccall wrote:
> Seems sensible to combine the table-building tests in here, but it does make the organization of the tests hard.
> (Particularly since LRTableTests.cpp exists but the most important tests are in this file instead).
> 
> Since these cross module boundaries, and they are exactly "bundle of text in, bundle of text out"...
> how do you feel about making them lit tests instead?
> 
> ```
> # RUN: clang-pseudo -grammar %s -print-graph | FileCheck --check-prefix=GRAPH
> # RUN: clang-pseudo -grammar %s -print-table | FileCheck --check-prefix=TABLE
> _ := expr
> ...
> #      GRAPH: States:
> # GRAPH-NEXT: ...
> #      TABLE: LRTable:
> # TABLE-NEXT: ...
> ```
I don't have strong opinion about these, both of them work. I think having options in the `clang-pseudo` tool to dump graph/table is a helpful feature for debugging as well, using lit tests might make more sense.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D118196/new/

https://reviews.llvm.org/D118196