[PATCH] D121368: [pseudo][WIP] Build Ambiguous forest node in the GLR Parser.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Mar 23 07:53:00 PDT 2022


sammccall added a comment.

A bunch of comments, but probably the most important one is that the inner loop feels uncomfortably chaotic:

- there's a huge function with a lot of stuff in scope and referenced, so it's unclear what the data flow is
- there are very many (hopefully, unnecessarily many) concepts and data structures
- it's not really clear what the motivation for the concepts is; the comments mostly describe how the algorithm uses them

I have a suspicion that all the allocation and indirection limits performance too, but at the moment that's a secondary concern.

I realize the minimal complexity is probably unavoidably high here, but it'd be good if we could sketch out the data structures from first principles and try to find a way to simplify them.



================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Grammar.h:161
     std::string Name;
+    // Value of topological order of the nonterminal.
+    // For nonterminals A and B, if A := B (or transitively), then
----------------
As discussed offline, the algorithm is simplest if the RuleIDs themselves reflect the topological order in which we want to reduce them.

We require the RuleIDs to be grouped by the SymbolID of their LHS, but I don't think we have much of a requirement on SymbolIDs themselves. (Currently they're alphabetical and we binary-search for `_` in the grammar constructor, but not much else).

So I think we can just sort the symbols with this topo order when constructing the grammar.
(This is fairly self-contained, probably affects dumps from tests, could be a separate patch if you like)
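
To make it concrete, roughly something like this at grammar-construction time (a sketch only; `Names` and `TopoRank` are hypothetical stand-ins for whatever the constructor already has):

```
// Order nonterminals by a precomputed topological rank, so that RuleIDs
// (grouped by LHS) inherit the order in which we want to reduce them.
std::vector<unsigned> Permutation(Names.size());
std::iota(Permutation.begin(), Permutation.end(), 0);
llvm::stable_sort(Permutation, [&](unsigned L, unsigned R) {
  return TopoRank[L] < TopoRank[R]; // ties keep the alphabetical order
});
// Permutation[NewID] == OldIndex; symbols referenced in rules (and the
// binary-searched `_`) are then remapped to the new SymbolIDs.
```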


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:165
+//  1) Split --
+//     1.1 when a stack head has mutiple reduce actions, the head is
+//     made to split to accommodate the various possiblities.
----------------
... can be reduced using multiple rules
(avoid relying on subtleties of what the identity of a "reduce action" is)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:174
+//
+//     1.2 when a stack head has a reduce action with multiple bases, the head
+//     will be split if each base leads to different states.
----------------
base hasn't been defined. In general, you've only mentioned the special cases here, but not the basic case. (case 0?)


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:174
+//
+//     1.2 when a stack head has a reduce action with multiple bases, the head
+//     will be split if each base leads to different states.
----------------
sammccall wrote:
> base hasn't been defined. In general, you've only mentioned the special cases here, but not the basic case. (case 0?)
this is confusing because below you define "ReduceAction" to be specific to a base.

I'd say "when a stack head can be reduced by a rule, but there are multiple possible bases for the reduction (due to a previous merge), ..."


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:203
 
-  // Reduce can manipulate the GSS in following way:
-  //
-  //  1) Split --
-  //     1.1 when a stack head has mutiple reduce actions, the head is
-  //     made to split to accommodate the various possiblities.
-  //     E.g.
-  //       0 -> 1 (ID)
-  //     After performing reduce of production rules (class-name := ID,
-  //     enum-name := ID), the GSS now has two new heads:
-  //       0 -> 2 (class-name)
-  //        `-> 3 (enum-name)
+  // Reductions per reduce path.
+  struct ReduceAction {
----------------
This comment seems a bit confusing: it suggests that all reductions from a state are naturally "the same reduction" unless we're explicitly splitting them out for different paths here.

However, this doesn't really match the grammatical meaning of the word "reduction": it's the act of reducing **some symbols** using **a rule**. If the symbols or the rule change, it's a different reduction.

Maybe "A reduction is characterized by the symbols on the stack being reduced and the rule used to transform them. There may be multiple reductions possible for the same rule, if some node near the top of the stack has multiple parents."


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:205
+  struct ReduceAction {
+    std::vector<const Graph::Node *> ReducePath;
+    RuleID ReduceRuleID;
----------------
SmallVector: the size is bounded by the length of the rule's RHS.
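
E.g. (the inline capacity here is just a guess):

```
llvm::SmallVector<const Graph::Node *, 8> ReducePath;
```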


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:208
+  };
+  auto OrderCmp = [this](const ReduceAction &L, const ReduceAction &R) {
+    auto LBegin = L.ReducePath.back()->Parsed->startLoc();
----------------
If we can use the RuleID instead of the grammar table, the comparator becomes stateless and this code is way easier to (re)organize :-)
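
Sketch of what I mean, assuming the RuleIDs already encode the topological order of their LHS (the comparison directions would need to match the priority_queue's max-heap convention):

```
auto OrderCmp = [](const ReduceAction &L, const ReduceAction &R) {
  auto LBegin = L.ReducePath.back()->Parsed->startLoc();
  auto RBegin = R.ReducePath.back()->Parsed->startLoc();
  if (LBegin != RBegin)
    return LBegin < RBegin; // later-starting (shorter) reductions pop first
  return L.ReduceRuleID > R.ReduceRuleID; // then the baked-in LHS order
};
```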


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:211
+    auto RBegin = R.ReducePath.back()->Parsed->startLoc();
+    if (LBegin == RBegin)
+      return G.table()
----------------
nit: usually early-exit, e.g. `if (LBegin != RBegin) return LBegin < RBegin`

as well as reducing nesting, it puts the higher-priority rules first


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:214
+                 .Nonterminals[G.lookupRule(L.ReduceRuleID).Target]
+                 .TopologicalOrder >
+             G.table()
----------------
This isn't sufficient given your documented definition of topological order: if A and B are incomparable (neither A:=B nor B:=A) then both A and B could have topological order 5, and we could sort ReduceActions as ABAB and fail to pack the ambiguity.

We either need SymbolID as a second tiebreak or to bake the topological order into the symbolID/ruleid itself.
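
If you keep the grammar-table lookup, the tiebreak could look roughly like this (reusing the accessors already in the patch):

```
auto LTarget = G.lookupRule(L.ReduceRuleID).Target;
auto RTarget = G.lookupRule(R.ReduceRuleID).Target;
auto LOrder = G.table().Nonterminals[LTarget].TopologicalOrder;
auto ROrder = G.table().Nonterminals[RTarget].TopologicalOrder;
return std::tie(LOrder, LTarget) > std::tie(ROrder, RTarget);
```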


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:220
+  };
+  auto Equal = [this](const ReduceAction &L, const ReduceAction &R) {
+    return L.ReducePath.back()->Parsed->startLoc() ==
----------------
this is not equality of reduce actions, it's something else.

Maybe just SameRangeAndSymbol?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:227
+
+  // Forest node is unmutable. To create an amgbiguous forest node, we need to
+  // know all alternatives in advance.
----------------
unmutable --> immutable


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:227
+
+  // Forest node is unmutable. To create an amgbiguous forest node, we need to
+  // know all alternatives in advance.
----------------
sammccall wrote:
> unmutable --> immutable
nit: I think this comment belongs above the definition of order rather than below it.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:227
+
+  // Forest node is unmutable. To create an amgbiguous forest node, we need to
+  // know all alternatives in advance.
----------------
sammccall wrote:
> sammccall wrote:
> > unmutable --> immutable
> nit: I think this comment belongs above the definition of order rather than below it.
We haven't really explained *why* order is important to finding all alternatives.
The explanation here is useful, but still too far on the side of describing the code IMO.

```
As we reduce, new reductions may become available.
Initially, each stack head holds the terminal we just shifted.
-- expr -- + -- IDENT
We can only reduce by rules whose RHS end with this token.
After reducing by expr := IDENT we have an `expr` on the stack:
-- expr -- + -- expr
We can now reduce by expr := expr + expr.

If we reduce in arbitrary order until exhausted, we'd see all
possible reductions, but "related" reductions may not be
seen at the same time.
Reductions that interpret the same token range as the same
nonterminal should share a single Ambiguous forest node.
If they reach the same parse state, they share a GSS node too.
These nodes are immutable and so related reductions need
to happen in a batch.

So instead, we perform reductions in a careful order that
ensures all related reductions are visible at once.
A reduction of N tokens as symbol S can depend on:
  1) reducing the last M < N tokens as T, then S := ... T
  2) reducing the last N tokens as symbol T, then S := T
To handle 1), shorter reductions happen before longer ones.
To handle 2), we use the fact that S := T and T := S can't
both be possible (even transitively) in a valid grammar.
A topological order of the target symbols is used to ensure
that T is reduced before S for fixed N (in our example).
```


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:248
+      OrderedReduceList(OrderCmp);
+  auto addToOrderedReduceList = [&OrderedReduceList,
+                                 this](decltype(ReduceList) &ReduceList) {
----------------
nit: capital AddToOrdered...


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:248
+      OrderedReduceList(OrderCmp);
+  auto addToOrderedReduceList = [&OrderedReduceList,
+                                 this](decltype(ReduceList) &ReduceList) {
----------------
sammccall wrote:
> nit: capital AddToOrdered...
ReduceQueue?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:254
+      const Rule &ReduceRule = G.lookupRule(ReduceRuleID);
+      enumerateReducePath(RA.Head, ReduceRule.Size, ReducePath, [&]() {
+        OrderedReduceList.push({ReducePath, ReduceRuleID});
----------------
I think as previously discussed, I'd be happier if enumerateReducePath was an instance method and just directly wrote into OrderedReduceList (which can be a member), rather than passing around callbacks.

This seems like a tight loop to be using std::function
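
Roughly the shape I have in mind (a sketch only; exact signature up to you):

```
// OrderedReduceList becomes a member; enumerateReducePath pushes each
// complete path into it directly instead of invoking a std::function.
void GLRParser::enumerateReducePath(const Graph::Node *Head, RuleID Rule) {
  // ... walk G.lookupRule(Rule).Size predecessors back from Head, and for
  // every complete path P:
  //   OrderedReduceList.push({P, Rule});
}
```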


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:258
+    }
+    ReduceList.clear();
+  };
----------------
if OrderedReduceList was a member instead of a local here, we wouldn't need to move data from ReduceList into OrderedReduceList; we could just put it there in the first place


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:269
+             Equal(OrderedReduceList.top(), Batch.front()));
 
+    // newly-created GSS node -> corresponding forest node.
----------------
the overall function is too long; this next chunk "process a batch of reductions that produce the same symbol over the same token range" seems like a reasonable place to pull out a function.

We should introduce a name for this concept: maybe a reduction family?
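
e.g. something like (name/signature are just a suggestion):

```
// Performs all reductions in one family (same target symbol over the same
// token range): builds their sequence forest nodes, wraps them in a single
// Ambiguous node if there is more than one, and adds the new GSS heads.
void performReductionFamily(llvm::ArrayRef<ReduceAction> Family);
```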


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:273
+    // Used to assemble the ambiguous forest node at the end.
+    llvm::DenseMap<Graph::Node *, std::vector<const ForestNode *>>
+        BuiltForestNodes;
----------------
I don't really like the complexity of the transient data structures we're building here: many hashtables and vectors for each reduction.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:275
+        BuiltForestNodes;
+    // Track whether we hit ambiguities, determined by the equality of
+    // predecessors.
----------------
I'm not really sure precisely what "ambiguities", "equality" and "predecessors" mean in this comment.
Can we be more specific?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:280
+    for (const auto &RA : Batch) {
+      SymbolID ReduceSymbolID = G.lookupRule(RA.ReduceRuleID).Target;
+      // Create a corresponding sequence forest node for the reduce rule.
----------------
this doesn't make sense inside the loop: the batch has the same target symbol by definition


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:282
+      // Create a corresponding sequence forest node for the reduce rule.
+      std::vector<const ForestNode *> ForestChildren;
+      for (const Graph::Node *PN : llvm::reverse(RA.ReducePath))
----------------
seems a bit wasteful that we have to materialize a temporary vector to copy from just because our first one is in the wrong order! Can we have it be in the right order instead?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:285
+        ForestChildren.push_back(PN->Parsed);
+      const ForestNode &ForestNode = ParsedForest.createSequence(
+          ReduceSymbolID, RA.ReduceRuleID, ForestChildren.front()->startLoc(),
----------------
avoid same name for type & variable


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:303
+      //
+      // Each RA corresponds to a single reduce path, but a reduce path can have
+      // multiple Bases, which could split the stack (depends on whether their
----------------
so IIUC given we're reducing 3 symbols, head is Z, Z.parent is Y:
 - if Y has two parents X1, X2, then we may have multiple reduce paths/ReduceActions: [Z Y X1], [Z Y X2]
 - but if Y.parent is X and X has two parents W1 W2, then we have *one* reduce path [Z Y X] and two BaseInfos for W1 and W2.

why? why not rather say that there's just one ReduceAction concept and the Base is part of its identity, and it gets built by enumerateReducePath?
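
i.e. something like this (a sketch; I'm reading "base" as the GSS node just below the bottom of the reduce path, i.e. the node whose state we GOTO from):

```
struct ReduceAction {
  llvm::SmallVector<const Graph::Node *, 8> ReducePath; // top ... bottom
  const Graph::Node *Base; // a parent of ReducePath.back()
  RuleID ReduceRuleID;
};
// enumerateReducePath emits one ReduceAction per (path, base) pair, so the
// separate BaseInfo bookkeeping goes away.
```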


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:324
 
       llvm::ArrayRef<BaseInfo> Partition = llvm::makeArrayRef(Bases);
       while (!Partition.empty()) {
----------------
partition seems like a really important concept here, but isn't defined.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:338
 
+        // Not needed, as it is created outside of the partition-loop.
+        // Create a corresponding sequence forest node for the reduce rule.
----------------
I'm not really sure what this chunk of commented-out code is trying to say


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:351
+
         // Check ambiguities.
+        // FIXME: this is a linear scan, it might be too slow.
----------------
Again, you're talking about ambiguity, but haven't ever really defined it (apart from how we're going to represent parse ambiguity in the forest). Can we use more precise language? It's hard to understand what the purpose of this block is.


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:376
 
-        // Create a corresponding sequence forest node for the reduce rule.
-        std::vector<const ForestNode *> ForestChildren;
-        for (const Graph::Node *PN : llvm::reverse(ReducePath))
-          ForestChildren.push_back(PN->Parsed);
-        const ForestNode &ForestNode = ParsedForest.createSequence(
-            ReduceRule.Target, RA.PerformAction->getReduceRule(),
-            ForestChildren.front()->startLoc(), ForestChildren);
-        LLVM_DEBUG(llvm::dbgs() << llvm::formatv(
-                       "     after reduce: {0} -> state {1} ({2})\n",
-                       llvm::join(getStateString(Predecessors), ", "),
-                       NextState, G.symbolName(ReduceRule.Target)));
-
-        // Create a new stack head.
-        const Graph::Node *Head =
-            GSS.addNode(NextState, &ForestNode, Predecessors);
-        CreatedHeads.push_back(Head);
+        // Create a new GSS node.
+        Graph::Node *Head = GSS.addNode(NextState, &ForestNode, Predecessors);
----------------
nit: comment just echoes the code


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:381
 
-        // Actions that are enabled by this reduce.
+        // Actions that are enabled by this reduction.
         addActions(Head, Lookahead);
----------------
sure, but why?


================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:386
+    // We're good to assmeble the ambiguous forest node if any.
+    for (auto It : BuiltForestNodes) {
+      if (It.second.size() > 1) {
----------------
we're iterating over BuiltForestNodes in order to... build forest nodes

I guess BuiltForestNodes should rather be SequenceNodes?

Why is it that we're grouping by GSS node here? My understanding was that we want one forest node per (nonterminal, token range), but GSS nodes are finer-grained than that (e.g. they may be differentiated only by parse state, possibly by parent list too?)
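
i.e. within one batch the (symbol, token range) is constant by construction, so something flat might do (a sketch; the ambiguous-node construction is whatever the forest API provides):

```
// Collect the alternative sequence nodes for this batch...
llvm::SmallVector<const ForestNode *, 4> Sequences;
// ... one push_back per reduction performed ...
// ...and wrap them once at the end:
//   Sequences.size() == 1 ? Sequences.front()
//                         : <ambiguous node over Sequences>
```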


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121368/new/

https://reviews.llvm.org/D121368


