[PATCH] D121150: [pseudo][WIP] Implement a GLR parser.

Tue Mar 8 08:03:01 PST 2022

sammccall added inline comments.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:30
+// A node in a forest.
+class ForestNode {
+public:
----------------
sammccall wrote:
> I wonder if we should put these in a namespace `forest::Node` rather than giving them the "Forest" prefix?
alignas(ForestNode*)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:67
+  std::string Dump(const Grammar &) const;
+  std::string DumpRecursive(const Grammar &, bool abbreviated = false) const;
+
----------------
dump(), dumpRecursive()

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:78
+  }
+  static uint16_t SequenceData(RuleID rule,
+                               llvm::ArrayRef<const ForestNode *> elements) {
----------------
Rule, Elements

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:90
+  // Terminal - unused
+  uint16_t Data_;
+  // A trailing array of Node* .
----------------
remove trailing _

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:91
+  uint16_t Data_;
+  // A trailing array of Node* .
+};
----------------
ForestNode*

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:100
+public:
+  size_t nodeNum() const { return NodeNum; }
+  size_t bytes() const { return Arena.getBytesAllocated() + sizeof(this); }
----------------
count()?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:40
+
+// An implementation of a directed acyclic graph (DAG), used as a
+// graph-structured stack (GSS) in the GLR parser.
----------------
the fact that this uses a DAG seems less important than what it represents. I think the next sentence is a better introduction, and then we should describe what data the stacks are modelling, finally we should describe the data structure after (see comment below).

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks).

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:43
+//
+// GSS is an efficient data structure to represent multiple active stacks, it
+// employs a stack-combination optimization to avoid potentially exponential
----------------
sammccall wrote:
> as with forest, I think this places too much emphasis on the memory-compactness when it's not the main benefit. (If RAM were free we still wouldn't want a big array of stacks).
To address the two above comments, maybe something like.

```
A Graph-Structured Stack represents the multiple parse stacks for a generalized LR parser.

Each node stores a parse state, the last parsed ForestNode, and its parent(s).
There may be several heads (top of stack), and the parser operates by:
 - shift: pushing terminal symbols on top of the stack
 - reduce: replace N symbols on top of the stack with one nonterminal

The structure is a DAG rather than a stack:
- GLR allows multiple actions on the same head, producing forks (nodes with the same parent).
- The parser reconciles nodes with the same (state, ForestNode), producing joins (node with multiple parents).

The parser is responsible for creating nodes, and keeping track of the set of heads.
```

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:46
+// growth of the stack:
+//  - combing equal stack prefixes -- A new stack doesn't need to have a full
+//    copy of its parent’s stack. They share a common prefix.
----------------
combing -> combining

I wonder if combining is the right explanation here - unlike local ambiguity packing, it's not like we produce two things and then merge them.

What about rather: `sharing stack prefixes: when the parser must take two conflicting paths, it creates two new stack head nodes with the same parent.`

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+//    copy of its parent’s stack. They share a common prefix.
+//  - combing euqal stack suffices -- as there are a finite number of DFA's
+//    state the parser can be in. A set of heads can be in the same state
----------------
combing -> combining
euqal -> equal

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:48
+//    copy of its parent’s stack. They share a common prefix.
+//  - combing euqal stack suffices -- as there are a finite number of DFA's
+//    state the parser can be in. A set of heads can be in the same state
----------------
sammccall wrote:
> combing -> combining
> euqal -> equal
the "-- as..." clause is a bit confusing and doesn't seem necessary.

I think what's missing here is a high level explanation of what's going on in the parse to trigger this, and explicitly mentioning this is how we get a node with multiple parents.

When alternative parses converge on the same interpretation of later tokens, their stack heads end up  in the same state. These are merged, resulting in a single head with several parents.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:53
+//
+// E.g. we have two active stacks:
+//   0 -> 1 -> 2
----------------
this shows forking but not merging
and in particular it suggests that #heads == #stacks, which is not the case

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:54
+// E.g. we have two active stacks:
+//   0 -> 1 -> 2
+//        |    ^ head1, representing a stack [2, 1, 0]
----------------
arrows are pointing the wrong way (at least opposite to our pointers)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:58
+//             ^ head2, representing a stack [3, 1, 0]
+struct Graph {
+  // Represents a node in the graph.
----------------
the name "Graph" is IMO too general, this isn't a reusable graph or even dag class.

I think GSS is fine, it's a distinctive name from the literature, and the expansion of the abbreviation is nice and descriptive (but too verbose to be the actual name)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:59
+struct Graph {
+  // Represents a node in the graph.
+  struct Node {
----------------
this comment doesn't say anything, remove?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:60
+  // Represents a node in the graph.
+  struct Node {
+    // The parsing state presented by the graph node.
----------------
alignas(Node*)

(in practice this will match ForestNode* so is a no-op, but it documents the intent)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:61
+  struct Node {
+    // The parsing state presented by the graph node.
+    LRTable::StateID State : LRTable::StateBits;
----------------
again, drop comment unless there's something to say

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:64
+    static constexpr unsigned PredecessorBits = 3;
+    // Number of the predecessors of the node.
+    // u is the predecessor of v, if u -> v.
----------------
This is not what predecessor usually means in graph theory: u is the predecessor of v if there is *some path* from u->v.

I think "parent" is a common and well-understood term.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:68
+    // The forest node for a termina/nonterminal symbol.
+    // The symbol correponds to the label of edges which leads to current node
+    // from the predecessor nodes.
----------------
Also not sure what this comment says:
 - first line just repeats the type: type is a forest node, forest nodes are always for symbols, and terminal/nonterminal are all the possibilities
 - second line is either referring to or defining (not sure) edge labels, but edge labels aren't defined or referred to anywhere else, so why?

Maybe:
```
The parse tree for the last symbol we parsed.
This symbol appears on the left of the dot in the parse state.
```

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:77
+
+    bool operator==(const Node &L) const {
+      return State == L.State && predecessors() == L.predecessors();
----------------
why is Parsed not part of the identity?
(maybe there's a good reason, but there should be a comment)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:87
+                      llvm::ArrayRef<const Node *> Predecessors) {
+    assert(Predecessors.size() < (1 << Node::PredecessorBits) &&
+           "Too many predecessors to fit in PredecessorBits!");
----------------
8 parents isn't that many and it seems like this might be a dynamic property of the input code rather than a static property of the grammar.

But I don't think this bitpacking is buying anything, it looks like the layout is:
 - State : 13
 - PredecessorCount : 3
 - (padding) : 48
 - Parsed : 64

So I think we might as well just use a uint16 PredecessorCount and still have room left over for a uint16 refcount later. Is there anything else we might usefully store in the extra space?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:100
+  size_t bytes() const { return Arena.getTotalMemory() + sizeof(*this); }
+  size_t nodeCount() const { return NodeCount; }
+
----------------
name consistently with forest arena

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:112
+
+  const ForestNode *parse(const TokenStream &Code);
+
----------------
as discussed offline, the signature here should involve a token range, or likely an ArrayRef<ForestNode> for the relevant terminals.

Probably a FIXME to allow the start symbol to be specified.

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:114
+
+  const Graph &getGSS() const { return GSS; }
+
----------------
sammccall wrote:
> this function never exposes an interesting graph, as it's mostly empty before + after parse().
> It's only really useful for working out how much memory is allocated.
> 
> If we were to drop this, the public interface of GLRParser would be a single function (so it need not be a public class) and Graph would disappear entirely!
> 
> With this in mind, it seems like we could replace this header file with something like:
> 
> ```
> struct ParseStats {
>   unsigned GSSBytes = 0;
> };
> ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, ForestArena&, ParseStats *Stats = nullptr);
> ```
> 
> It feels like hiding the GSS might be an obstacle to debugging/testing.
> However this really would need a story/API around how to interrupt the parser.
> LLVM_DEBUG seems reasonable to me for now.
After offline discussion I think we either want:
 - to hide GSS as mentioned above, and just test the forest output
 - expose GSS class, and have a two versions of the parse function: one that finalizes the GSS into a result ForestNode, and a "raw" one that just returns the GSS. Then we can get at the GSS state at a point by running the raw parser on some prefix of the code. (Or even *only* have the raw one and make it the caller's responsibility to extract the node from the GSS, but this is probably silly)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:129
+  // Frontier is to track all avaiable actions from all active stack heads.
+  struct Frontier {
+    // A corresponding stack head.
----------------
I love the name frontier!
Unfortunately the word refers to the whole boundary/set, rather than individual elements of it.

Maybe this?
```
/// The list of actions we're ready to perform.
struct {
  std::vector<Pending> Shift;
  std::vector<Pending> Reduce;
  std::vector<Pending> Accept;
  bool empty(); // accept is an odd-one-out here, but I think it's going away?
} Frontier;
```

Also maybe a FIXME that the frontier needs to order reductions so that local ambiguity packing is guaranteed to work as a single pass

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:133
+    // An action associated with the Head.
+    const LRTable::Action *PerformAction = nullptr;
+  };
----------------
just Action?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:30
+  }
+  llvm_unreachable("unhandle node kind!");
+}
----------------
unhandled

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:33
+  ParsedForest.init(Code);
+  const Token *Lookahead = &Code.tokens().front();
+  addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead);
----------------
sammccall wrote:
> It actually currently looks like it would be at least as natural to have the parser operate on the sequence of terminal ForestNodes rather than on the Tokens themselves...
I'd suggest calling this NextTok instead of Lookahead:
 - in common english this is a verb rather than a noun
 - in parsers it more often refers to the *number* of tokens in the tables than the tokens themselves
 - referring to the token you're *currently shifting* as "ahead" is bizarre!

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:67
+
+std::vector<const Graph::Node *>
+GLRParser::performShift(Token::Index Lookahead) {
----------------
if this is a hot loop we shouldn't be creating std::vectors and throwing them away

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:80
+
+  // Merge the stack -- if multiple stack heads are going to shift a same
+  // state, we perform the shift only once by combining these heads.
----------------
we shift *tokens*, not states: the token conceptually moves from the input to the stack. (This seems to be other references not just me!)

if multiple stack heads will reach the same state after shifting a token?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:85
+  // state 2 and state 3:
+  //   0 -> 1 -> 2
+  //        ` -> 3
----------------
I think the arrows directions aren't particularly clear (our pointers go the opposite way) or necessary here, and backticks and up arrows are hard to read.

A couple of (widely-supported) box-drawing characters would help a lot I think:

```
0--1--2
   └--3

0--1--2--4
   └--3--┘

```

(I would keep using `-` and `|` rather than making everything pretty boxes, it's readable enough and easier to edit. We may want to consider box-drawing characters for dump functions though...)

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:93
+  // Shifts are partitioned by the shift state, so each partition (per loop
+  // iteration) corresponds to a "perform" shift.
+  llvm::sort(ShiftList, [](const Frontier &L, const Frontier &R) {
----------------
I can't parse `a "perform" shift`.

Batch shifts by target state so we can merge matching groups?

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:97
+           R.PerformAction->kind() == LRTable::Action::Shift);
+    return std::forward_as_tuple(L.PerformAction->getShiftState(), L.Head) <
+           std::forward_as_tuple(R.PerformAction->getShiftState(), R.Head);
----------------
I don't think tiebreaking by `Head` achieves anything.

If we really need to ensure deterministic behavior, comparing pointers won't do that as allocation may vary.
On the other hand, `stable_sort` should work: we'll tiebreak by order enqueued, so we're deterministic if our inputs are.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:108
+    // Predecessors of the new head in GSS.
+    std::vector<const Graph::Node *> Predecessors;
+    llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
----------------
SmallVector

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:110
+    llvm::for_each(Batch, [&Predecessors](const Frontier &F) {
+      assert(llvm::find(Predecessors, F.Head) == Predecessors.end() &&
+             "Unexpected duplicated stack heads during shift!");
----------------
llvm::is_contained

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:114
+    });
+    const auto *Head = GSS.addNode(NextState, Leaf, Predecessors);
+    LLVM_DEBUG(llvm::dbgs()
----------------
if this turns out to be hot, you could avoid the temporary array of predecessors: return a mutable node from addNode

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:128
+static std::vector<std::string>
+getStateString(llvm::ArrayRef<const Graph::Node *> A) {
+  std::vector<std::string> States;
----------------
name doesn't match return type

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:131
+  for (const auto &N : A)
+    States.push_back(llvm::formatv("state {0}", N->State));
+  return States;
----------------
I suspect `S{0}` will be verbose enough, it's not likely spelling out `state` will make this much less cryptic, but it may be harder to scan

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:141
+  assert(PathStorage.empty() && "PathStorage must be empty!");
+  std::function<void(const Graph::Node *, unsigned)> enumPath =
+      [&CB, &PathStorage, &enumPath](const Graph::Node *Current,
----------------
We're leaning on std::function a lot for code that's supposedly in the hot path.
It's probably fine, but I'd like to see this written more directly.

As far as I can tell, this could easily be a class, with enumerateReducePath a (recursive) method that concretely calls into handleReducePath() or something at the bottom.

Also if it's a class we can easily reset() instead of destroying it if we want to reuse its vectors across separate reductions.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:158
+// heads.
+void GLRParser::performReduction(const Token &Lookahead) {
+  if (!ReduceList.empty())
----------------
(just a note, I need to dig into this part tomorrow!)

================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
                             desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+                                  init(""));
----------------
nit: a C++ source file => a source file

================
Comment at: clang/tools/clang-pseudo/ClangPseudo.cpp:33
                             desc("Print the LR table for the grammar"));
+static opt<std::string> ParseFile("parse", desc("Parse a C++ source file"),
+                                  init(""));
----------------
sammccall wrote:
> nit: a C++ source file => a source file
Why is this a new flag independent of `Source`?

It also seems inconsistent with other flags, which describe what output is requested rather than what computation should be done.

Maybe "-print-forest" and "-print-stats"? (which could also potentially control stats of other stages)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121150/new/

https://reviews.llvm.org/D121150