[PATCH] D121150: [pseudo][WIP] Implement a GLR parser.

Mon Mar 7 16:00:39 PST 2022

sammccall added a comment.

Nice! Some early comments, I haven't gotten deep into the GLR algorithm itself

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:9
+//
+// Parse forest is the output of the GLR parser.
+//
----------------
nit: i think we should reverse the emphasis: the one-sentence summary should say what this models (a set of possible parse trees) and then elaborate below that it's the output of GLR

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:12
+// For an ambiguous grammar, there might be multiple parse trees generated from
+// for the given input. Forest is a DAG which represent numberous possible in a
+// space-efficient manner. Common subtrees are shared -- if two or more trees
----------------
I think we should be a bit more verbose than "forest is a dag" since this is very confusing/surprising (at least to me).

Maybe

```
A parse forest represents a set of possible parse trees."
Despite the name, the data structure is a parse-tree-like DAG with a single root.
Multiple ways to parse the same tokens are represented as an Ambiguous node with the possible interpretations as children.
Common sub-parses are shared: if two interpretations both parse the token range "1 + 1" as expr := expr + expr, they will share a Sequence node representing this.
```

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:13
+// for the given input. Forest is a DAG which represent numberous possible in a
+// space-efficient manner. Common subtrees are shared -- if two or more trees
+// treat the token range [1, 3) as an Expression, then there is a single shared
----------------
nit: numberous -> numerous

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:13
+// for the given input. Forest is a DAG which represent numberous possible in a
+// space-efficient manner. Common subtrees are shared -- if two or more trees
+// treat the token range [1, 3) as an Expression, then there is a single shared
----------------
sammccall wrote:
> nit: numberous -> numerous
wording is a bit misleading here:
 - suggests that the primary virtue is being space efficient when really it eliminates other forms of redundancy too (like traversal)
 - calling these "subtrees" is a bit confusing since they're not trees (but represent sets of trees)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:29
+
+// A node in a forest.
+class ForestNode {
----------------
We should say something more than this about node structure:
 - nodes represent ways to interpret a sequence of tokens
 - a node always interprets a fixed set of tokens as a fixed symbol kind
 - some nodes may have children, and have pointers to them
 - all nodes may have multiple parents, but do not know them

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:30
+// A node in a forest.
+class ForestNode {
+public:
----------------
I wonder if we should put these in a namespace `forest::Node` rather than giving them the "Forest" prefix?

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:46
+
+  // The parses for each element in the RHS of the rule.
+  // REQUIRES: this is a Sequence node.
----------------
this sentence doesn't make sense

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:57
+  };
+  uint32_t startLoc() const { return StartLoc; }
+
----------------
maybe startIndex or startToken? Loc reminds me too much of SourceLocation...

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/Forest.h:111
+  const ForestNode &terminal(Token::Index Index) const {
+    assert(Terminals && "Terminals are not intialized!");
+    return Terminals[Index];
----------------
it seems weird to have this state... can we replace init + terminal() with
`ArrayRef<ForestNode> createTerminals(const TokenStream&)`
and have the parser responsible for keeping track of pointers (as it does for other nodes)?

This would also make this fit the pure "arena" concept better.

I'm digging through notes on the prototype and I'm not convinced that my reasons for keeping track of these in the arena make sense.
(They were about ensuring equivalent nodes are shared between heuristic vs grammar parser, but I'm not sure why this only needs to be done for terminals, and there are other ways to do it)

================
Comment at: clang/include/clang/Tooling/Syntax/Pseudo/GLRParser.h:114
+
+  const Graph &getGSS() const { return GSS; }
+
----------------
this function never exposes an interesting graph, as it's mostly empty before + after parse().
It's only really useful for working out how much memory is allocated.

If we were to drop this, the public interface of GLRParser would be a single function (so it need not be a public class) and Graph would disappear entirely!

With this in mind, it seems like we could replace this header file with something like:

```
struct ParseStats {
  unsigned GSSBytes = 0;
};
ForestNode *glrParse(const TokenStream &, const LRTable&, const Grammar&, ForestArena&, ParseStats *Stats = nullptr);
```

It feels like hiding the GSS might be an obstacle to debugging/testing.
However this really would need a story/API around how to interrupt the parser.
LLVM_DEBUG seems reasonable to me for now.

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:55
+                 llvm::Optional<SymbolID> ElidedParent) {
+        // absl::Span<const Node* const> children;
+        llvm::ArrayRef<const ForestNode *> children;
----------------
commented out code

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/Forest.cpp:74
+                    children[i], Level, End,
+                    /*elided_parent=*/ElidedParent.getValueOr(P->symbol()));
+              }
----------------
a few of these comments have the wrong case on the names

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:33
+  ParsedForest.init(Code);
+  const Token *Lookahead = &Code.tokens().front();
+  addActions(GSS.addNode(/*StartState*/ 0, nullptr, {}), *Lookahead);
----------------
It actually currently looks like it would be at least as natural to have the parser operate on the sequence of terminal ForestNodes rather than on the Tokens themselves...

================
Comment at: clang/lib/Tooling/Syntax/Pseudo/GLRParser.cpp:64
+  }
+  return nullptr;
+}
----------------
Returning null when the input doesn't match the grammar is not at all what we want.
I know it's early, but I'm worried we're defining an API for the GLR parser that matches the textbook rather than the one we need. (Similar to concerns about the Accept action - we're going to accept any stream of tokens).

I think we need to plan a little bit how error recovery is going to work here:
 - do we plan to call this recursively inside brackets, or handle lazy-brackets in the GLR parser itself?
 - will sequence recovery (after `,` or between declarations) happen in the GLR parser or externally? in this top loop? how will parsing continue?
 - where would splitting heuristics happen (like interpreting `x? = y?` as an assignment expression, or `foo(; bar;` as two statements)
 - in the worst case, when we're parsing something with no sequence recovery (say an expression) and we don't match the grammar, no splitting rule applies etc... are we going to create an opaque node? Does this function do it or the caller?

I suspect it would be a better fit to:
 a) have an Opaque node, if not in this patch then very soon
 b) operate on an ArrayRef<Token> or ArrayRef<ForestNode>, and stop relying on eof to delimit (so we can parse subranges if needed)
 c) stop relying on Accept actions, since we'll want to look at the stack when we run out of tokens in general, and Accept just models a trivial case of that

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121150/new/

https://reviews.llvm.org/D121150