[PATCH] D61637: [Syntax] Introduce syntax trees

Fri Jun 7 06:22:26 PDT 2019

sammccall added a comment.

In D61637#1527727 <https://reviews.llvm.org/D61637#1527727>, @ilya-biryukov wrote:

> I've addressed most of the comments, except the naming ones.
>  We need a convention for naming the language nodes and names for composite and leaf structural nodes.
>
> For "language" nodes, I suggest we use  `CompoundStatement`, `Recovery`, `TopLevelDeclaration`, `TemplateArgumentList`, `TypeTemplateArgument`, etc. That is, we spell out the words in full, no shorthands like `Stmt` or `Expr`. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

SGTM.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:104
+
+  llvm::ArrayRef<syntax::Token> tokens() const { return Tokens; }
+
----------------
as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings.

Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent.

Given this, I'm not sure there's a better name than `syntax::Leaf`. You might consider making it a pointer-like object dereferenceable to Token&.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:111
+/// A composite tree node that has children.
+class TreeNode : public Node {
+public:
----------------
As discussed, I think `syntax::Tree` might actually be a better name here.

"A Node is either a Tree or a Leaf" is a bit weird, but not too hard to remember I think.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:129
+  /// by TreeBuilder.
+  void prependChildLowLevel(Node *Child);
+  friend class TreeBuilder;
----------------
(curious: why prepend rather than append?)

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:26
+/// AST. The tree is built left-to-right and bottom-up. At each point of the
+/// traversal we maintain a list of currently processed nodes.
+class syntax::TreeBuilder {
----------------
I find "currently processed" a bit vague. "Pending"?

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:37
+  /// and \p Last are added as children of the new node.
+  void learnNode(SourceLocation Fist, SourceLocation Last,
+                 syntax::TreeNode *New);
----------------
maybe formNode, formToken, formRoot()?

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:37
+  /// and \p Last are added as children of the new node.
+  void learnNode(SourceLocation Fist, SourceLocation Last,
+                 syntax::TreeNode *New);
----------------
sammccall wrote:
> maybe formNode, formToken, formRoot()?
if syntax nodes strictly nest and we form left-to-right and bottom up, then why are there ever pending nodes that aren't in the range? Is it because we don't aggregate them as early as possible?

(edit: after offline discussion, there are precise invariants here that could be documented and asserted)

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:41
+  /// Add a leaf node for a token starting at \p Loc.
+  void learnTokenNode(SourceLocation Loc, tok::TokenKind Kind);
+
----------------
Particularly in view of having tokens be 1:1 with Leaf, *constructing* the token nodes as part of higher level constructs / as part of recovery seems a little odd.

What if we constructed all the leaf nodes up front, forming a linked list:
`int -> a -> = -> 2 -> + -> 2 -> ; -> eof`

When you form a node that covers a range, you splice out the nodes in that range, replacing with the new node:

`int -> a -> = -> (2 + 2) -> ; -> eof`
`(int a = (2 + 2)) -> ; -> eof`
etc

Then the invariant is you have a forest, the roots form a linked list, and the trees' respective leaves are a left-to-right partition of the input tokens.

I think this would mean:
 - no separate vector<RangedNode> data structure (AFAICS we can reuse Node)
 - don't have the requirement that the formed node must claim a suffix of the pending nodes, which simplifies the recursive AST visitior

We lose the binary search, but I think tracking the last accessed root (current position) and linear searching left/right will be at least as good in practice, because tree traversal is fundamentally pretty local.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:48
+  /// Consume the root node.
+  syntax::TranslationUnit *root() && {
+    assert(Root);
----------------
any particular reason learnRoot() and root() are different functions?

If learnRoot() returned TranslationUnit*, then we avoid the need for the caller to know about the dependency, it would track the state itself.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:55
+private:
+  struct RangedNode {
+    RangedNode(llvm::ArrayRef<syntax::Token> Tokens, syntax::Node *Node)
----------------
or NodeForRange

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:92
+  bool TraverseCompoundStmt(CompoundStmt *S) {
+    Builder.learnTokenNode(S->getLBracLoc(), tok::l_brace);
+    Builder.learnTokenNode(S->getRBracLoc(), tok::r_brace);
----------------
So explicitly claiming the primitive tokens, but implicitly claiming the subtrees, seems like a weird mix.
Having both explicit might be nicer:
 - It seems somewhat likely we want to record/tag their semantics (maybe in the child itself, or even the low bits of its pointer?), rather than having accessors scan around looking for something likely.
 - currently when expected subtrees fail to parse, their tokens get (implicitly) wrapped up in Recovery nodes. They're good targets for heuristic parsing, but this probably means we should record what an unexplained range of tokens is supposed to be for.

Thinking of something like:
```
builder.expectToken(l_brace, S->getLBracLoc());
builder.expectTree(Statement, S->getBody());
builder.expectToken(r_brace, S->getRBracLoc());
```
where the builder would react to non-null AST nodes by mapping the associated syntax node, and null AST nodes by trying to heuristically parse the tokens in between LBracLoc and RBracLoc.

But lots of unanswered questions here: body is a list of statements, how does that work? What if LBracLoc or RBracLoc is missing? etc.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:93
+    Builder.learnTokenNode(S->getLBracLoc(), tok::l_brace);
+    Builder.learnTokenNode(S->getRBracLoc(), tok::r_brace);
+
----------------
btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:135
+
+void syntax::TreeBuilder::learnNodeImpl(const syntax::Token *Begin,
+                                        const syntax::Token *End,
----------------
this function needs some high-level implementation comments

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:164
+  };
+  for (auto It = FirstChild; It != NodesInProgress.end(); ++It) {
+    // Add non-coverred ranges as recovery nodes.
----------------
why can It not point to a node that spans/is past End?

(edit after offline discussion: it's an important invariant that we're always consuming a suffix of the pending nodes)

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:226
+  auto &SM = Arena.sourceManager();
+  auto It =
+      std::lower_bound(Tokens.begin(), Tokens.end(), TokLoc,
----------------
or It = bsearch(Tokens, [&](const Syntax::Token& L) { return !SM.isBeforeInTranslationUnit(L.location(), TokLoc); })

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637