[PATCH] D61637: [Syntax] Introduce syntax trees

Thu Jun 13 05:04:51 PDT 2019

ilya-biryukov added a comment.

This is not 100% ready yet, but wanted to send it out anyway, as I'll be on vacation until Tuesday.

I've addressed most major comments. In particular, `TreeBuilder` now looks simpler (and more structured)  to my taste.
One thing that's missing is adding children in arbitrary order. It won't be too complicated (would require some thought on how to properly create recovery nodes, though). I'd be tempted to land the current implementation as is and allow adding children in arbitrary order in a separate change (alongside more types of nodes), but let me know what you think.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:104
+
+  llvm::ArrayRef<syntax::Token> tokens() const { return Tokens; }
+
----------------
sammccall wrote:
> as discussed offline, having a leaf node store a range of tokens (rather than just one) looks attractive now, but as the syntax tree gets more detailed there are relatively few cases where multiple consecutive tokens should really be undifferentiated siblings.
> 
> Might be better to bite the bullet now and make leaf hold a single token, so our consecutive token ranges become a linked list. This will also flush out accidental assumptions that only tokens in the same Leaf are adjacent.
> 
> Given this, I'm not sure there's a better name than `syntax::Leaf`. You might consider making it a pointer-like object dereferenceable to Token&.
Leaf now stores a single token, that actually simplifies things quite a bit, thanks!
I'd avoid making it a pointer-like object, given that nodes are often passed as pointers on their own. Making them a pointer-like object would mean we can get code that does double deferences (Tok = **Leaf).

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:129
+  /// by TreeBuilder.
+  void prependChildLowLevel(Node *Child);
+  friend class TreeBuilder;
----------------
sammccall wrote:
> (curious: why prepend rather than append?)
Appending to a linked list is `O(n)`. If we reverse it, traversing left-to-right order is `O(n)`.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:129
+  /// by TreeBuilder.
+  void prependChildLowLevel(Node *Child);
+  friend class TreeBuilder;
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > (curious: why prepend rather than append?)
> Appending to a linked list is `O(n)`. If we reverse it, traversing left-to-right order is `O(n)`.
> 
Append is `O(n)` in the current representation as it requires walking to the tail of the list.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:41
+  /// Add a leaf node for a token starting at \p Loc.
+  void learnTokenNode(SourceLocation Loc, tok::TokenKind Kind);
+
----------------
sammccall wrote:
> Particularly in view of having tokens be 1:1 with Leaf, *constructing* the token nodes as part of higher level constructs / as part of recovery seems a little odd.
> 
> What if we constructed all the leaf nodes up front, forming a linked list:
> `int -> a -> = -> 2 -> + -> 2 -> ; -> eof`
> 
> When you form a node that covers a range, you splice out the nodes in that range, replacing with the new node:
> 
> `int -> a -> = -> (2 + 2) -> ; -> eof`
> `(int a = (2 + 2)) -> ; -> eof`
> etc
> 
> Then the invariant is you have a forest, the roots form a linked list, and the trees' respective leaves are a left-to-right partition of the input tokens.
> 
> I think this would mean:
>  - no separate vector<RangedNode> data structure (AFAICS we can reuse Node)
>  - don't have the requirement that the formed node must claim a suffix of the pending nodes, which simplifies the recursive AST visitior
> 
> We lose the binary search, but I think tracking the last accessed root (current position) and linear searching left/right will be at least as good in practice, because tree traversal is fundamentally pretty local.
I went with a slightly different approach, similar to how parser does it. Please let me know what you think.
The new `Forest` helper struct ensures the tree structure invariants (all tokens must be covered, nodes must nest properly based on a syntax structure), the rest of the code in tree builder takes care of folding the nodes in a proper order and properly advancing the token stream (it's somewhat similar in the details of how parsers are implemented, except that instead of parsing we actually walk a pre-parsed AST).

It still needs some comments, but I think its intentions should be clear.
Let me know what you think, happy to discuss this offline too.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:48
+  /// Consume the root node.
+  syntax::TranslationUnit *root() && {
+    assert(Root);
----------------
sammccall wrote:
> any particular reason learnRoot() and root() are different functions?
> 
> If learnRoot() returned TranslationUnit*, then we avoid the need for the caller to know about the dependency, it would track the state itself.
`learnRoot` is called inside ast visitor when processing `TranslationUnitDecl` and `root()` is used to consume the result.

I guess we could just delay `learnRoot` until `consume()` is called, shouldn't be a big deal. I'll do this in the next iteration.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:93
+    Builder.learnTokenNode(S->getLBracLoc(), tok::l_brace);
+    Builder.learnTokenNode(S->getRBracLoc(), tok::r_brace);
+
----------------
sammccall wrote:
> btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?
This will currently break and we should definitely fix this

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:93
+    Builder.learnTokenNode(S->getLBracLoc(), tok::l_brace);
+    Builder.learnTokenNode(S->getRBracLoc(), tok::r_brace);
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > btw what if LBracLoc or RBracLoc are invalid here due to parser recovery?
> This will currently break and we should definitely fix this
We don't recover from errors properly here, I'd add a FIXME and figure this out later. Does that SG?

The general strategy I would propose is to just skip the tokens (we will return null from the corresponding accessors, etc) and create recovery nodes (`RecoveryExpression`, etc.) for composite nodes.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637