[PATCH] D61637: [Syntax] Introduce syntax trees

Mon Jun 3 09:59:18 PDT 2019

ilya-biryukov added a comment.

I've addressed most of the comments, except the naming ones.
We need a convention for naming the language nodes and names for composite and leaf structural nodes.

For "language" nodes, I suggest we use  `CompoundStatement`, `Recovery`, `TopLevelDeclaration`, `TemplateArgumentList`, `TypeTemplateArgument`, etc. That is, we spell out the words in full, no shorthands like `Stmt` or `Expr`. That would make things a bit more verbose, but hopefully that helps distinguish from clang AST.

For structural nodes, see the relevant comment threads.

================
Comment at: clang/include/clang/Tooling/Syntax/Arena.h:1
+//===- Arena.h - memory arena and bookkeeping for syntax trees --- C++ -*-===//
+//
----------------
sammccall wrote:
> From a user's point of view, this looks a lot like part of the tree structure in some sense.
> 
> If you expect that users will need to keep the arena rather than the TokenBuffer around (e.g. so nodes can be allocated), then it might make sense to declare it at the bottom of `Cascade.h`
Now part of `Tree.h`

================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:38
+  std::pair<FileID, llvm::ArrayRef<syntax::Token>>
+  tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Buffer);
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > ilya-biryukov wrote:
> > > sammccall wrote:
> > > > Are you planning to have a way to add tokens directly? Having to turn them into text and re-lex them seems like it might be inconvenient.
> > > The tokens have source locations and refer to a text in some buffer. `tokenizeBuffer` makes is easier, not harder, to mock tokens.
> > Fair enough.
> > 
> > `lexBuffer` might be a slightly clearer name?
> > 
> > Who are the intended non-test users of this function? Are they better served by being able (and responsible) for constructing a MemoryBuffer with e.g. a sensible name and ownership, or would it be better to pass a StringRef and have the Arena come up with a sensible "anonymous" name?
> The only use-case in my prototype so far is creating token nodes for punctuation nodes, e.g. say you want to create an expr of the form `<a>+<b>`, where both `a` and `b` are existing expressions and you need to synthesize a leaf node for `+`.
> We use this function to synthesize a buffer with the corresponding token.
> 
> All the use-cases I can imagine are around synthesizing syntax trees (as opposed to constructing them from the AST).
Renamed to `lexBuffer`. This does not have usages (yet), we can also remove it from this patch if needed.

================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:40
+
+  /// Construct a new syntax node of a specified kind. The memory for a node is
+  /// owned by the corpus and will be freed when the corpus is destroyed.
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > Now there's two ways to do this: `new (C.allocator()) T(...)` or `C.construct<T>(...)`. Do we need both?
> > > 
> > > (If we do, is the syntax `new (C) T(...)` more natural?)
> > I think `C.construct<T>()` read better than `new (C) T(...)`. Not a big fan of placement new exprs.
> They are fairly consistently used in llvm/clang for this sort of thing, though.
> 
> I find it valuable because arena allocations look otherwise a lot like buggy ownership patterns. The dedicated syntax calls out the unusual case: we're creating a new thing, and someone else owns it, but won't do anything with it.
Removed in favour of placement new. I guess that also makes it a bit more natural to have separate storage for other nodes that have different lifetime (e.g. use a separate arena).

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:11
+//
+// The code is divided in the following way:
+//   - 'Tree/Cascade.h' defines the basic structure of the syntax tree,
----------------
sammccall wrote:
> As discussed offline:
>  - I don't think (at this point) we need an umbrella header. Generic tree structure, specific node semantics, and operations like "build a tree" are distinct enough from a user POV that asking them to include headers is fine. 
>  - We may want an umbrella for the node types, if we end up splitting that, but no need yet.
>  - Splitting generic tree stuff vs specific node stuff sounds good, but I think having them be sibling headers in `Tooling/Syntax` is enough - not sure about the `Tree/` subdirectory.
> 
> So I'd suggest something like:
>  - `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
>  - `Tree.h` -> `BuildTree.h`
>  - `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
> (have comments on some of these throughout)
Changed to the proposed header structure, it looks good.
Had to move `Leaf::classof` and `TreeNode::classof` into a `.cpp`, they need a definition of `NodeKind`. Keeping those in a header file was a micro-optimization anyway, that's probably not too important.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:74
+/// A composite tree node that has children.
+class TreeNode : public Node {
+public:
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > This use of "tree node" to mean specifically internal node seems confusing - is it common?
> > I don't think it's common, can use `CompositeNode` - seems like a better alternative
> What about Subtree?
`Subtree` seems ok, although `Composite` conveys the meaning better to my taste.

`Composite` does not seem to work without the `node` suffix, though, and we probably don't want the suffix in other nodes, so I'm torn on this.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:59
+/// stream.
+class Leaf final : public Node {
+public:
----------------
sammccall wrote:
> `syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the tree structure rather than the contents. (Unlike TreeNode/CompositeNode this is likely to be used for its specific semantics).
> 
> Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?
`syntax::Tokens` actually looks good, but we should rename the `tokens()` accessor somehow in that case.
I have only super-generic variants in my head: `elements()`, `items()`. Any better ideas?

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:64
+
+  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
+
----------------
sammccall wrote:
> you're going to want classof for each node type, a private copy constructor (for cloning), a friend statement to whatever does the cloning (or the clone() function itself, if it goes on the class...)
> 
> You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(
I'd avoid using macros. As long as the amount of boilerplate is small, it's not too annoying to write. 

And it is small for now, we only have a single constructor and a `classof` per node, so keeping it explicit in this patch seems ok.

We can revisit if more stuff pops up, but I think we do it without extra boilerplate per node for clone, and hopefully for other things too.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:1
+//===- NodeKind.h - an enum listing nodes of a syntax tree ----*- C++ -*-=====//
+//
----------------
sammccall wrote:
> Why is this separate from Nodes.h?
Not anymore. The original reason was that `Tree.h` need `NodeKind`  for implementing casts and the implementation was in a header.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:22
+/// For debugging purposes.
+llvm::StringRef toString(NodeKind K);
+
----------------
sammccall wrote:
> if you make this `operator<<`, then it's slightly more flexible I think (llvm::to_string also works).
> It's not as fast, but I don't think that matters here?
For some reason I thought `gdb` does not show the enum value names, so I added a named method to simplify debugging.

Turns out it does show the enum names, not just int values, so I'm perfectly happy to the stream output operator.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeList.h:1
+//===- NodeList.h ---------------------------------------------*- C++ -*-=====//
+//
----------------
sammccall wrote:
> Implementing custom containers is a bit sad.
> 
> Alternatives we discussed:
>  - adapt bumpptrallocactor to std allocator and use std::vector: wastes a pointer from vector->allocator
>  - use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be reallocated when children change (but *replacing* children is fine)
>  - use a linked-list representation instead: Node* Node::NextSibling, CompositeNode::FirstChild. This fits the allocation strategy more nicely. You probably need only single links, and can add an iterator if needed.
No more custom containers, explicit tree structure seems to be better.

================
Comment at: clang/lib/Tooling/Syntax/BuildFromAST.cpp:1
+//===- BuildFromAST.cpp ---------------------------------------*- C++ -*-=====//
+//
----------------
sammccall wrote:
> I haven't reviewed this file yet :-)
Please do!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637