[PATCH] D61637: [Syntax] Introduce syntax trees

Tue May 21 07:25:00 PDT 2019

sammccall added a comment.

Definitely like the choice of CompositeNode owning the concrete storage!

================
Comment at: clang/include/clang/Tooling/Syntax/Arena.h:1
+//===- Arena.h - memory arena and bookkeeping for syntax trees --- C++ -*-===//
+//
----------------
>From a user's point of view, this looks a lot like part of the tree structure in some sense.

If you expect that users will need to keep the arena rather than the TokenBuffer around (e.g. so nodes can be allocated), then it might make sense to declare it at the bottom of `Cascade.h`

================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:38
+  std::pair<FileID, llvm::ArrayRef<syntax::Token>>
+  tokenizeBuffer(std::unique_ptr<llvm::MemoryBuffer> Buffer);
+
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Are you planning to have a way to add tokens directly? Having to turn them into text and re-lex them seems like it might be inconvenient.
> The tokens have source locations and refer to a text in some buffer. `tokenizeBuffer` makes is easier, not harder, to mock tokens.
Fair enough.

`lexBuffer` might be a slightly clearer name?

Who are the intended non-test users of this function? Are they better served by being able (and responsible) for constructing a MemoryBuffer with e.g. a sensible name and ownership, or would it be better to pass a StringRef and have the Arena come up with a sensible "anonymous" name?

================
Comment at: clang/include/clang/Tooling/Syntax/Corpus.h:40
+
+  /// Construct a new syntax node of a specified kind. The memory for a node is
+  /// owned by the corpus and will be freed when the corpus is destroyed.
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > Now there's two ways to do this: `new (C.allocator()) T(...)` or `C.construct<T>(...)`. Do we need both?
> > 
> > (If we do, is the syntax `new (C) T(...)` more natural?)
> I think `C.construct<T>()` read better than `new (C) T(...)`. Not a big fan of placement new exprs.
They are fairly consistently used in llvm/clang for this sort of thing, though.

I find it valuable because arena allocations look otherwise a lot like buggy ownership patterns. The dedicated syntax calls out the unusual case: we're creating a new thing, and someone else owns it, but won't do anything with it.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree.h:11
+//
+// The code is divided in the following way:
+//   - 'Tree/Cascade.h' defines the basic structure of the syntax tree,
----------------
As discussed offline:
 - I don't think (at this point) we need an umbrella header. Generic tree structure, specific node semantics, and operations like "build a tree" are distinct enough from a user POV that asking them to include headers is fine. 
 - We may want an umbrella for the node types, if we end up splitting that, but no need yet.
 - Splitting generic tree stuff vs specific node stuff sounds good, but I think having them be sibling headers in `Tooling/Syntax` is enough - not sure about the `Tree/` subdirectory.

So I'd suggest something like:
 - `Tree/Cascade.h` + `Arena.h` --> `Tree.h`
 - `Tree.h` -> `BuildTree.h`
 - `Tree/Nodes.h` + `NodeKind.h` --> `Nodes.h`
(have comments on some of these throughout)

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:1
+//===- Tree.h - cascade of the syntax tree --------------------*- C++ -*-=====//
+//
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > sammccall wrote:
> > > sammccall wrote:
> > > > this is Cascade.h, not tree.h
> > > why "cascade"?
> > The Tree/ subdirectory seems superfluous - why are these separate from Syntax/?
> Cascade defines a few base nodes: a composite node (`TreeNode`) and a leaf node that holds tokens.
> I'd really like to isolate them from language-specific nodes, so language-specific nodes live in a separate file (`Nodes.h`).
> However, they need to see the definition of a composite node, hence the split.
> 
> Users are advised to use an umbrella header, `Tree.h`. The extra directory is to minimize the number of headers in the top-level directory, having too many is confusing.
I like the separation - my concern here was the specific word "Cascade".
I'd suggest "Tree" here as it really does define the structure.
The existing "Tree.h" defines operations, and can be named after them. As discussed, I think the design is clean and doesn't need to be hidden by an umbrella header.

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:74
+/// A composite tree node that has children.
+class TreeNode : public Node {
+public:
----------------
ilya-biryukov wrote:
> sammccall wrote:
> > This use of "tree node" to mean specifically internal node seems confusing - is it common?
> I don't think it's common, can use `CompositeNode` - seems like a better alternative
What about Subtree?

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:39
+public:
+  Node(NodeKind Kind) : Parent(nullptr), Kind(Kind) {}
+  NodeKind kind() const { return Kind; }
----------------
maybe add a comment - newly created nodes have no parent until added to one

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:59
+/// stream.
+class Leaf final : public Node {
+public:
----------------
`syntax::Leaf` vs `syntax::Struct` seems a little odd - it talks about the tree structure rather than the contents. (Unlike TreeNode/CompositeNode this is likely to be used for its specific semantics).

Maybe `syntax::Tokens` (though the s is subtle). `syntax::Text`?

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/Cascade.h:64
+
+  static bool classof(const Node *N) { return N->kind() == NodeKind::Leaf; }
+
----------------
you're going to want classof for each node type, a private copy constructor (for cloning), a friend statement to whatever does the cloning (or the clone() function itself, if it goes on the class...)

You may want to put this behind a DEFINE_NODE_BOILERPLATE(Leaf) macro :-(

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:1
+//===- NodeKind.h - an enum listing nodes of a syntax tree ----*- C++ -*-=====//
+//
----------------
Why is this separate from Nodes.h?

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeKind.h:22
+/// For debugging purposes.
+llvm::StringRef toString(NodeKind K);
+
----------------
if you make this `operator<<`, then it's slightly more flexible I think (llvm::to_string also works).
It's not as fast, but I don't think that matters here?

================
Comment at: clang/include/clang/Tooling/Syntax/Tree/NodeList.h:1
+//===- NodeList.h ---------------------------------------------*- C++ -*-=====//
+//
----------------
Implementing custom containers is a bit sad.

Alternatives we discussed:
 - adapt bumpptrallocactor to std allocator and use std::vector: wastes a pointer from vector->allocator
 - use ArrayRef<Node*> or ArrayRef<Node> and require the whole list to be reallocated when children change (but *replacing* children is fine)
 - use a linked-list representation instead: Node* Node::NextSibling, CompositeNode::FirstChild. This fits the allocation strategy more nicely. You probably need only single links, and can add an iterator if needed.

================
Comment at: clang/lib/Tooling/Syntax/BuildFromAST.cpp:1
+//===- BuildFromAST.cpp ---------------------------------------*- C++ -*-=====//
+//
----------------
I haven't reviewed this file yet :-)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61637/new/

https://reviews.llvm.org/D61637