[PATCH] D63835: [Syntax] Add nodes for most common statements

Wed Nov 6 01:55:16 PST 2019

ilya-biryukov added inline comments.

================
Comment at: clang/include/clang/Tooling/Syntax/Nodes.h:193
+  syntax::Statement *thenStatement();
+  syntax::Leaf *elseKeyword();
+  syntax::Statement *elseStatement();
----------------
sammccall wrote:
> ilya-biryukov wrote:
> > sammccall wrote:
> > > I think throughout it's important to mark which of these are:
> > >  - nullable in correct code
> > >  - nullable in code generated by recovery
> > I would suggest to only mark the nodes that are nullable in the correct code. For recovery, I would assume the following rule (please tell me if I'm wrong):
> > 
> > On a construct whose parsing involved recovery:
> > - if the node has an introducing token (`if`, `try`, etc.), the corresponding child cannot be null.
> > - any other child can be null.
> Agree with this strategy, and the fact that it doesn't need to be documented on every node/occurrence.
> 
> But it should definitely be documented somewhere at a high level! (With clang AST, this sort of thing feels like tribal knowledge)
Added a corresponding comment to the file header.

================
Comment at: clang/include/clang/Tooling/Syntax/Nodes.h:79
+  ExpressionStatement_expression,
+  CompoundStatement_statement
 };
----------------
sammccall wrote:
> As discussed offline, there's some options about how abstract/concrete these roles should be.
> 
> e.g. for a list of function args, this could be FunctionOpenParen/FunctionArgExpr/FunctionArgComma/FunctionCloseParam (specific) <-> OpenParen/Arg/Comma/CloseParen <-> Open/Item/Separator/Close.
> 
> The more specific ones are somewhat redundant with the parent/child type (but easy to assign systematically), and the more generic ones are more orthogonal (but require more design and may by hard to always make consistent).
> 
> The concrete advantage of the generic roles is being able to write code like `getTrailingSemicolon(Tree*)` or `findLoopBreak(Stmt*)` or `removeListItem(Tree*, int)` in a fairly generic way, without resorting to adding a `Loop` base class or handling each case with separate code.
> 
> This is up to you, though.
I definitely agree that writing generic functions is simpler with the proposed approach.

However, I am aiming for safer APIs here, albeit less generic. E.g. we'll have something like 
`removeFunctionArgument(ArgumentList*, int)` and `removeInitializer(InitializerList*, int)`
rather than `removeListItem(Tree*, int)` in the public API.

Reasons are discoverability of the operations for particular node types.

Generic functions might still make sense as an implementation detail to share the code.
I'll keep as is for now, but will keep the suggestion in mind.

================
Comment at: clang/include/clang/Tooling/Syntax/Nodes.h:265
+  syntax::Leaf *returnKeyword();
+  syntax::Expression *value();
+};
----------------
sammccall wrote:
> nullable, marked somehow
> 
> Optional<Expression*> is tempting as a systematic and hard-to-ignore way of documenting that.
> And it reflects the fact that there are three conceptual states for children: present, legally missing, brokenly missing.
> 
> At the same time, I'm not sure how to feel about the fact that in practice this can't be present but null, and the fact that *other* non-optional pointers can be null.
Having `Optional<Expression*>` models the problem space better, but is much harder to use on the client side.
I'd keep as is, the file comment explains that one should assume all accessors can return null.

Update the comment here to indicate both `return;` and `return <expr>;` are represented by this node.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:99
+  /// semicolon when needed.
   llvm::ArrayRef<syntax::Token> getRange(const Stmt *S) const {
+    auto Tokens = getRange(S->getBeginLoc(), S->getEndLoc());
----------------
sammccall wrote:
> since Expr : Stmt, we need to be a bit wary of overloading based on static type.
> 
> It's tempting to say it's correct here: if we statically know E is an Expr, then maybe it's never correct to consume the semicolon. But is the converse true? e.g. if we're traversing using RAV and call getRange() in visitstmt...
> 
> (The alternatives seem to be a) removing the expr version of the function, and having the stmt version take a `bool ConsumeSemi` or b) change the stmt version to have (dynamic) expr behave like the expr overload, and handle it specially when forming exprstmt. More verbose, genuinely conflicted here)
Using two functions with different names now.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:270
+
+  bool TraverseCXXForRangeStmt(CXXForRangeStmt *S) {
+    // We override to traverse range initializer as VarDecl.
----------------
sammccall wrote:
> maybe group with corresponding `WalkUpFromCXXForRangeStmt`?
> (Could also group all `Traverse*` together if you prefer. Current ordering seems a little random)
Went for grouping `Traverse*` and `Walk*` together.
Normally would also choose to put related methods (i.e. `WalkUpFromX` and `TraverseX`) together, but they have totally different meaning here: `TraverseX` creates a workaround for suboptimal RAV traversals and `WalkUpFromX` actually builds the syntax tree.

================
Comment at: clang/lib/Tooling/Syntax/BuildTree.cpp:272
+    // We override to traverse range initializer as VarDecl.
+    // RAT traverses it as a statement, we produce invalid node kinds in that
+    // case.
----------------
sammccall wrote:
> nit: RAV?
Right, thanks. `RAT` was funnier, though. Even if incorrect...

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D63835/new/

https://reviews.llvm.org/D63835