[cfe-dev] Should we build semantically invalid nodes?

Sun Oct 26 10:23:47 PDT 2008

Chris Lattner wrote:
> It seems to me that it comes down to the clients that are the ultimate  
> consumers of this information.  Since Sema is perfectly fine for  
> correct code, lets ignore all clients that require well-formed code  
> (e.g. codegen, refactoring, etc)

Refactoring, as I see it, doesn't require well-formed code. For example, 
"rename this parameter" doesn't care whether each use is well-formed; it 
just wants to find every appearance of the parameter in the function, 
even if the parameter is used in an invalid reinterpret_cast.

>  and those that aren't harmed by  
> requiring it (static analysis).  These clients are incidentally the  
> ones that are doing "deep analysis" of the AST and really benefit from  
> having a lot of invariants in the AST that absolutely must be true for  
> sanity.  Lack of these invariants would require sprinkling their  
> (incredibly non-trivial) code with lots of special cases and hacks  
> that I'd really like to avoid.
>   

Good point, I agree about Sema not accepting invalid nodes.

> Another set of clients are things like "indexers" that want to find  
> all the function definitions and global variables so you can "click on  
> a function and jump to its definition".  For this sort of use, a  
> simple actions module plugging into the parser is just fine.
>   

Hmm, I don't quite understand how this can be simple. Are you talking 
about building only declarations and not expressions?
Say you want all references to a global variable in the program: how 
are you going to find them without building a full AST of the program, 
including the expressions?

> What sort of clients would benefit substantially from a broken and  
> partially formed AST?

There's a difference between a program with broken "syntax" (the Parser 
doesn't accept it) and one with broken "semantics" (Sema rejects it).
"reinterpret_cast<int>(x)" is correct syntax but with broken semantics.
There's a lot of benefit in getting an AST that represents the syntax of 
the program: "reinterpret_cast<int>(x)" conveys the information that a 
reinterpret_cast is using the variable 'x' at this source location.
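
As a rough illustration of the point above, here is a minimal sketch of a 
syntactic tree that still records the use of 'x' inside a semantically 
invalid cast. All the names (Node, SourceLoc, makeDeclRef, 
makeReinterpretCast, findReferences) are hypothetical and are not Clang's 
actual classes; the "semanticallyValid" flag simply stands in for whatever 
Sema would conclude.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration; NOT Clang's real AST.
struct SourceLoc {
    unsigned line;
    unsigned column;
};

struct Node {
    enum Kind { DeclRef, ReinterpretCast };
    Kind kind = DeclRef;
    SourceLoc loc{0, 0};
    std::string name;               // DeclRef: variable name; cast: target type as written
    bool semanticallyValid = true;  // stand-in for the verdict of a Sema-like pass
    std::vector<std::unique_ptr<Node>> children;
};

// Build a node for a use of a variable at a given location.
std::unique_ptr<Node> makeDeclRef(std::string name, SourceLoc loc) {
    auto n = std::make_unique<Node>();
    n->kind = Node::DeclRef;
    n->name = std::move(name);
    n->loc = loc;
    return n;
}

// Build a cast node; it is kept in the tree even when 'valid' is false.
std::unique_ptr<Node> makeReinterpretCast(std::string targetType,
                                          std::unique_ptr<Node> operand,
                                          SourceLoc loc, bool valid) {
    auto n = std::make_unique<Node>();
    n->kind = Node::ReinterpretCast;
    n->name = std::move(targetType);
    n->loc = loc;
    n->semanticallyValid = valid;
    n->children.push_back(std::move(operand));
    return n;
}

// Collect the location of every reference to 'name', regardless of
// whether the enclosing expression is semantically valid.
void findReferences(const Node& n, const std::string& name,
                    std::vector<SourceLoc>& out) {
    if (n.kind == Node::DeclRef && n.name == name)
        out.push_back(n.loc);
    for (const auto& child : n.children) {
        assert(child && "children are never null in this sketch");
        findReferences(*child, name, out);
    }
}
```

A "rename" or "find references" client would call findReferences on the 
tree and get the location of 'x' back even though the cast node is marked 
invalid; that information is lost if the node is never built.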

>   If we really wanted this sort of thing, it  
> seems like it would be cleanest to do what Steve said: define a new  
> actions module that just builds an AST (which can even use the same or  
> an extended set of nodes as Sema) but doesn't do any real checks,  
> doesn't assign types, etc.  At this point, you have more parse tree  
> than an AST.

This will be a maintenance burden; I'm pretty sure such an action 
module will eventually bitrot and become irrelevant since all the focus 
will be on the Sema AST.
The current AST has lots of syntactic information (apart from the 
missing "TypeSpecifier" node); there's no need for another one.
If it's possible to combine an ASTBuilder action with the Sema action, 
as I suggest here:
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2008-October/003125.html
it will result in an ASTBuilder that produces the syntactic AST, and a 
Sema that uses it, emits the necessary diagnostics, and possibly 
rejects invalid nodes. It may even help in the maintainability department.
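
To make the layering concrete, here is a very rough sketch of the idea, 
not the design from the linked message. Everything in it is an assumption 
for illustration: the CastExpr struct, the actOnCast signature, the 
keepInvalidNodes switch, and especially isValidCast, which is a toy rule 
standing in for the real semantic checks.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node: a cast expression exactly as written in the source.
struct CastExpr {
    std::string castKind;     // e.g. "reinterpret_cast"
    std::string targetType;   // e.g. "int"
    std::string operandName;  // e.g. "x"
    std::string operandType;  // e.g. "float" (assumed known from the declaration)
};

// Purely syntactic layer: builds the node and performs no checks.
struct ASTBuilder {
    std::unique_ptr<CastExpr> actOnCast(std::string kind, std::string target,
                                        std::string operand,
                                        std::string operandType) {
        return std::unique_ptr<CastExpr>(new CastExpr{
            std::move(kind), std::move(target), std::move(operand),
            std::move(operandType)});
    }
};

// Toy stand-in for real semantic checking: accept a reinterpret_cast only
// when the types are identical or both are spelled as pointers.
inline bool isPointer(const std::string& type) {
    return !type.empty() && type.back() == '*';
}

inline bool isValidCast(const CastExpr& e) {
    if (e.castKind != "reinterpret_cast")
        return true;
    return e.targetType == e.operandType ||
           (isPointer(e.targetType) && isPointer(e.operandType));
}

// Semantic layer: reuses ASTBuilder for construction, then diagnoses and
// optionally rejects the node.
struct Sema {
    ASTBuilder builder;
    bool keepInvalidNodes;
    std::vector<std::string> diagnostics;

    explicit Sema(bool keep) : keepInvalidNodes(keep) {}

    std::unique_ptr<CastExpr> actOnCast(std::string kind, std::string target,
                                        std::string operand,
                                        std::string operandType) {
        auto node = builder.actOnCast(std::move(kind), std::move(target),
                                      std::move(operand),
                                      std::move(operandType));
        if (!isValidCast(*node)) {
            diagnostics.push_back("invalid " + node->castKind + " from '" +
                                  node->operandType + "' to '" +
                                  node->targetType + "'");
            if (!keepInvalidNodes)
                return nullptr;  // reject: clients never see the node
        }
        return node;
    }
};
```

With Sema(false), clients like codegen never see an invalid node, so the 
invariants Chris wants are preserved; with Sema(true), or with ASTBuilder 
used on its own, a syntactic client still gets the node. Node construction 
lives in one place either way.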

-Argiris