[cfe-dev] Keeping invalid AST nodes

Tue Nov 6 14:27:29 PST 2012

On Nov 5, 2012, at 19:38, Douglas Gregor <dgregor at apple.com> wrote:

> 
> On Nov 5, 2012, at 10:02 AM, Eli Friedman <eli.friedman at gmail.com> wrote:
> 
>> On Mon, Nov 5, 2012 at 4:18 AM, Erik Verbruggen <erik.verbruggen at me.com> wrote:
>>> When using libclang for syntax highlighting, I noticed that no AST nodes get built for e.g. invalid expressions. For example, in the following C++ code:
>>> 
>>> int func(int i) {
>>>   int j = undefinedFunction(i) + i;
>>>   return j;
>>> }
>>> 
>>> the whole initialisation for int j gets dropped, including the (somewhat) valid references to i. I understand that this is done to keep the AST valid after parsing, but an IDE or refactoring tools cannot help out with fixing the problem in that snippet. My question: is there some way to keep the invalid AST nodes, or is somebody already working on this?
>> 
>> No, there's no way to keep them, and AFAIK nobody is working on this.
> 
> I agree with this…
> 
>> I'm also not sure it's worthwhile: it's a ton of work for a use case
>> which doesn't seem very important, given that the vast majority of
>> code people care about is valid.
> 
> 
> … but not necessarily this. You won't get any sympathy from me if you want to, say, refactor invalid code, since you could never validate the results. But syntax highlighting and related code queries do matter, and it makes sense to try to keep invalid AST nodes around.
> 
> There are several caveats. Obviously, we don't want to slow the compiler down. Moreover, we need to be careful not to keep invalid ASTs around for template instantiations, where we encounter invalid ASTs all the time but don't want to hold on to them.

Ok, I now understand that I phrased my question a bit too broadly. I do agree that for a compiler, invalid AST nodes are not interesting and should not be retained. In C++ with templates, one also has to be really careful about retaining invalid nodes. And clang already does a great job with fix-its, including the one that catches misspelled identifiers.

So I will narrow it down to the use case I am currently interested in, which is use in an IDE. More specifically, for files that are open, possibly even unsaved. (So not for indexing, but for cases where a (lib)clang client could easily pass a flag.) And yes, quite a bit of that code is incorrect, maybe because it's not finished yet, or because of a typo, or possibly some misunderstanding or other human error.

1. Highlighting, finding declarations, definitions and/or usages: most of the "surprises" here seem to be caused by the choice to drop whole expressions when parts are invalid. For example, if the lhs of a binary expression is invalid, the rhs gets dropped too. The same goes for the operands of unary operators and for argument lists in call expressions. With the exception of argument lists, I don't think the memory footprint would increase too much if only the invalid piece of the AST were dropped. Apart from the visual aspect, an IDE won't be able to help with finding things like (defined but invalid) parameters, incompatible right-hand sides, etc.

2. Probably a specific case of the above: if, in C++, an expression consists of only a function call, and that function is undeclared and undefined, all information about it gets dropped. If an IDE wanted to offer a declare-/define-function fix-it, it could not get back to any information about the call.

3. Invalid method/function calls: first of all, overloaded methods seem to do fine. But if a method is not declared/defined and is accessed through a member access operator or a qualifier, then the valid parts of the member access and of the qualifier get dropped. So, as with item 2, offering an "implement it" fix-it is, well, tedious. Finding the definition of the qualifier/lhs is also impossible, so a user cannot "jump there to see what is available".

4. If a C++ method is defined without being declared, the qualifier gets dropped (!). Implementing an insert-declaration-from-definition fix-it would require some re-parsing and some very fuzzy logic.

5. Not fully verified, but it looks like clang drops at least some information when a method definition does not match up with any declaration, or vice versa. Apart from the previous item, this can happen when the parameter list of a method gets changed. Now you could say that changing a method/function signature is a proper refactoring action, but most people "just do it and fix the other afterwards". I admit that this one is very, erm, fuzzy, but it happens so often that if an IDE can help out here, it is considered a "cool thing".
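
To make item 1 a bit more concrete, here is a toy model of the difference between dropping a whole expression and keeping its valid parts. This is only a sketch: Expr, hasError, keepOnlyIfValid, collectRefs and exampleInitializer are names I made up for illustration, and none of it is clang code or reflects how clang represents expressions internally.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Toy expression node. Invented for illustration; not clang's Expr.
struct Expr {
    enum Kind { DeclRef, Call, Binary, Unresolved };
    Kind kind;
    std::string name;        // variable name, callee name, or operator spelling
    std::vector<Expr> kids;  // operands or call arguments
};

// True if this node or any descendant failed to resolve.
bool hasError(const Expr &e) {
    if (e.kind == Expr::Unresolved) return true;
    for (const Expr &k : e.kids)
        if (hasError(k)) return true;
    return false;
}

// "Drop everything" policy: one bad sub-expression discards the whole tree.
std::optional<Expr> keepOnlyIfValid(Expr e) {
    if (hasError(e)) return std::nullopt;
    return e;
}

// Collect the names of all resolved variable references, in source order.
void collectRefs(const Expr &e, std::vector<std::string> &out) {
    assert(e.kind != Expr::DeclRef || e.kids.empty());  // a reference is a leaf
    if (e.kind == Expr::DeclRef) out.push_back(e.name);
    for (const Expr &k : e.kids) collectRefs(k, out);
}

// Models `undefinedFunction(i) + i` from the original example.
Expr exampleInitializer() {
    Expr i{Expr::DeclRef, "i", {}};
    Expr call{Expr::Unresolved, "undefinedFunction", {i}};
    return Expr{Expr::Binary, "+", {call, i}};
}
```

With the "drop everything" policy, keepOnlyIfValid(exampleInitializer()) yields nothing, so a client has nothing to highlight. If the tree is kept with the unresolved callee marked as such, collectRefs still finds both uses of i.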

A note about templates: all of the items mentioned above are without any templates involved. I know that templates do complicate things. But when talking with users, the general consensus seems to be that if templates cannot be handled for the cases mentioned above, it still covers more than three-quarters of the use-cases. And most seem to understand the complexity when templates are involved.

For anything other than item #1, I think some flag is needed in Stmt to indicate that, although a Stmt is invalid and must not be used for code generation or proper refactoring, it might still hold useful information for (lib)clang clients.
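
Roughly what I have in mind for such a flag, again as a toy sketch and not a patch: the Stmt struct below is a hypothetical stand-in (not clang's Stmt class), and containsErrors, makeStmt, usableForCodeGen, emitCode and exampleDecl are invented names. The assumption is that the flag propagates upwards from any invalid child, so code generation can reject a flagged node cheaply while tooling can still walk into it.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a statement node; not clang's Stmt.
struct Stmt {
    std::string text;             // source spelling, for illustration only
    bool containsErrors = false;  // the proposed flag
    std::vector<Stmt> children;
};

// Build a node; it is flagged if it is invalid itself or if any child is flagged.
Stmt makeStmt(std::string text, std::vector<Stmt> children = {},
              bool invalidHere = false) {
    bool err = invalidHere;
    for (const Stmt &c : children) err = err || c.containsErrors;
    return Stmt{std::move(text), err, std::move(children)};
}

// Consumers that need a valid AST check the flag first.
bool usableForCodeGen(const Stmt &s) { return !s.containsErrors; }

// Stand-in for code generation, which must never see a flagged node.
std::string emitCode(const Stmt &s) {
    assert(!s.containsErrors && "flagged nodes must not reach code generation");
    return s.text;
}

// Models `int j = undefinedFunction(i) + i;` from the original example.
Stmt exampleDecl() {
    Stmt call = makeStmt("undefinedFunction(i)", {makeStmt("i")},
                         /*invalidHere=*/true);
    Stmt sum = makeStmt("undefinedFunction(i) + i", {call, makeStmt("i")});
    return makeStmt("int j = undefinedFunction(i) + i;", {sum});
}
```

In this model the declaration of j is flagged and rejected by usableForCodeGen, but the second operand of the addition (the plain reference to i) is an unflagged node that an IDE could still resolve and highlight.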

So, again the question: feedback? :-) Where did I miss details, or are these too narrow use-cases to be useful to cover with clang as a front-end?

-- Erik.