[cfe-dev] Should we build semantically invalid nodes?

Thu Oct 23 09:18:34 PDT 2008

Doug Gregor wrote:
> Yes, there is. It means that consumers of these nodes have to be able
> to cope with potentially bad inputs. Then, each semantic-checking
> routine needs to cope with (1) all correct inputs, (2) all incorrect
> inputs that could come from a program that was well-formed up until
> this point, and (3) all broken inputs that come from previously
> ill-formed code. The standard always tells us what (1) and (2) can be
> (sometimes directly, sometimes through exclusion), but (3) is an
> almost unbounded set of bogus input depending on the twisted
> imagination of our users and our ability to mangle bad inputs into
> other bad inputs.
>   

A semantically invalid expression will produce a diagnostic. At that
point the program is ill-formed; the worst that can happen is that
semantic checks against this expression produce more diagnostics.
This is exactly analogous to the effect of invalid decls.

>   
>> It will not affect clients that care about semantics since they will not
>> analyse the AST if there are errors (there could be invalid decls).
>> The clients that only care about the textual representation of the program,
>> and not about the semantics, will get seriously impacted by not building the
>> nodes, e.g. a refactoring client will not be able to pickup the use of a
>> variable because the expression that references it is not added to the AST.
>>     
>
> If a client doesn't care about semantics, it can either (1) let Sema
> do its work and then ignore the resulting types, or (2) provide a
> different Action that doesn't do the type-checking.
>   

Sema is *huge*, and the alternative Action option is not realistic
(another Action that deals with templates? ;). Sema is what almost all
clients will use.
There will be clients that care only about the syntax tree, and Sema is
fully capable of servicing them too.

>   
>> To mirror the decls marked as "invalid" we could have exprs marked as
>> "invalid" too.
>>     
>
> GCC does this, where essentially any node in the AST can be
> "error_mark_node" if there was an error. A *lot* of code in GCC is
> dedicated to checking for and propagating error_mark_node; it probably
> ends up costing them performance, but the real cost is that they end
> up fixing a lot of tiny "ice-on-invalid" bugs where someone forgot to
> check for some outlandish ill-formed code that results in a weird AST
> node or an error_mark_node where it isn't expected. It's always seemed
> like a losing battle to me, and the code is littered with
> error_mark_node checks.
>   

I think the correct Clang analogy here is DeclResult/ExprResult, which
get propagated around and must be checked before use.
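
To make the "must be checked before use" point concrete, here is a
minimal sketch of such a result wrapper. It is only loosely in the
spirit of ExprResult: the class layout, the actOnParenExpr function and
the Expr stand-in are all hypothetical, not Clang's actual code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an AST expression node; not Clang's Expr class.
struct Expr {
    std::string text;
};

// Hypothetical result wrapper: it carries either a node or an "invalid"
// flag, and callers are expected to test the flag before using the node.
class ExprResult {
    Expr *node_;
    bool invalid_;

public:
    explicit ExprResult(Expr *node) : node_(node), invalid_(false) {}

    // Builds a result that represents "an error was already diagnosed".
    static ExprResult error() {
        ExprResult r(nullptr);
        r.invalid_ = true;
        return r;
    }

    bool isInvalid() const { return invalid_; }

    Expr *get() const {
        assert(!invalid_ && "result used without checking isInvalid()");
        return node_;
    }
};

// Hypothetical semantic action: wraps the operand in parentheses, and
// passes an error straight through instead of building a node.
ExprResult actOnParenExpr(ExprResult operand, Expr &storage) {
    if (operand.isInvalid())
        return ExprResult::error();
    storage.text = "(" + operand.get()->text + ")";
    return ExprResult(&storage);
}
```

The cost is one isInvalid() check per action, which is the same kind of
check the error_mark_node approach needs.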

I don't see what is so bad about separating syntax from semantics.
-An expression node is produced for a syntactic construct.
-This expression node already conveys useful information about the
program structure.
-Semantic checks are done on it and diagnostics are emitted.
-Now why should we discard the syntactic information? This is a
concrete expression with an actual type (even if it got its type
illegally according to the language rules), so it won't lead to crashes,
just to possibly more diagnostics.
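
The steps above can be sketched roughly as follows. This is a toy AST
under my own assumptions, not Clang code: Expr, VarRef, AddExpr,
buildAdd and collectVarRefs are all hypothetical names. The semantic
check emits a diagnostic and marks the node invalid, but the node is
still built, so a client that only cares about syntax can still find
every variable use.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified AST; none of these names are Clang's.
struct Expr {
    std::string type;      // e.g. "int" or "struct S"
    bool invalid = false;  // set when a semantic check failed
    virtual ~Expr() = default;
};

struct VarRef : Expr {
    std::string name;
    VarRef(std::string n, std::string t) : name(std::move(n)) {
        type = std::move(t);
    }
};

struct AddExpr : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    AddExpr(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {
        assert(lhs && rhs && "operands must be non-null");
    }
};

// Semantic action for "l + r": diagnoses non-int operands, but builds the
// node anyway and marks it invalid instead of discarding it.
std::unique_ptr<Expr> buildAdd(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r,
                               std::vector<std::string> &diags) {
    bool badOperands = l->type != "int" || r->type != "int";
    bool operandInvalid = l->invalid || r->invalid;
    auto node = std::make_unique<AddExpr>(std::move(l), std::move(r));
    node->type = "int";  // the node still gets a concrete type
    if (badOperands)
        diags.push_back("invalid operands to '+'");
    node->invalid = badOperands || operandInvalid;
    return node;
}

// A purely syntactic client: collects every variable name referenced in the
// tree, ignoring the invalid flag entirely.
void collectVarRefs(const Expr &e, std::vector<std::string> &out) {
    if (auto *v = dynamic_cast<const VarRef *>(&e)) {
        out.push_back(v->name);
    } else if (auto *a = dynamic_cast<const AddExpr *>(&e)) {
        collectVarRefs(*a->lhs, out);
        collectVarRefs(*a->rhs, out);
    }
}
```

A client that does care about semantics would check the invalid flag (or
the diagnostic count) and skip the tree, while a refactoring client
calling collectVarRefs still sees both operands of the bad addition.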

AFAIK, Clang produces Decls even for semantically illegal declarations
(they are even added to scope) and there don't seem to be any issues
attributed to them.

-Argiris