[cfe-dev] Should we build semantically invalid nodes?

Thu Oct 23 11:12:09 PDT 2008

On Thu, Oct 23, 2008 at 9:18 AM, Argiris Kirtzidis <akyrtzi at gmail.com> wrote:
> Doug Gregor wrote:
>>
>> Yes, there is. It means that consumers of these nodes have to be able
>> to cope with potentially bad inputs. Then, each semantic-checking
>> routine needs to cope with (1) all correct inputs, (2) all incorrect
>> inputs that could come from a program that was well-formed up until
>> this point, and (3) all broken inputs that come from previously
>> ill-formed code. The standard always tells us what (1) and (2) can be
>> (sometimes directly, sometimes through exclusion), but (3) is an
>> almost unbounded set of bogus input depending on the twisted
>> imagination of our users and our ability to mangle bad inputs into
>> other bad inputs.
>>
>
> A semantically invalid expression will produce a diagnostic. At this point
> the program is ill-formed, the worst that can happen is that semantic checks
> against this expression will produce more diagnostics.
> This is exactly similar to the effect of invalid decls.

No, the worst that can happen is that the compiler will crash, because
the semantic analysis has been given something that's completely
broken that it never should have to deal with.

Note how we handle invalid declarations: we mark them as invalid, and
then if we try to reference them in an expression, ActOnIdentifierExpr
refuses to build a DeclRefExpr. That keeps those invalid declarations
for causing malformed expression nodes from being constructed.

>> If a client doesn't care about semantics, it can either (1) let Sema
>> do its work and then ignore the resulting types, or (2) provide a
>> different Action that doesn't do the type-checking.
>>
>
> Sema is *huge* and the alternative Action option is not realistic (another
> Action that deals with templates ? ;), this is what almost all of clients
> will use.
> There will be clients that care only about the syntax tree, Sema is fully
> capable of servicing them too.

Okay, that's my option (1), then. We don't really want to cater to
clients trying to work on ill-formed code, do we?

>> GCC does this, where essentially any node in the AST can be
>> "error_mark_node" if there was an error. A *lot* of code in GCC is
>> dedicated to checking for and propagating error_mark_node; it probably
>> ends up costing them performance, but the real cost is that they end
>> up fixing a lot of tiny "ice-on-invalid" bugs where someone forgot to
>> check for some outlandish ill-formed code that results in a weird AST
>> node or an error_mark_node where it isn't expected. It's always seemed
>> like a losing battle to me, and the code is littered with
>> error_mark_node checks.
>>
>
> I think the correct Clang analogy here is DeclResult/ExprResult which gets
> propagated around and must be checked before use.

That's not the right analogy. DeclResult/ExprResult are only used to
communicate between the semantic analysis and the parser, and they
never get into the AST. If we have the notion of "invalid" expressions
in our ASTs, such that a Sema function can receive an invalid input,
then we'll be in the same boat that GCC developers are with
error_mark_node. I've been there, and it's a very bad place to be.

> I don't see what is so bad about separating syntax from semantics.
> -An expression node is produced for a syntactic construct.
> -This expression node already conveys useful information about the program
> structure.
> -Semantic checks are done to it and diagnostics are emitted.
> -Now why should we discard the syntactic information ? This is a concrete
> expression with an actual type (even if it got it's type illegally according
> to the language rules), so it won't lead to crashes, just to possible more
> diagnostics.

Oh, it will lead to crashes, because it opens up the semantic analysis
to inputs that can never make sense. An expression of function type
that isn't a reference to a function declaration? Nonsense, but it
could happen if we allow reinterpret_cast<int(void)>(blah) to get its
own AST node (even if it is marked as invalid). We can pretend that
we're good enough to build a compiler that's robust against all of
these bogus inputs, but it's just not going to happen. It's far better
to only build semantically well-formed AST nodes.

> AFAIK, Clang produces Decls even for semantically illegal declarations (they
> are even added to scope) and there doesn't seem to be issues attributed to
> them.

This is necessary for some level of error recovery, but it's not
something we should propagate through the compiler.

  - Doug
  - Doug