[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach

Sun Aug 24 14:16:18 PDT 2008

On Aug 23, 2008, at 8:37 PM, Eli Friedman wrote:
> On Sat, Aug 23, 2008 at 6:22 PM, Chris Lattner <clattner at apple.com>  
> wrote:
>> This is conceptually an extremely clean approach, because it  
>> decouples
>> disambiguation from actual parsing.  However, this has two big
>> downsides: 1) maintenance: it  duplicates a lot of parsing rules,
>> because now we have to be able to "tentatively parse" and "really
>> parse" all the declaration stuff.  2) performance: in addition to the
>> basic cost of backtracking this has to duplicate parsing, and some
>> amount of sema (type lookups etc).
>>
>> To me, these are pretty big drawbacks, and I want to make sure we're
>> all on the same page and agree that this is the right approach before
>> we go much farther.
>
> (1) is definitely an issue... it's a balancing act.  I won't try to
> comment on it because I don't know the difficulty of the alternative.

Right, this whole area is a balancing act... there isn't really a  
clear "right" way to go, which is why discussing it is useful :)

> For (2), the current code doesn't try to avoid activating
> backtracking; we can easily avoid it for common cases depending on how
> expensive it is to activate backtracking.

Activating backtracking isn't hugely expensive, but it certainly isn't  
free.  Also, scanning ahead and buffering tokens certainly isn't free,  
so we should avoid it when possible.  One of the costs is the Sema/ 
Type resolution that has to happen in the preparser.  For example,  
silly things like:

   std::vector<bool>::iterator ...

require the preparser to do template specialization etc to look up  
what "iterator" in vector<bool> is.  Immediately after the preparser  
decides that it is a type, the decl parser kicks in and has to do all  
the same resolution stuff.

> The only cases which need
> the statement vs. expression disambiguation are those starting with T(
> and N::T(, where T is a type. We can catch all the cases which don't
> start with a complicated type with only a single token of lookahead.

Right.

> The current version is already quite optimized in terms of the
> distance it looks ahead; it tries to cut out as soon as possible, and
> most of the time it shouldn't end up looking very far ahead.

Yep, Argiris did a nice job with his implementation of the approach.

> If an identifier changes from a type to an identifier in the middle of
> a statement, and that affects disambiguation, the program is
> ill-formed per the standard; we can just assume for disambiguation
> purposes that anything that isn't a type is a valid variable.  And for
> initializers, I'm pretty sure the rules allow skipping forward to the
> next comma/semicolon once we see an = sign; what follows the = sign is
> just a regular assignment-expression, so it doesn't matter in terms of
> disambiguation.

Ok, great!  Who knew that there would be *some* sanity in C++? :-)

>> Also, if the ambiguous
>> statement ends up being an expression, your approach would be  
>> superior
>> in space and time.
>
> If we want to know what the balance is here, we'll really have to
> benchmark; hopefully the cases where we have to lookahead more than a
> few tokens are rare, in which case performance isn't so much of an
> issue.

It would be very interesting to hack the GCC front-end and run it over  
some code base to find out how much it ends up speculating wrong.

After thinking about this some more and sleeping on it, I think that  
GCC's approach (tentatively parsing something as a decl, then falling  
back to expr on error) is really the best approach.  The high level  
reason for this is that I think that almost all of the ambiguous cases  
will end up being declarations in practice, so doing a "lookahead"  
prepare is wasted effort compared to just parsing as a decl and  
undoing if it really is a statement.

The basic reason I think this is that I think almost all of the common  
expression cases can be easily disambiguated without any[1]  
lookahead.  For example, here are some silly but important cases:

   if (...)    // decl can't start with 'if' keyword
   X = 4;      // X isn't a typename
   foo(4);     // foo isn't a typename
   I->foo();   // I isn't a typename
   cout << ..  // cout isn't a typename

For fun, I just scanned through the source for instcombine, looking at  
the top level statements.  Almost all of them would be rejected as  
"not decls" very trivially.  This gives me good confidence that  
backtracking won't even enter the picture for most expressions and  
statements.  However, every variable definition (obviously) starts  
with a type:

   int X = ...
   Value *V = ..

This means that every variable definition will require starting  
backtrack bookkeeping, doing a preparse (even if not very far forward)  
then then deciding "yep, it's a decl", backtracking, and then  
reparsing as a decl.  This seems like a pretty significant cost to pay  
for these common cases, and I think the "speculatively parse as a decl  
if ambiguous and back off later" approach is better.

There is one more thing I want to bring (back) up, w.r.t. "[1]"  
above.  The problem is that there is a significant class of things  
where we need more than one token of lookahead to resolve and these  
are commonly exprs.  This is the qualified name case:

   std::cout << ...   // is an expr
   UndefValue::get(Ty) // static method is an expr

which is very ambiguous with the decl case:

   std::vector<int> Y
   BasicBlock::iterator X = ...
   ICmpInst::Predicate NewPred =

The LLVM codebase as a whole doesn't use *too* much namespace  
qualification, because everything is in the llvm namespace and we  
don't use std::cout at all.  However, a lot of codebases use  
namespaces heavily.  With the "speculate it is a decl" approach, we  
would handle the decl cases well (which are the majority of the cases  
for the llvm codebase) but we lose (compared to the 'preparse to  
disambiguate' approach) for the qualified expr case.

I think the solution to this (which I know Argiris doesn't like :) is  
to resolve the qualified type *before* making a decision about whether  
it is a statement or a decl.  This makes the logic a little more  
complex (we need to bring back the "parse expr with leading  
identifier" logic) but it would very neatly solve this problem in just  
about every real case, and it would mean that we aren't re-resolving  
and sema-ing types in any of the common cases here.

This is clearly something that we can discuss more and do as a later  
separate step, but I just wanted to throw it out there for kicks. :)

Argiris, what do you think of the "tentatively parse as a decl"  
approach?

-Chris