[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach

Sun Aug 24 15:29:25 PDT 2008

Chris Lattner wrote:
>
> Activating backtracking isn't hugely expensive, but it certainly isn't 
> free.  Also, scanning ahead and buffering tokens certainly isn't free, 
> so we should avoid it when possible.  One of the costs is the 
> Sema/Type resolution that has to happen in the preparser.  For 
> example, silly things like:
>
>   std::vector<bool>::iterator ...
>
> require the preparser to do template specialization etc to look up 
> what "iterator" in vector<bool> is.  Immediately after the preparser 
> decides that it is a type, the decl parser kicks in and has to do all 
> the same resolution stuff.

The most efficient approach is to do template specialization and other 
resolution stuff once, right ? Even with the "tentatively parse as a 
decl" approach you'd prefer to not repeat resolutions when going for 
"expression parsing".
How about caching the resolution stuff. The way it would work is, before 
going to Sema to do type resolutions, the parser would check if the type 
resolution was made by previous calls to Sema.(instead of calling 
Action.isTypeName directly, there would be a Parser::IsTypename method 
to take care of this)
The preparser would do qualified type checks and template 
specializations but the normal parser would not have to re-do them again.

>
>
> After thinking about this some more and sleeping on it, I think that 
> GCC's approach (tentatively parsing something as a decl, then falling 
> back to expr on error) is really the best approach.  The high level 
> reason for this is that I think that almost all of the ambiguous cases 
> will end up being declarations in practice, so doing a "lookahead" 
> prepare is wasted effort compared to just parsing as a decl and 
> undoing if it really is a statement.
>
> The basic reason I think this is that I think almost all of the common 
> expression cases can be easily disambiguated without any[1] 
> lookahead.  For example, here are some silly but important cases:
>
>   if (...)    // decl can't start with 'if' keyword
>   X = 4;      // X isn't a typename
>   foo(4);     // foo isn't a typename
>   I->foo();   // I isn't a typename
>   cout << ..  // cout isn't a typename
>
> For fun, I just scanned through the source for instcombine, looking at 
> the top level statements.  Almost all of them would be rejected as 
> "not decls" very trivially.  This gives me good confidence that 
> backtracking won't even enter the picture for most expressions and 
> statements.  However, every variable definition (obviously) starts 
> with a type:
>
>   int X = ...
>   Value *V = ..
>
> This means that every variable definition will require starting 
> backtrack bookkeeping, doing a preparse (even if not very far forward) 
> then then deciding "yep, it's a decl", backtracking, and then 
> reparsing as a decl.  This seems like a pretty significant cost to pay 
> for these common cases, and I think the "speculatively parse as a decl 
> if ambiguous and back off later" approach is better.

I think that you have a misunderstanding about what the ambiguous cases 
are. The ambiguous cases are those where a type is followed by a '('. 
These cases:
  int X = ...
  Value *V = ..

and the vast majority of declarations are not ambiguous at all, so no 
preparsing, backtracking etc. is needed for them, just a one-token 
lookahead.
int X =  // not ambiguous
int (X) = // ambiguous
const int (X) // not ambiguous because it starts with 'const'

So your assumption that "almost all of the ambiguous cases will end up 
being declarations" is not true, in fact, if I had to guess, I'd say 
that the balance would probably lean towards the expression side.
Most of the declarations that are of the T(..) variety, in practice, are 
function pointer declarations and these are not that common inside 
functions.

>
> There is one more thing I want to bring (back) up, w.r.t. "[1]" 
> above.  The problem is that there is a significant class of things 
> where we need more than one token of lookahead to resolve and these 
> are commonly exprs.  This is the qualified name case:
>
>   std::cout << ...   // is an expr
>   UndefValue::get(Ty) // static method is an expr
>
> which is very ambiguous with the decl case:
>
>   std::vector<int> Y
>   BasicBlock::iterator X = ...
>   ICmpInst::Predicate NewPred =
>
> The LLVM codebase as a whole doesn't use *too* much namespace 
> qualification, because everything is in the llvm namespace and we 
> don't use std::cout at all.  However, a lot of codebases use 
> namespaces heavily.  With the "speculate it is a decl" approach, we 
> would handle the decl cases well (which are the majority of the cases 
> for the llvm codebase) but we lose (compared to the 'preparse to 
> disambiguate' approach) for the qualified expr case.
>
> I think the solution to this (which I know Argiris doesn't like :) is 
> to resolve the qualified type *before* making a decision about whether 
> it is a statement or a decl.  This makes the logic a little more 
> complex (we need to bring back the "parse expr with leading 
> identifier" logic) but it would very neatly solve this problem in just 
> about every real case, and it would mean that we aren't re-resolving 
> and sema-ing types in any of the common cases here.

The "having the parser cache sema resolutions" solves this.

>
> This is clearly something that we can discuss more and do as a later 
> separate step, but I just wanted to throw it out there for kicks. :)
>
>
> Argiris, what do you think of the "tentatively parse as a decl" approach?

As previously said, the preparser approach is non-intrusive and easily 
replaceable. We can go with a "tentatively parse as a decl" approach and 
muddle up the Parser+Sema for this C++ corner case (at the expense of 
the other languages) but we wouldn't know if it was really worth it. If 
we had the preparser approach in place we would be able to compare 
against it and see that it's a clear performance benefit or not. Or we 
could run against actual codebases and see what the ambiguous statements 
resolve to.
How would we feel if we went through the trouble for a "Parse+Sema 
rollback version" and see that most of the ambiguous statements resolve 
to expressions ? :-)

-Argiris