[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach
Chris Lattner
clattner at apple.com
Sun Aug 24 14:16:18 PDT 2008
On Aug 23, 2008, at 8:37 PM, Eli Friedman wrote:
> On Sat, Aug 23, 2008 at 6:22 PM, Chris Lattner <clattner at apple.com>
> wrote:
>> This is conceptually an extremely clean approach, because it
>> decouples
>> disambiguation from actual parsing. However, this has two big
>> downsides: 1) maintenance: it duplicates a lot of parsing rules,
>> because now we have to be able to "tentatively parse" and "really
>> parse" all the declaration stuff. 2) performance: in addition to the
>> basic cost of backtracking this has to duplicate parsing, and some
>> amount of sema (type lookups etc).
>>
>> To me, these are pretty big drawbacks, and I want to make sure we're
>> all on the same page and agree that this is the right approach before
>> we go much farther.
>
> (1) is definitely an issue... it's a balancing act. I won't try to
> comment on it because I don't know the difficulty of the alternative.
Right, this whole area is a balancing act... there isn't really a
clear "right" way to go, which is why discussing it is useful :)
> For (2), the current code doesn't try to avoid activating
> backtracking; we can easily avoid it for common cases depending on how
> expensive it is to activate backtracking.
Activating backtracking isn't hugely expensive, but it certainly isn't
free. Also, scanning ahead and buffering tokens certainly isn't free,
so we should avoid it when possible. One of the costs is the Sema/
Type resolution that has to happen in the preparser. For example,
silly things like:
std::vector<bool>::iterator ...
require the preparser to do template specialization etc to look up
what "iterator" in vector<bool> is. Immediately after the preparser
decides that it is a type, the decl parser kicks in and has to do all
the same resolution stuff.
> The only cases which need
> the statement vs. expression disambiguation are those starting with T(
> and N::T(, where T is a type. We can catch all the cases which don't
> start with a complicated type with only a single token of lookahead.
Right.
> The current version is already quite optimized in terms of the
> distance it looks ahead; it tries to cut out as soon as possible, and
> most of the time it shouldn't end up looking very far ahead.
Yep, Argiris did a nice job with his implementation of the approach.
> If an identifier changes from a type to an identifier in the middle of
> a statement, and that affects disambiguation, the program is
> ill-formed per the standard; we can just assume for disambiguation
> purposes that anything that isn't a type is a valid variable. And for
> initializers, I'm pretty sure the rules allow skipping forward to the
> next comma/semicolon once we see an = sign; what follows the = sign is
> just a regular assignment-expression, so it doesn't matter in terms of
> disambiguation.
Ok, great! Who knew that there would be *some* sanity in C++? :-)
>> Also, if the ambiguous
>> statement ends up being an expression, your approach would be
>> superior
>> in space and time.
>
> If we want to know what the balance is here, we'll really have to
> benchmark; hopefully the cases where we have to lookahead more than a
> few tokens are rare, in which case performance isn't so much of an
> issue.
It would be very interesting to hack the GCC front-end and run it over
some code base to find out how much it ends up speculating wrong.
After thinking about this some more and sleeping on it, I think that
GCC's approach (tentatively parsing something as a decl, then falling
back to expr on error) is really the best approach. The high level
reason for this is that I think that almost all of the ambiguous cases
will end up being declarations in practice, so doing a "lookahead"
prepare is wasted effort compared to just parsing as a decl and
undoing if it really is a statement.
The basic reason I think this is that I think almost all of the common
expression cases can be easily disambiguated without any[1]
lookahead. For example, here are some silly but important cases:
if (...) // decl can't start with 'if' keyword
X = 4; // X isn't a typename
foo(4); // foo isn't a typename
I->foo(); // I isn't a typename
cout << .. // cout isn't a typename
For fun, I just scanned through the source for instcombine, looking at
the top level statements. Almost all of them would be rejected as
"not decls" very trivially. This gives me good confidence that
backtracking won't even enter the picture for most expressions and
statements. However, every variable definition (obviously) starts
with a type:
int X = ...
Value *V = ..
This means that every variable definition will require starting
backtrack bookkeeping, doing a preparse (even if not very far forward)
then then deciding "yep, it's a decl", backtracking, and then
reparsing as a decl. This seems like a pretty significant cost to pay
for these common cases, and I think the "speculatively parse as a decl
if ambiguous and back off later" approach is better.
There is one more thing I want to bring (back) up, w.r.t. "[1]"
above. The problem is that there is a significant class of things
where we need more than one token of lookahead to resolve and these
are commonly exprs. This is the qualified name case:
std::cout << ... // is an expr
UndefValue::get(Ty) // static method is an expr
which is very ambiguous with the decl case:
std::vector<int> Y
BasicBlock::iterator X = ...
ICmpInst::Predicate NewPred =
The LLVM codebase as a whole doesn't use *too* much namespace
qualification, because everything is in the llvm namespace and we
don't use std::cout at all. However, a lot of codebases use
namespaces heavily. With the "speculate it is a decl" approach, we
would handle the decl cases well (which are the majority of the cases
for the llvm codebase) but we lose (compared to the 'preparse to
disambiguate' approach) for the qualified expr case.
I think the solution to this (which I know Argiris doesn't like :) is
to resolve the qualified type *before* making a decision about whether
it is a statement or a decl. This makes the logic a little more
complex (we need to bring back the "parse expr with leading
identifier" logic) but it would very neatly solve this problem in just
about every real case, and it would mean that we aren't re-resolving
and sema-ing types in any of the common cases here.
This is clearly something that we can discuss more and do as a later
separate step, but I just wanted to throw it out there for kicks. :)
Argiris, what do you think of the "tentatively parse as a decl"
approach?
-Chris
More information about the cfe-dev
mailing list