[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach
Argiris Kirtzidis
akyrtzi at gmail.com
Sun Aug 24 08:19:01 PDT 2008
Chris Lattner wrote:
>
> Hi Argiris,
>
> You have obviously put a lot of time and thought into this code, and
> it shows in its cleanliness and attention to detail. I have a few
> specific comments about the patch, but I'd like to first talk about
> the approach you've chosen to make sure we're on the same page.
>
> The problem is that we don't know whether something is a declaration
> or a statement without doing a significant amount of parsing to
> decide. If it is a declaration, we have to accept it as such,
> otherwise we should try parsing it as a statement. If neither work it
> is an error.
>
> At root, your approach basically boils down to a very simple approach:
> any time when we can't tell whether something is an obvious statement
> (an important optimization!), you fire off a special parser that looks
> ahead and sees if the tokens match the declaration grammar. This
> parser comes back with an answer, and based on that answer, the normal
> parser goes off and parses with one approach or another.
>
> This is conceptually an extremely clean approach, because it decouples
> disambiguation from actual parsing. However, this has two big
> downsides: 1) maintenance: it duplicates a lot of parsing rules,
> because now we have to be able to "tentatively parse" and "really
> parse" all the declaration stuff. 2) performance: in addition to the
> basic cost of backtracking this has to duplicate parsing, and some
> amount of sema (type lookups etc).
>
> To me, these are pretty big drawbacks, and I want to make sure we're
> all on the same page and agree that this is the right approach before
> we go much farther.
About maintenance:
I'd say that having separate "disambiguation parsing" offers
flexibility. For example, if we decide that disambiguating
comma-separated declarators as comma-separated expressions is almost
always not what the programmer intended, and it leads to less confusing
errors to "shortcut" to a declaration by the first declarator, we can do
that easily by changing a couple of lines.We can have a fine-grained
control over disambiguation.
The declaration parsing rules for this 'special' parser are
straightforward and restricted to one file, so it doesn't have huge
maintenance cost.
>
>
> Ignoring optimizations for common cases, the only other realistic
> solution that I know of is to make use of tentative parsing to
> actually *start parsing* the ambiguous cases as declarations, and then
> roll back and re-parse if an error occurs. The code basically looks
> like this:
>
>
> // Parse declaration tentatively.
> PP.StartTentativeParsing();
> Res = ParseDeclarationStatement();
> if (!Res.isError) {
> PP.Commit();
> return Res;
> }
> PP.Rollback();
>
> // Parse statement non-tentatively
> return ParseStatement();
> }
>
>
> This approach produces some very interesting tradeoffs. The drawback
> of this approach is that any action made while parsing and sema'ing
> the declaration statement *must be able to be rolled back*. This
> means that warnings need to be buffered up (errors stop the tentative
> parse) and that Sema needs to be able to undo any changes it makes to
> its internal state (e.g. remove declared variables, delete initializer
> expressions that are parsed and built, etc). Also, if the ambiguous
> statement ends up being an expression, your approach would be superior
> in space and time.
>
> However, the advantage of this approach is that it is much faster than
> doing a pre-parse when things turn out to actually be a declaration.
> Since many statement cases get filtered out before reaching the
> tentative case, I think this is a win. Additionally, this does not
> require duplicating parsing code for a bunch of grammar productions,
> which is my primary objection to your approach.
With that approach, we will make big architectural changes just to deal
with a very uncommon situation. I really don't think the tradeoffs worth it.
I don't have benchmarks, but I don't think there's such a big
performance cost with the 'special parser' approach:
--The situation where disambiguation is required is uncommon.
--We can do shortcuts, like disambiguating to a declaration by the first
declarator (GCC style), or when the first declarator has a '=' initializer.
--Lexer does lexing *only once*. The tokens are cached.
--The "disambiguation parser" is a tight "state-machine", no fancy stuff
except type checking.
--The only architectural change that I propose is caching type-checks which:
1) would also be needed for a "normal parser with roll-back"
2) is useful in general; even on C there are multiple type-checks
(i.e isDeclarationSpecifier calls)
A "normal parser with roll-back" approach would induce it's own
time/space overheads but my main concern is that it would induce
many-many changes and maintenance headaches for Parser+Sema, just to
accommodate a weird part of the C++ grammar.
In general, the "special disambiguation parser" is a very unintrusive
approach. If later on, using benchmarks, we determine that it has
unacceptable performance cost, we can easily replace it with a more
sophisticated "normal parser with rollback" approach.
>
> Did you consider the tentative parsing approach? What do you think of
> it? Will you be able to reasonably expand your preparser to handle
> cases like this?:
>
> int X = X;
>
> I think that this sort of example requires doing some amount of sema
> in the disambiguation parser, though maybe the grammar is regular
> enough to let you just skip over initializers completely (right now,
> this doesn't matter for you because you stop lookahead at the =).
As Eli, mentioned, (and is stated in the standard) only type-checking is
required for disambiguation. (Also, in the new patch I skip over
initializers completely; I agree with Eli here too)
>
> Once we agree on the overall approach, I'll hassle you about things
> like whether we should factor declspec processing out of the tentative
> part and try to find out if there are other important cases that can
> get filtered out as expressions early on (e.g. what about a statement
> that starts with an identifier that is known to be a
> variable/function, can that ever start a declaration?). :)
"can that ever start a declaration?": No, I don't think so.
-Argiris
More information about the cfe-dev
mailing list