[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach

Sun Aug 24 08:19:01 PDT 2008

Chris Lattner wrote:
>
> Hi Argiris,
>
> You have obviously put a lot of time and thought into this code, and 
> it shows in its cleanliness and attention to detail.  I have a few 
> specific comments about the patch, but I'd like to first talk about 
> the approach you've chosen to make sure we're on the same page.
>
> The problem is that we don't know whether something is a declaration 
> or a statement without doing a significant amount of parsing to 
> decide.  If it is a declaration, we have to accept it as such, 
> otherwise we should try parsing it as a statement.  If neither work it 
> is an error.
>
> At root, your approach basically boils down to a very simple approach: 
> any time when we can't tell whether something is an obvious statement 
> (an important optimization!), you fire off a special parser that looks 
> ahead and sees if the tokens match the declaration grammar.  This 
> parser comes back with an answer, and based on that answer, the normal 
> parser goes off and parses with one approach or another.
>
> This is conceptually an extremely clean approach, because it decouples 
> disambiguation from actual parsing.  However, this has two big 
> downsides: 1) maintenance: it  duplicates a lot of parsing rules, 
> because now we have to be able to "tentatively parse" and "really 
> parse" all the declaration stuff.  2) performance: in addition to the 
> basic cost of backtracking this has to duplicate parsing, and some 
> amount of sema (type lookups etc).
>
> To me, these are pretty big drawbacks, and I want to make sure we're 
> all on the same page and agree that this is the right approach before 
> we go much farther.

About maintenance:
I'd say that having separate "disambiguation parsing" offers 
flexibility. For example, if we decide that disambiguating 
comma-separated declarators as comma-separated expressions is almost 
always not what the programmer intended, and it leads to less confusing 
errors to "shortcut" to a declaration by the first declarator, we can do 
that easily by changing a couple of lines.We can have a fine-grained 
control over disambiguation.
The declaration parsing rules for this 'special' parser are 
straightforward and restricted to one file, so it doesn't have huge 
maintenance cost.

>
>
> Ignoring optimizations for common cases, the only other realistic 
> solution that I know of is to make use of tentative parsing to 
> actually *start parsing* the ambiguous cases as declarations, and then 
> roll back and re-parse if an error occurs.  The code basically looks 
> like this:
>
>
>   // Parse declaration tentatively.
>   PP.StartTentativeParsing();
>   Res = ParseDeclarationStatement();
>   if (!Res.isError) {
>      PP.Commit();
>      return Res;
>   }
>   PP.Rollback();
>
>   // Parse statement non-tentatively
>   return ParseStatement();
> }
>
>
> This approach produces some very interesting tradeoffs.  The drawback 
> of this approach is that any action made while parsing and sema'ing 
> the declaration statement *must be able to be rolled back*.  This 
> means that warnings need to be buffered up (errors stop the tentative 
> parse) and that Sema needs to be able to undo any changes it makes to 
> its internal state (e.g. remove declared variables, delete initializer 
> expressions that are parsed and built, etc).  Also, if the ambiguous 
> statement ends up being an expression, your approach would be superior 
> in space and time.
>
> However, the advantage of this approach is that it is much faster than 
> doing a pre-parse when things turn out to actually be a declaration.  
> Since many statement cases get filtered out before reaching the 
> tentative case, I think this is a win.  Additionally, this does not 
> require duplicating parsing code for a bunch of grammar productions, 
> which is my primary objection to your approach.

With that approach, we will make big architectural changes just to deal 
with a very uncommon situation. I really don't think the tradeoffs worth it.
I don't have benchmarks, but I don't think there's such a big 
performance cost with the 'special parser' approach:

--The situation where disambiguation is required is uncommon.
--We can do shortcuts, like disambiguating to a declaration by the first 
declarator (GCC style), or when the first declarator has a '=' initializer.
--Lexer does lexing *only once*. The tokens are cached.
--The "disambiguation parser" is a tight "state-machine", no fancy stuff 
except type checking.
--The only architectural change that I propose is caching type-checks which:
       1) would also be needed for a "normal parser with roll-back"
       2) is useful in general; even on C there are multiple type-checks 
(i.e isDeclarationSpecifier calls)

A "normal parser with roll-back" approach would induce it's own 
time/space overheads but my main concern is that it would induce 
many-many changes and maintenance headaches for Parser+Sema, just to 
accommodate a weird part of the C++ grammar.

In general, the "special disambiguation parser" is a very unintrusive 
approach. If later on, using benchmarks, we determine that it has 
unacceptable performance cost, we can easily replace it with a more 
sophisticated "normal parser with rollback" approach.

>
> Did you consider the tentative parsing approach?  What do you think of 
> it?  Will you be able to reasonably expand your preparser to handle 
> cases like this?:
>
> int X = X;
>
> I think that this sort of example requires doing some amount of sema 
> in the disambiguation parser, though maybe the grammar is regular 
> enough to let you just skip over initializers completely (right now, 
> this doesn't matter for you because you stop lookahead at the =).

As Eli, mentioned, (and is stated in the standard) only type-checking is 
required for disambiguation. (Also, in the new patch I skip over 
initializers completely; I agree with Eli here too)

>
> Once we agree on the overall approach, I'll hassle you about things 
> like whether we should factor declspec processing out of the tentative 
> part and try to find out if there are other important cases that can 
> get filtered out as expressions early on (e.g. what about a statement 
> that starts with an identifier that is known to be a 
> variable/function, can that ever start a declaration?). :)

"can that ever start a declaration?": No, I don't think so.

-Argiris