[cfe-dev] [PATCH] C++ decl/expr ambiguity resolution approach

Sat Aug 23 18:22:40 PDT 2008

On Aug 22, 2008, at 8:16 PM, Argiris Kirtzidis wrote:
> The attached patch disambiguates between declaration or expression  
> statements. It uses tentative parsing and backtracking.
> For the common cases no tentative parsing is needed; for the  
> ambiguous ones, this will result in multiple type checks (for  
> tentative parsing and for actual parsing).
> As I've mentioned in a previous post, this can be made efficient in  
> a later patch by caching the type checking results of the Action  
> module.

Hi Argiris,

You have obviously put a lot of time and thought into this code, and  
it shows in its cleanliness and attention to detail.  I have a few  
specific comments about the patch, but I'd like to first talk about  
the approach you've chosen to make sure we're on the same page.

The problem is that we don't know whether something is a declaration  
or a statement without doing a significant amount of parsing to  
decide.  If it is a declaration, we have to accept it as such,  
otherwise we should try parsing it as a statement.  If neither work it  
is an error.

At root, your approach basically boils down to a very simple approach:  
any time when we can't tell whether something is an obvious statement  
(an important optimization!), you fire off a special parser that looks  
ahead and sees if the tokens match the declaration grammar.  This  
parser comes back with an answer, and based on that answer, the normal  
parser goes off and parses with one approach or another.

This is conceptually an extremely clean approach, because it decouples  
disambiguation from actual parsing.  However, this has two big  
downsides: 1) maintenance: it  duplicates a lot of parsing rules,  
because now we have to be able to "tentatively parse" and "really  
parse" all the declaration stuff.  2) performance: in addition to the  
basic cost of backtracking this has to duplicate parsing, and some  
amount of sema (type lookups etc).

To me, these are pretty big drawbacks, and I want to make sure we're  
all on the same page and agree that this is the right approach before  
we go much farther.

Ignoring optimizations for common cases, the only other realistic  
solution that I know of is to make use of tentative parsing to  
actually *start parsing* the ambiguous cases as declarations, and then  
roll back and re-parse if an error occurs.  The code basically looks  
like this:

   // Parse declaration tentatively.
   PP.StartTentativeParsing();
   Res = ParseDeclarationStatement();
   if (!Res.isError) {
      PP.Commit();
      return Res;
   }
   PP.Rollback();

   // Parse statement non-tentatively
   return ParseStatement();
}

This approach produces some very interesting tradeoffs.  The drawback  
of this approach is that any action made while parsing and sema'ing  
the declaration statement *must be able to be rolled back*.  This  
means that warnings need to be buffered up (errors stop the tentative  
parse) and that Sema needs to be able to undo any changes it makes to  
its internal state (e.g. remove declared variables, delete initializer  
expressions that are parsed and built, etc).  Also, if the ambiguous  
statement ends up being an expression, your approach would be superior  
in space and time.

However, the advantage of this approach is that it is much faster than  
doing a pre-parse when things turn out to actually be a declaration.   
Since many statement cases get filtered out before reaching the  
tentative case, I think this is a win.  Additionally, this does not  
require duplicating parsing code for a bunch of grammar productions,  
which is my primary objection to your approach.

Did you consider the tentative parsing approach?  What do you think of  
it?  Will you be able to reasonably expand your preparser to handle  
cases like this?:

int X = X;

I think that this sort of example requires doing some amount of sema  
in the disambiguation parser, though maybe the grammar is regular  
enough to let you just skip over initializers completely (right now,  
this doesn't matter for you because you stop lookahead at the =).

Once we agree on the overall approach, I'll hassle you about things  
like whether we should factor declspec processing out of the tentative  
part and try to find out if there are other important cases that can  
get filtered out as expressions early on (e.g. what about a statement  
that starts with an identifier that is known to be a variable/ 
function, can that ever start a declaration?). :)

Thoughts?

-Chris