[cfe-dev] [PATCH] C++ nested-name-specifier (Parser)

Sat Aug 9 21:08:06 PDT 2008

On Aug 9, 2008, at 6:29 PM, Argiris Kirtzidis wrote:
>> case tok::identifier:
>>  // eats the current identifier and related type stuff as a whole,
>>  D = ConsumeAndResolveIdentifier();
>>  if (Actions.isTypeName(D))
>>    ... it's a decl spec ...
>>  else
>>    .. it's an 'identifier expression' ..
>>
>> Is that the problem?
>
> Yes, exactly! I want to avoid the "leading stuff.." route.

I've been going back and forth on this issue and did a bit of research.

GCC's C++ front-end handles this in a couple of different ways.  To  
handle the ambiguity between declarations and statements at the top  
level of compound statements, it basically boils down to:

  cp_parser_parse_tentatively (parser);
  /* Try to parse the declaration-statement.  */
  cp_parser_declaration_statement (parser);
  /* If that worked, we're done.  */
  if (cp_parser_parse_definitely (parser))
     return;

  /* Look for an expression-statement instead.  */
  statement = cp_parser_expression_statement (parser,  
in_statement_expr);

I don't think there is a good way around this (in its full  
generality), other than some "peephole" look-ahead to try to  
disambiguate without fully tentative parsing.  This is worthwhile  
because the code above would run for simple things like { x = y; }  
because looking at 'x=' you can obviously tell whether it starts a  
declaration or not.

However, I think that the primary expression case can be better and is  
more important.  The GCC parser (afaict) does a ton of tentative  
parsing and other weird stuff when trying to resolve identifiers in  
primary expression and I think this is extremely bad news.

Consider some very common C++ cases:

   std::cout << ....
   std::string X;
   ... std::string::npos ...
   std::string("")
   std::vector<int>::iterator
   etc

The big problem that I have with backtracking is that use of it to  
handle primary expressions means that you do a huge amount of the same  
work, over and over again. In this case, analyzing "std::string::npos"  
will probably start tentatively parsing it as a type (because it could  
be the start of a 'constructor style cast'.  This will require lookup  
of 'std', then finding the 'string' typedef in it, then finding the  
definition of the string type (a record_decl for  
basic_string<char...>) then finding the npos member, then seeing the  
'(', then realizizing that npos is not a type and aborting/ 
backtracking.  In the 'retry' it will start parsing it again, which  
has to resolve 'std', 'string' and 'npos' all over again, just to  
continue on as a non-type.  It is far worse for things like  
vector<int> because it requires looking up the template and handling  
partial specialization *multiple times* which is slow slow slow :)

In the 'Old Clang C parser' we had a special case for identifier, and  
we had "parse expression with leading identifier".  You convinced me  
that you can make the preprocessor fast enough at single token  
lookahead to allow us to depend on it and simplify the  
implementation.  For C, this is sufficient, but for C++ it clearly is  
not.

Given the complexities of these C++ issues, I'm strongly inclined to  
go back to the old design for expressions (however, we should still  
rely on 1-token lookahead for labels).  This means that we resolve the  
qualified identifier streams exactly once in this case, and don't have  
to re-resolve them and don't have have backtracking in this (very,  
very, common case).  This does bring back some annoyingness in the  
clang implementation of the grammar, but it comes into a very simple  
place that doesn't require constant hacking (the primary/postfix/cast  
part of the grammar is pretty simple).

Places in the code that use the "is decl spec" predicate will also  
have to be changed (e.g. the compound statement case).  I think it  
would be interesting to consider making the 'is decl spec' predicate  
either be tri-state (yes/no/maybe) or see if the callers of this can  
be made to inline the two cases directly, like we can do for primary  
expressions.  I haven't looked at the clients of 'is decl spec' to see  
how insane this idea is.

>>  It also means that isDeclarationSpecifier would need backtracking  
>> and significant other stuff to work as a predicate.  In practice,  
>> this means we'd want to refactor all callers to not use it like it  
>> does.
>
> Here's another way to make it work, without any backtracking:
> Either Parser or Sema, keeps a scope-spec state (like Parser keeps a  
> 'current scope' state and Sema keeps 'current DeclContext' state),  
> which indicates the Decl that we need to do lookup into.
> As you suggested, the Parser, calls actions during parsing of scope  
> specifiers.
> For "A::T<int>::":
> -Parses 'A::' and calls ActOnNestedNameSpecifier for 'A' (updates  
> scope-spec)
> -Parses, instantiates, etc. 'T<int>' through action calls (updates  
> scope-spec)

Ok.  I like that this means the resolution only has to happen once,  
which is my goal.

> Sema actions like isTypeName and ActOnIdentifier, check the scope- 
> spec to see whether it needs to lookup a name inside the scope-spec  
> decl, or do a normal lookup.

This is interesting: doesn't this mean that everything else will have  
to check to make sure that this hasn't happened, in order to reject  
invalid code?

> Now, the questions is whether Sema should keep the scope-spec state,  
> or whether the Parser should keep the scope-spec state and pass it  
> along to the actions (isTypeName, etc.)
> When I say passing the scope-spec to actions, I don't mean like the  
> CXXScopeSpec in my patch which contained only parsed tokens, I mean  
> passing the decl that represents the scope where names should be  
> looked up into.

Ok.

> If the Sema keeps the scope-spec state, it is cleaner, but how will  
> the scope-state be cleared in case of an error like "A:: ;" ?
> Should the parser call an ActOnErrorAfterNestedName or something ?

That is possible.  I'm more concerned with the fact that Sema for ';'  
will have to check to make sure there is no current scope-spec that is  
active.  This approach seems to make it so that *all* sema actions  
would have to check for an empty scope-spec :(

>>> This also means that we can drop the rewinding part if the Parser  
>>> or Sema keep and look after a 'C++ scoping' state.
>>> If the Parser keeps the state, it will build and pass  
>>> 'CXXScopeSpec' objects to action methods, (like in my patch).
>>> If the Sema keeps the state, it will make error recovery awkward.
>>
>> I actually think that things are more awkward with having Sema work  
>> this out.  Specifically, if you have: "A :: B :: C" you might have  
>> one reference or it might be "A::B ::C" which can occur in a few  
>> places in the grammar:
>> http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt
>>
>> If the parser just passed all of A/B/C to Sema, I'm not sure how it  
>> would handle this.  It seems really easy if you invoke actions for  
>> "A::" then "B" (which returns a type).  When the parser got a type,  
>> it would naturally ratchet on and parse ::C as a qualified name.
>
> But "B" type could contain "C" as nested type, in which case you may  
> want "A::B::C".
> Anyway, I don't think we need to worry about whether it should be  
> "A::B   ::C" or  "A  ::B::C" or whatever.
> GCC treats it as "A::B::C" always:
>
> struct S {};
> namespace A { S f(); }
> S ::A::f();  // error: 'S::A' has not been declared
>
> I think this is correct since the spec says nothing about "'::'  
> associativity".

Sure, but this can apparently happen in the 'friend' case and some  
others, according to the Elsa doc.  I confess to not being an expert  
on these matters though.

-Chris