[cfe-dev] [PATCH] C++ nested-name-specifier (Parser)
Chris Lattner
clattner at apple.com
Sat Aug 9 21:08:06 PDT 2008
On Aug 9, 2008, at 6:29 PM, Argiris Kirtzidis wrote:
>> case tok::identifier:
>> // eats the current identifier and related type stuff as a whole,
>> D = ConsumeAndResolveIdentifier();
>> if (Actions.isTypeName(D))
>> ... it's a decl spec ...
>> else
>> .. it's an 'identifier expression' ..
>>
>> Is that the problem?
>
> Yes, exactly! I want to avoid the "leading stuff.." route.
I've been going back and forth on this issue and did a bit of research.
GCC's C++ front-end handles this in a couple of different ways. To
handle the ambiguity between declarations and statements at the top
level of compound statements, it basically boils down to:
cp_parser_parse_tentatively (parser);
/* Try to parse the declaration-statement. */
cp_parser_declaration_statement (parser);
/* If that worked, we're done. */
if (cp_parser_parse_definitely (parser))
return;
/* Look for an expression-statement instead. */
statement = cp_parser_expression_statement (parser,
in_statement_expr);
I don't think there is a good way around this (in its full
generality), other than some "peephole" look-ahead to try to
disambiguate without fully tentative parsing. This is worthwhile
because the code above would run for simple things like { x = y; }
because looking at 'x=' you can obviously tell whether it starts a
declaration or not.
However, I think that the primary expression case can be better and is
more important. The GCC parser (afaict) does a ton of tentative
parsing and other weird stuff when trying to resolve identifiers in
primary expression and I think this is extremely bad news.
Consider some very common C++ cases:
std::cout << ....
std::string X;
... std::string::npos ...
std::string("")
std::vector<int>::iterator
etc
The big problem that I have with backtracking is that use of it to
handle primary expressions means that you do a huge amount of the same
work, over and over again. In this case, analyzing "std::string::npos"
will probably start tentatively parsing it as a type (because it could
be the start of a 'constructor style cast'. This will require lookup
of 'std', then finding the 'string' typedef in it, then finding the
definition of the string type (a record_decl for
basic_string<char...>) then finding the npos member, then seeing the
'(', then realizizing that npos is not a type and aborting/
backtracking. In the 'retry' it will start parsing it again, which
has to resolve 'std', 'string' and 'npos' all over again, just to
continue on as a non-type. It is far worse for things like
vector<int> because it requires looking up the template and handling
partial specialization *multiple times* which is slow slow slow :)
In the 'Old Clang C parser' we had a special case for identifier, and
we had "parse expression with leading identifier". You convinced me
that you can make the preprocessor fast enough at single token
lookahead to allow us to depend on it and simplify the
implementation. For C, this is sufficient, but for C++ it clearly is
not.
Given the complexities of these C++ issues, I'm strongly inclined to
go back to the old design for expressions (however, we should still
rely on 1-token lookahead for labels). This means that we resolve the
qualified identifier streams exactly once in this case, and don't have
to re-resolve them and don't have have backtracking in this (very,
very, common case). This does bring back some annoyingness in the
clang implementation of the grammar, but it comes into a very simple
place that doesn't require constant hacking (the primary/postfix/cast
part of the grammar is pretty simple).
Places in the code that use the "is decl spec" predicate will also
have to be changed (e.g. the compound statement case). I think it
would be interesting to consider making the 'is decl spec' predicate
either be tri-state (yes/no/maybe) or see if the callers of this can
be made to inline the two cases directly, like we can do for primary
expressions. I haven't looked at the clients of 'is decl spec' to see
how insane this idea is.
>> It also means that isDeclarationSpecifier would need backtracking
>> and significant other stuff to work as a predicate. In practice,
>> this means we'd want to refactor all callers to not use it like it
>> does.
>
> Here's another way to make it work, without any backtracking:
> Either Parser or Sema, keeps a scope-spec state (like Parser keeps a
> 'current scope' state and Sema keeps 'current DeclContext' state),
> which indicates the Decl that we need to do lookup into.
> As you suggested, the Parser, calls actions during parsing of scope
> specifiers.
> For "A::T<int>::":
> -Parses 'A::' and calls ActOnNestedNameSpecifier for 'A' (updates
> scope-spec)
> -Parses, instantiates, etc. 'T<int>' through action calls (updates
> scope-spec)
Ok. I like that this means the resolution only has to happen once,
which is my goal.
> Sema actions like isTypeName and ActOnIdentifier, check the scope-
> spec to see whether it needs to lookup a name inside the scope-spec
> decl, or do a normal lookup.
This is interesting: doesn't this mean that everything else will have
to check to make sure that this hasn't happened, in order to reject
invalid code?
> Now, the questions is whether Sema should keep the scope-spec state,
> or whether the Parser should keep the scope-spec state and pass it
> along to the actions (isTypeName, etc.)
> When I say passing the scope-spec to actions, I don't mean like the
> CXXScopeSpec in my patch which contained only parsed tokens, I mean
> passing the decl that represents the scope where names should be
> looked up into.
Ok.
> If the Sema keeps the scope-spec state, it is cleaner, but how will
> the scope-state be cleared in case of an error like "A:: ;" ?
> Should the parser call an ActOnErrorAfterNestedName or something ?
That is possible. I'm more concerned with the fact that Sema for ';'
will have to check to make sure there is no current scope-spec that is
active. This approach seems to make it so that *all* sema actions
would have to check for an empty scope-spec :(
>>> This also means that we can drop the rewinding part if the Parser
>>> or Sema keep and look after a 'C++ scoping' state.
>>> If the Parser keeps the state, it will build and pass
>>> 'CXXScopeSpec' objects to action methods, (like in my patch).
>>> If the Sema keeps the state, it will make error recovery awkward.
>>
>> I actually think that things are more awkward with having Sema work
>> this out. Specifically, if you have: "A :: B :: C" you might have
>> one reference or it might be "A::B ::C" which can occur in a few
>> places in the grammar:
>> http://www.cs.berkeley.edu/~smcpeak/elkhound/sources/elsa/doc/coloncolon.txt
>>
>> If the parser just passed all of A/B/C to Sema, I'm not sure how it
>> would handle this. It seems really easy if you invoke actions for
>> "A::" then "B" (which returns a type). When the parser got a type,
>> it would naturally ratchet on and parse ::C as a qualified name.
>
> But "B" type could contain "C" as nested type, in which case you may
> want "A::B::C".
> Anyway, I don't think we need to worry about whether it should be
> "A::B ::C" or "A ::B::C" or whatever.
> GCC treats it as "A::B::C" always:
>
> struct S {};
> namespace A { S f(); }
> S ::A::f(); // error: 'S::A' has not been declared
>
> I think this is correct since the spec says nothing about "'::'
> associativity".
Sure, but this can apparently happen in the 'friend' case and some
others, according to the Elsa doc. I confess to not being an expert
on these matters though.
-Chris
More information about the cfe-dev
mailing list