[cfe-dev] decl/expr ambiguity

Sun Aug 24 18:48:42 PDT 2008

> On Sun, Aug 24, 2008 at 6:23 PM, Chris Lattner <clattner at apple.com>
> wrote:
> > Okay, here's another crazy idea.  If you boil it down, my objections
> > to preparsing are basically:
> >
> > 3. [minor] the maintenance cost of the second parser.
> 
> I'm not at all convinced that this is minor. There's a whole lot of
> pre-parsing logic to disambiguate this mess of a declaration:
> 
>   ::std::vector<int> v(istream_iterator<int>(fin),
>       typename lambda<istream_iterator<_1>>::template apply<T>::type());
> 
> I could have had a bit more fun if I used non-type template arguments
> in there, since that requires being able to pre-parse expressions. But
> the point is that the maintenance issue looks more minor than it is
> because Clang isn't parsing the tricky parts of simple-type-specifier
> and elaborated-type-specifier.
> 
> Having tentative parsing makes other parts of the C++ compiler easier
> to implement, especially error recovery where the parser wants to
> check whether tweaking the tokens could produce a valid parse. For
> example, one that GCC recovers well from is the common error of writing a
> <: digraph when starting a template argument list, e.g.,
> 
>   namespace N { struct X { }; }
>   std::vector<::N::X> vec;
> 
> There are also places in C++ where we have to determine whether we're
> looking at a type or an expression (e.g., in sizeof(blah)). Will we
> have another pre-parser for these? Tentative parsing solves this
> problem as well.
> 
> All that said, it might not matter so much whether we pre-parse now or
> not. As those trickier parsing bits for types go into the parser, they
> could probably be abstracted out to be useful for both the
> pre-parser(s?) and the parser, to eliminate unnecessary code/coding
> duplication.

I'm not sure if this has been discussed before (I didn't follow the whole
discussion), so please bear with me if this is repetitive.

A completely different approach would be either to build a parser for a
(non-ambiguous) superset of C++, or to build a (GLR-like) parser that
generates not a single parse tree but a parse forest covering the ambiguous
parts, and then to fix the non-C++ parts/ambiguities in the resulting parse
tree(s) afterwards. As it turns out, there are only a handful of quite simple
rules to follow to make sure only valid C++ parse trees remain after this
post-processing. This approach has been used and described in prior art, but
I'm not aware of any comparisons with regard to resource utilization, time
requirements, etc.

I'm aware that the current approach for writing the C++ parser is to create
a recursive-descent (RD) parser. For that reason I assume building a GLR
parser is not an option (additionally, GLR parsers tend to create really huge
parse trees). But the first option is still a possibility: parse a superset
of C++, then disambiguate the parse tree during a second (relatively simple)
pass.
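
To make the first option a bit more concrete, here is a minimal sketch. It is
not taken from either paper, and all names in it (Stmt, parseSuperset,
disambiguate) are hypothetical. It assumes the only input is a statement of
the form "A(b);", which the first pass accepts without deciding whether it is
a declaration or a call; the second pass then resolves it using a set of
known type names.

```cpp
#include <set>
#include <string>

// Hypothetical node kinds; a real parser would have many more.
enum class Kind { Ambiguous, Declaration, CallExpr };

struct Stmt {
    Kind kind;
    std::string head;  // the `A` in `A(b);`
    std::string arg;   // the `b` in `A(b);`
};

// First pass: accept "name(name);" from the superset grammar and record
// it as ambiguous. Assumes the input is well formed (illustration only).
Stmt parseSuperset(const std::string& text) {
    const auto open = text.find('(');
    const auto close = text.find(')');
    return {Kind::Ambiguous,
            text.substr(0, open),
            text.substr(open + 1, close - open - 1)};
}

// Second pass: if `head` names a type, the statement declares a variable
// `arg` of that type; otherwise it is treated as a function call.
Stmt disambiguate(Stmt s, const std::set<std::string>& typeNames) {
    if (s.kind == Kind::Ambiguous) {
        s.kind = typeNames.count(s.head) ? Kind::Declaration
                                         : Kind::CallExpr;
    }
    return s;
}
```

With this, parseSuperset("T(x);") yields an Ambiguous node, and disambiguate
turns it into a Declaration when "T" is in the set of type names and into a
CallExpr otherwise. The point is only that the first pass needs no knowledge
of which names are types.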

Here are two papers I know of describing these approaches:

First option: http://www.computing.surrey.ac.uk/research/dsrg/fog/
Second option:
http://www.lrde.epita.fr/dload/20030521-Seminar/vasseur0503_transformers_report.pdf

Regards, Hartmut