[cfe-dev] clang leveraging Elsa?

Sun Oct 7 21:50:38 PDT 2007

On Oct 7, 2007, at 9:07 PM, Scott McPeak wrote:
> Daniel Wilkerson pointed me at this thread, and thought I should
> respond to answer some questions people have about Elsa.  But before I
> do so, I want to be clear that I'm not in favor of nor against clang
> using Elsa.  I'm just hoping to help clear any misconceptions so the
> best choice can be made.

Welcome Scott!

>
>> What I meant is that Elsa (as I understand it) has enough semantic
>> analysis to parse, but is not enforcing all of the constraints
>> required by the language.
>
> This is true, and even stated as such on the Elsa page, but is quite
> misleading (for which I am partly to blame).  "enough semantic
> analysis to parse" is almost the whole language, because virtually
> every language feature must be handled properly to get an accurate
> parse: one must disintguish names of types from names of variables,
> and since names can be members of template classes, and the right
> specialization must be chosen, and specialization selection engages
> almost every other language feature directly or indirectly, there is
> no stopping short.  Believe me, I tried.  :)

Yep.  C++ is extremely annoying like that.  Handling knowing whether  
simple things like a<t>::y are a type or variable does require full  
template info and details like SFINAE make it hard to get everything  
right.  I was not trying to understate the effort and value of Elsa,  
I was just trying to explain how differing goals drive the projects  
priorities in different ways.
> The only feature that Elsa is completely lacking, because it genuinely
> is not required for a correct parse (thank goodness), is access
> control (public, private, etc.).  The only hard part about that is
> figuring out what the heck 'friend' means (but that *is* hard).

Sure, relative to everything else, it's not a big deal.

>> For example, it does not (AFAIK) correctly enforce things like
>> integer constant expressions, constraints on VLAs etc.
>
> It is true there are a lot of language restrictions that are not
> currently enforced.  However, access control aside, I claim that each
> such restriction could be enforced by writing some "leaf" code to do
> it; all the information is there, the rule simply needs to be coded.

Yep, anything can be added :)

>> Specifically, lexing, parsing, and AST building are not cleanly
>> separable as they are in clang.
>
> I would argue that Elsa separates these phases as well as can be done
> for C++; separation was among the early goals of the project.  Unlike
> every other C/C++ implementation I am aware of (I have not looked at
> clang), there is no "lexer hack" feedback loop from the symbol table
> to the lexer.  So lexing and parsing are each totally independent
> phases.  However, the output of the parser is a forest, not a tree;
> the price of separation is that much disambiguation has to be
> postponed until type checking.

This is not exactly what I meant, allow me to clarify.  :)  Elsa has  
a nice separation because it allows deferring certain decisions until  
the (ambiguous) parse tree is pruned down by later semantic  
analysis.  This means that the parser doesn't have to have any idea  
whether something is an variable, type, template etc in order to get  
the correct answer.

In clang, we use a form of the lexer hack.  The lexer itself is not  
actually "hacked", it just returns "identifier", never "type" or  
"variable".  Instead of having the lexer make the determination  
(which would require it to know about scope and many other things in C 
++), the *parser* actually makes the determination.  This design  
allows the lexer/preprocessor to be completely independent of the  
parser data structures, as in elsa.

Unlike Elsa, clang does do the "lexer hack" in the parser.  This  
means that the C parser (for example) needs to determine whether an  
identifier returned by the lexer is a typename or a variable, before  
it can correctly parse past the token.  In the case of C++, much more  
would be needed of course.

The major difference between clang and other AST-building "parsers"  
is that clang's parser does not build an AST at all.  The simplest  
way to think of it is that each production in the grammar invokes a  
virtual method on an "actions" module.  This actions module can then  
build an AST if it wants, or do other light-weight tracking.  In C,  
for example, the 'minimal actions' just keeps track of scoped typedef/ 
variable names: it doesn't build an AST at all otherwise.  It is  
obviously true that a C++ version of this will have to keep track of  
much more for templates.

When I said: "lexing, parsing, and AST building are not cleanly  
separable as they are in clang", I meant this separation between the  
parsing and AST building primarily.  The specific win of this is that  
they are in different libraries, which cleanly decouples them (AST  
building only depends on the parser, there is no circular dependence)  
and makes both of them simpler/more clear.

>> See above: accepting a superset of a language is much harder than
>> accepting the language properly.
>
> I think you meant to say that accepting a superset is much easier?

Yep.

> I probably would have agreed with that statement (the "much" part)
> when I first began writing Elsa, but I now do not.

Fair enough.  While I think I understand what needs to be done, you  
have actually done it :)

>> Reliance on GLR parsing makes Elsa fundamentally slower than a parser
>> that does well constrained look ahead and backtracking.  I am
>> actually a fan of GLR parsing, and I think that Elsa's implementation
>> is a pretty good one.  However, GLR parsing requires *building
>> speculative parse trees* and then eliminating the speculation later.
>> In my experience with clang, I've found that anything which does
>> memory allocation or touches the heap is orders of magnitude slower
>> than something that can avoid it.
>
> First, I don't see how one could separate lexing+parsing from type
> checking of C++ w/o using GLR.  So you're kind of stuck if you want
> both "as fast as possible" but also "clean separation".

See above, hopefully that clarifies what I mean by 'clean  
separation'.  The code is separate, in different libraries, but at  
runtime the execution is intertwined.

> Now, as for whether GLR is slow: with some work, I was able to make
> the GLR implementation used in Elsa perform within a few percent of a
> bison-generated parser, for a deterministic C grammar.  This is due
> (in part) to a technique that allows the parser to switch between LR
> and GLR dynamically depending on the grammar+input:
>
>    http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps

Yep, of course, this is basically saying that "GLR is fast when you  
don't use its power" :)

You could clearly refactor the C++ grammar to be completely  
deterministic and avoid all ambiguous node allocations, but then  
there is little reason to use GLR.

> For the (nondeterministic) C++ grammar, I measured Elsa's performance
> as being around 30% slower than gcc-2 and around 2x faster than gcc-3
> (that was against gcc-3.2.x or so; gcc-3.4+ has a totally new parser,
> and I haven't compared to it).  I do not know if you consider gcc
> "fast enough".  (See the above citation for details.)

No, I don't.  I consider GCC to be ridiculously slow, particularly  
when parsing C++.  When parsing C, clang is already 2.5x faster than  
GCC at -fsyntax-only, and about 4.6x faster if you just count the  
parser (not the preprocessor/lexer) (http://clang.llvm.org/ 
clang_video-07-25-2007.html).  I expect clang to be an even higher  
multiple faster than GCC on C++ code, when we finally support it.   
This basically reflects the fact that the GCC C++ Front-end has a  
huge array of known performance problems and N^2 algorithms in it.

>> I don't see how GLR parsing can be done without a liberal amount of
>> heap traffic, but maybe I'm missing something.
>
> The GLR algorithm itself uses a graph-structured stack, but the
> Elkhound implementation uses a node pool that is nearly as fast as an
> ordinary LR stack like bison's.  Heap traffic can be fast if the
> locality is good and the general-purpose allocator is avoided.

Right.  The critical thing here is that Elsa's nondeterministic  
parser trades parse speed and memory (both due to heap allocations)  
for a more elegant implementation where semantic analysis is  
completely decoupled from parsing.  Unfortunately, for the goals I  
have for clang, this tradeoff isn't acceptable.  :(

> At the end of the day, there are basically three approaches to parsing
> C++ that I am aware of:
>
> * Lexer hack plus hand-crafted parsing.  This is what GCC-3.4+ does,
> and what EDG has done all along.  It ends up being better (IMO) than
> the LR approach b/c the unpredictability goes away, but it's still a
> lot of work to maintain since the parser does a lot of explicit token
> manipulation.  It's also difficult to separate the parser from its
> AST, which is unfortunate from the perspective of possibly reusing the
> GCC parser.

Right, this is the approach that I plan :)

-Chris