[cfe-dev] clang leveraging Elsa?

Sun Oct 7 21:07:24 PDT 2007

Hi,

Daniel Wilkerson pointed me at this thread, and thought I should
respond to answer some questions people have about Elsa.  But before I
do so, I want to be clear that I'm not in favor of nor against clang
using Elsa.  I'm just hoping to help clear any misconceptions so the
best choice can be made.

 >What I meant is that Elsa (as I understand it) has enough semantic
 >analysis to parse, but is not enforcing all of the constraints
 >required by the language.

This is true, and even stated as such on the Elsa page, but is quite
misleading (for which I am partly to blame).  "enough semantic
analysis to parse" is almost the whole language, because virtually
every language feature must be handled properly to get an accurate
parse: one must disintguish names of types from names of variables,
and since names can be members of template classes, and the right
specialization must be chosen, and specialization selection engages
almost every other language feature directly or indirectly, there is
no stopping short.  Believe me, I tried.  :)

The only feature that Elsa is completely lacking, because it genuinely
is not required for a correct parse (thank goodness), is access
control (public, private, etc.).  The only hard part about that is
figuring out what the heck 'friend' means (but that *is* hard).

 >For example, it does not (AFAIK) correctly enforce things like
 >integer constant expressions, constraints on VLAs etc.

It is true there are a lot of language restrictions that are not
currently enforced.  However, access control aside, I claim that each
such restriction could be enforced by writing some "leaf" code to do
it; all the information is there, the rule simply needs to be coded.

 >Specifically, lexing, parsing, and AST building are not cleanly
 >separable as they are in clang.

I would argue that Elsa separates these phases as well as can be done
for C++; separation was among the early goals of the project.  Unlike
every other C/C++ implementation I am aware of (I have not looked at
clang), there is no "lexer hack" feedback loop from the symbol table
to the lexer.  So lexing and parsing are each totally independent
phases.  However, the output of the parser is a forest, not a tree;
the price of separation is that much disambiguation has to be
postponed until type checking.

 >In clang, it is possible to implement an action module what uses the
 >parser but doesn't necessarily build an AST at all.

Unfortunately, that is simply not possible when parsing C++.  You
*have* to do template instantiation, and that requires building an
AST.  :(

 >See above: accepting a superset of a language is much harder than
 >accepting the language properly.

I think you meant to say that accepting a superset is much easier?

I probably would have agreed with that statement (the "much" part)
when I first began writing Elsa, but I now do not.

 >built elsa, so I can't play with it, but my guess is that the
 >diagnostics produced by elsa are not very good.

Yes and no.  :)  When it comes to syntax for which there is no
possible syntactic parse (meaning the parser itself is rejecting the
input), the diagnostics are awful.  Elkhound (the parser generator
underneath Elsa) has no error recovery and only very primitive
diagnostics ("unexpected token <blah> at <blah>, bailing").  I have a
longstanding TODO to add good diagnosis and recovery (I was planning
to use the Burke+Fisher algorithm) to Elkhound.  I would estimate it
would only take a couple weeks; adapting the algorithm to GLR is
nontrivial but should be possible.

However, once you get into the type checker, the diagnostics are
fairly good IMO.  I try to print out both what Elsa thinks is wrong
and the names, types, etc. of relevant entities in coherent error
messages.  Good diagnostics were also among the early goals.  Now,
that's not to say there is no room for improvement, but I haven't seen
much better (again, IMO).

 >> are the only crucial part that needs work. It fails on anything
 >> beyond simple template instantiation(lucky for me it's barely
 >> enough for Mozilla).

I think this characterization is a bit unfair, though I do understand
the frustration users feel when they hit Elsa template bugs, and
certainly acknowledge there are still plenty in there.  I knock them
off from time to time.

If I had my druthers, I'd rewrite the entire template mechanism (for
the fourth time).  But what's there gets the job done.  Just how much
of the job it does is hard to say, but I'm currently finding that
fixing bugs usually involves small-ish tweaks.

 >> Elsa was also designed with a performance in mind (but I agree that
 >> the authors could've done much better). An annoying part of elsa is
 >> hand-rolled data structures(one named string!), so I've been
 >> considering doing some sort of refactoring to get rid or change
 >> some of the obscure data structures.

It's true I avoided using STL, for a variety of reasons.  However, I
don't think that choice affected performance one way or another.

 >Reliance on GLR parsing makes Elsa fundamentally slower than a parser
 >that does well constrained look ahead and backtracking.  I am
 >actually a fan of GLR parsing, and I think that Elsa's implementation
 >is a pretty good one.  However, GLR parsing requires *building
 >speculative parse trees* and then eliminating the speculation later.
 >In my experience with clang, I've found that anything which does
 >memory allocation or touches the heap is orders of magnitude slower
 >than something that can avoid it.

First, I don't see how one could separate lexing+parsing from type
checking of C++ w/o using GLR.  So you're kind of stuck if you want
both "as fast as possible" but also "clean separation".

Now, as for whether GLR is slow: with some work, I was able to make
the GLR implementation used in Elsa perform within a few percent of a
bison-generated parser, for a deterministic C grammar.  This is due
(in part) to a technique that allows the parser to switch between LR
and GLR dynamically depending on the grammar+input:

   http://www.cs.berkeley.edu/~smcpeak/papers/elkhound_cc04.ps

For the (nondeterministic) C++ grammar, I measured Elsa's performance
as being around 30% slower than gcc-2 and around 2x faster than gcc-3
(that was against gcc-3.2.x or so; gcc-3.4+ has a totally new parser,
and I haven't compared to it).  I do not know if you consider gcc
"fast enough".  (See the above citation for details.)

 >I don't see how GLR parsing can be done without a liberal amount of
 >heap traffic, but maybe I'm missing something.

The GLR algorithm itself uses a graph-structured stack, but the
Elkhound implementation uses a node pool that is nearly as fast as an
ordinary LR stack like bison's.  Heap traffic can be fast if the
locality is good and the general-purpose allocator is avoided.

As for building AST, it's not merely speculative; much of the extra
AST actually survives parsing in the form of ambiguous alternatives.
But my measurements show that there are only 30-50% more nodes in the
ambiguous AST than in the unambiguous AST.  So while that's not a
trivial cost, you're going to have to work pretty hard to get more
than a factor of two by eliminating the extra AST building.

At the end of the day, there are basically three approaches to parsing
C++ that I am aware of:

* Lexer hack plus LR parsing.  This is what GCC did until 3.4.  It is
very painful to make changes b/c LALR(1) is so unpredictable.

* Lexer hack plus hand-crafted parsing.  This is what GCC-3.4+ does,
and what EDG has done all along.  It ends up being better (IMO) than
the LR approach b/c the unpredictability goes away, but it's still a
lot of work to maintain since the parser does a lot of explicit token
manipulation.  It's also difficult to separate the parser from its
AST, which is unfortunate from the perspective of possibly reusing the
GCC parser.

* GLR and post-parse disambiguation.  So far Elsa is the only example.
I think it's a good point in the design space, but it does trade some
speed (I think at most a factor of three) in order to gain
maintainability.

-Scott