[cfe-dev] token caching ("PCH")

Tue Oct 21 10:43:46 PDT 2008

On Oct 21, 2008, at 9:24 AM, Cédric Venet wrote:

>> Added the start of a prototype implementation of PCH based on token  
>> caching.
>
> I only use precompiled header with visual studio. VS requirement for  
> precompiled headers is that they must be included first and always  
> in the same order (so I usually use one stdafx.h file to group them  
> all). This seems like a minimal requirement for PCH and your  
> solution need to follow it too (since the token are cached after  
> been preprocessed isn't it?). Then, why not simply serialize the AST  
> at the end of the precompiled headers and reload it when needed? it  
> would allow to save parse and sema time instead of just file  
> reading, lex and preprocessing (Do you have some figure on what take  
> most of the time? I would have said sema, at least for C++, but  
> perhaps readings the files is the more time consuming task).
> Perhaps the reponse is simply that AST serialization is still  
> broken, then ignore me :)

Hi Cedric,

Redirecting to cfe-dev.

These are fair questions.  The short answer is that this is a stepping  
stone towards other forms of PCH support in Clang.

Here is the long answer.

In the long term the leading idea on the table is to implement PCH via  
lazy deserialization of ASTs.  This means you only deserialize the  
pieces of the header that you actually need.  Beyond requiring  
serialization support to be fully implemented, there are many details  
to get right.  It's a complicated solution that will take some time to  
implement.  In the meantime, we would like to have other forms of  
"PCH" available today (which also may be compatible with other PCH  
solutions we implement).

The approach we are investigating right now is "token caching".  The  
idea is to cache un-preprocessed (raw) tokens included from all the  
files sucked in from the prefix header.  The cached tokens can then be  
used in a manner similar to how precompiled headers are used today:  
instead of lexing from the headers, we simply read in the tokens.  The  
idea is to drastically reduce lexing time.

Here is the motivation for this approach.

We did some extensive timings of Clang.  Clang is 2x faster than GCC  
without PCH, and GCC with PCH is (on average) 2x faster than Clang.   
This is on Mac OS X where we have some really huge headers (including  
Cocoa.h is about 11MB of source code).

The majority of the work (50-60%) at -fsyntax-only is spent lexing.   
Thus simply speeding up lexing significantly will have a substantial  
performance win.  Moreover, a significant amount of time is spent just  
stat'ing header files, opening them, reading them.  Placing the cached  
tokens from a prefix header can remove a significant amount of time  
from lexing as well as remove a significant amount of overhead from  
system calls.  Further, caching the lexed tokens is a simple solution  
that is trivial to implement, which means we will have something that  
works in the short term while we investigate other (complementary)  
solutions.

As to your point about simply deserializing all the ASTs in a header,  
we (Daniel, Steve, Chris, and I) believe that would give roughly the  
same performance as the token caching approach.  There are three  
arguments for this.

First, the time spent in Sema itself is largely spent allocating  
objects, and performing a deserialization of all the ASTs will recover  
only a fraction of time from Sema as we need to allocate the same  
number of objects.  While it would also recover parse time, only 10%  
of -fsyntax-only is spent doing parsing.  Also, object deserialization  
isn't free (there is computation involved in reading VBR encoded  
integers, parsing bitcode records, etc.), so any time recovered from  
removing parsing would probably be replaced by work doing  
deserialization.

Second, deserializing all of the ASTs in the prefix header would also  
be far more complicated than caching tokens; not only would  
serialization need to be feature complete (which we plan on doing  
anyway) but more aspects of Clang, e.g. Sema, would need to be aware  
of our PCH solution.  The token caching approach means that only the  
Preprocessor needs to be aware that it is getting its tokens from  
another source; Parse and Sema remain unaffected.

Third, token caching is also something we can probably implement in a  
week or so, so it is something we can quickly evaluate to see if it  
buys us anything.  Further, token caching has some other incidental  
benefits; for example, we will be able to get better measurements of  
how much time we spend in preprocessing.

As I mentioned earlier, the ultimate solution (as currently  
conceptualized) for PCH in Clang is to indeed use AST serialization,  
but only to deserialize the pieces of the header that are needed.   
This requires the ability to do lazy deserialization of ASTs in Sema,  
something which our current infrastructure just doesn't really support  
right now.  Moreover, C++ will bring a whole host of other  
complexities when considering fast PCH support, so there is some  
benefit in waiting until C++ support progresses a little further so we  
can a better idea of the right design tradeoffs for PCH support for C++.

Ted