[cfe-dev] distcc implementation

Holger Schurig holgerschurig at gmail.com
Wed Feb 17 00:02:04 PST 2010


> 2. What stages of the compilation are worth parallelizing (at
> least for a first step)?

There are benchmarks out there that show how much time the 
compiler spends in each stage (preprocessing, parsing, code 
generation).

You should also spend some time understanding how distcc (or 
ccache) works for gcc. AFAIK it goes this way: the source code 
gets run through the preprocessor on the source machine. The 
preprocessor reads all the *.h files on the source machine and 
generates one huge file. The benefit: the other machines that 
help compile the code won't need the same headers installed; 
they just get one file to process. They then parse it, compile 
it, and produce a *.o file.

Please note that distcc 3 adds a new mode where you don't 
preprocess the sources. This makes the distribution process 
faster, but you need identical system headers on all boxes. It's 
an optional mode, though. See http://distcc.org for more info.


The *.o file then gets transferred back to the source machine, 
which can do the linking once all *.o files have arrived.
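To make the benefit concrete, here's a toy Python sketch of that 
flow (not distcc's actual code; the preprocess() helper and the 
HEADERS dict are invented for illustration). Inlining the headers 
yields one self-contained file that a remote box can compile 
without any *.h files installed:

```python
# Toy illustration of why shipping preprocessed source means
# remote workers need no headers. HEADERS stands in for the *.h
# files that only exist on the source machine.
HEADERS = {
    "util.h": "int add(int a, int b);\n",
}

def preprocess(source: str, headers: dict) -> str:
    """Inline every #include "..." line, like `cc -E` would.

    A real preprocessor also handles macros, #ifdef, etc.; this
    sketch only expands includes to make the point.
    """
    out = []
    for line in source.splitlines():
        if line.startswith('#include "'):
            name = line.split('"')[1]
            out.append(headers[name])  # header text is inlined here
        else:
            out.append(line + "\n")
    return "".join(out)

source = '#include "util.h"\nint main() { return add(1, 2); }\n'
one_big_file = preprocess(source, HEADERS)

# The remote box receives only `one_big_file` -- no util.h needed:
print('#include' in one_big_file)   # the directive is gone
print('int add' in one_big_file)    # the header's contents are inlined
```

The remote machines get `one_big_file` and nothing else, which is 
exactly why they don't need matching headers in this mode.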


ccache works similarly: it computes a hash over the preprocessed 
code and stores the resulting *.o in a cache with this hash as 
the key. If a *.o with the same hash already exists, it hands 
that *.o file straight back, short-cutting 
parsing/code generation.
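That core idea fits in a few lines of Python (a sketch under 
stated assumptions, not ccache's real code: fake_compile() and 
cached_compile() are invented stand-ins, and real ccache also 
hashes the compiler version, among other things):

```python
# Sketch of ccache's core trick: key the object file by a hash of
# the preprocessed source plus the compiler flags, so a repeated,
# identical compilation is served from the cache.
import hashlib

cache = {}          # hash key -> fake *.o bytes
compilations = 0    # counts how often we "really" compile

def fake_compile(preprocessed: str, flags: str) -> bytes:
    """Stand-in for the expensive parse/codegen step."""
    global compilations
    compilations += 1
    return (flags + "|" + preprocessed).encode()  # pretend *.o

def cached_compile(preprocessed: str, flags: str) -> bytes:
    key = hashlib.sha256((flags + "\0" + preprocessed).encode()).hexdigest()
    if key not in cache:
        cache[key] = fake_compile(preprocessed, flags)  # miss: compile
    return cache[key]                                   # hit: reuse *.o

obj1 = cached_compile("int main() { return 0; }", "-O2")
obj2 = cached_compile("int main() { return 0; }", "-O2")  # cache hit
obj3 = cached_compile("int main() { return 0; }", "-O0")  # flags differ

print(compilations)  # 2: the second call never reached the compiler
```

Note that the flags go into the hash too: the same source built 
with -O0 must not be served the -O2 object.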


> 4. Are there any examples of code(preferably in real-world
> projects) which would lend themselves to parallel compilation
> which come to mind?

Almost all "big" source code bases. If you have a small code base 
with only 4 *.c files, it's hardly worth going via distcc. But 
if you have 1000 *.c files, it makes a difference :-)   E.g. 
compiling LLVM with distcc can greatly speed up the build, and 
the same is true for Qt, some KDE programs, Mozilla, 
OpenOffice, etc.

I also use ccache and distcc when cross-compiling, with the 
OpenEmbedded.org build environment.


> 5. Where should I start? :). Obviously this is a pretty large
> undertaking, but is there any documentation that I should look
> at? Any particular source files that would be relevant?

I'd reuse most of distcc's work, e.g. learn about their 
protocol.

Then I would start with the preprocessed (non-pump) method. 
Learn where you can intercept the preprocessed stream. That 
should be easy enough, because there's a compiler switch (-E) 
that does this.

Now you need to intercept this preprocessed stuff, transport it 
to the remote site, and compile it there. For this you'll need 
to write an llvm-distcc daemon. You also need to transport the 
*.o back. As the real distcc has already found solutions for 
this, you don't need to re-invent the wheel.
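The daemon part can be sketched as a toy Python client/server. 
The wire format here is invented for illustration (distcc has 
its own real protocol), and the "compilation" is faked: the 
client ships preprocessed source, the daemon returns the 
"object" bytes:

```python
# Toy daemon sketch: client sends preprocessed source over a
# socket, the remote worker "compiles" it and ships the *.o back.
import socket
import threading

def daemon(server_sock):
    """One-shot remote worker: receive source, return object bytes."""
    conn, _ = server_sock.accept()
    with conn:
        source = b""
        while True:                       # read until client half-closes
            chunk = conn.recv(4096)
            if not chunk:
                break
            source += chunk
        conn.sendall(b"OBJ:" + source)    # stand-in for real codegen

server = socket.socket()
server.bind(("127.0.0.1", 0))             # any free port on localhost
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=daemon, args=(server,))
t.start()

client = socket.socket()
client.connect(("127.0.0.1", port))
client.sendall(b"int main() { return 0; }")
client.shutdown(socket.SHUT_WR)           # signal: source is complete

obj = b""                                 # collect the returned *.o
while True:
    chunk = client.recv(4096)
    if not chunk:
        break
    obj += chunk
client.close()
t.join()
server.close()

print(obj[:4])  # the fake object file comes back prefixed with OBJ:
```

The half-close (SHUT_WR) plays the role of a length header here; 
a real protocol would frame the payload explicitly.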

The "driver" on the local box could simply block while the remote 
compiles stuff, so someone can run "make -j10" when they have 10 
remote boxes (or 5 remote boxes with dual cores).


Hey, but the fun of such a project is making a plan yourself. 
Otherwise it's just dumb coding of other people's ideas :-)


-- 
http://www.holgerschurig.de
