[LLVMdev] On LLD performance

Wed Mar 11 15:02:13 PDT 2015

I spent a week to optimize LLD performance and just wanted to share things
what I found. Also if there's anyone who have a good idea on how to make it
faster, I'd like to hear.

My focus is mainly on Windows, but my optimizations are generally platform
neutral. I aim both single-thread and multi-thread performance.

r231434 <http://reviews.llvm.org/rL231454> is a change that has the largest
impact. It greatly reduced linking time to link large binaries, as it
changed the order of number of symbols we need to process for files within
--start-group/--end-group. If you have many library files in a group, you
would see a significant performance gain. It's probably not often the case
on Unix. On the other hand, on Windows (and IIRC on Darwin), we move all
library files to end of input file list and group them together, so it's
effective. It improves single-thread performance.

r231454 <http://reviews.llvm.org/rL231454> is to apply relocations in
parallel in the writer using parallel_for. Because number of relocations
are usually pretty large, and each application is independent,
you basically get linear performance gain by using threads. Previously it
took about 2 seconds to apply 10M relocations (the number in the commit
message is wrong) on my machine, but it's now 300 milliseconds, for
example. This technique should be applicable to other ports.

r231585 <http://reviews.llvm.org/rL231585> changes the algorithm to create
base relocations so that we can use parallel_sort. Unfortunately, base
relocations are Windows specific, I don't think this technique is
applicable to other ports.

r231584 <http://reviews.llvm.org/rL231584> and r231432
<http://reviews.llvm.org/rL231432> are effective but minor improvements.

At this point, the resolver is bottleneck. Readers introduce surprisingly
small delay when linking large binaries, probably thanks to parallel file
loading and archive member preloading. Or just that file parsing is an easy
task. Preloading hit rate is >99%, so when you need a symbol from an
archive file, its member is almost always parsed and ready to be used.
ArchiveFile::find() returns immediately with a result. Writers and other
post-resolver passes seem reasonably fast. The dominant factor is the
resolver.

What the resolver does is, roughly speaking, reading files until it
resolves all symbols, and put all symbols received from files to a hash
table. That's a kind of tight loop. In r231549
<http://reviews.llvm.org/rL231549> I cut number of hash table lookup, but
looks like it's hard to optimize that more than this.

An idea to make the resolver faster would be to use a concurrent hash map
to insert new symbols in parallel. Assuming symbols from the same file
don't conflict each other (I think it's a valid assumption), this can be
parallelized. I wanted single-thread performance gain, though. (Also,
concurrent hash maps are not currently available in LLVM.)

Another idea is to eliminate a preprocess to create reverse edges of the
atom graph. Some edges in the graph need to be treated as bi-directional,
so that all connected edges become live or dead as a group regardless of
direction of edges (so that depending symbols are not reclaimed by garbage
collector). We probably should always add two edges for each bi-directional
edge in the first place to eliminate the need of preprocessing entirely.
It's definitely doable.

An interesting idea that unfortunately didn't work is interning symbol
names. I thought that computing a hash value of a symbol name is bottleneck
in hash table, as C++ symbols can be very long. So I wrote a thread-safe
string pool to intern (unique-fy) symbol strings, so that we can compare
symbol equivalence by pointer comparison. String interning was done in the
reader, which is parallelized, and I expected it would improve
single-thread performance of the resolver. But I didn't observe meaningful
difference in terms of performance.

Bottlenecks were not there where I expected, as always (I learned about
that again and again, but I was surprised every time). Maybe we need to
revisit "optimized" code in LLD to see if it's not really premature
optimization. If it is, we should rewrite with simple code.

I'm currently trying two ideas above. This mail is just FYI, but if you
have any recommendations or ideas or whatever, hit "reply".
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150311/d4850540/attachment.html>