[LLVMdev] On LLD performance
Nick Kledzik
kledzik at apple.com
Fri Mar 13 18:52:17 PDT 2015
Rui,
Thanks for this write up and the work you are doing to improve performance. A couple of thoughts:
* In the old days, linking was I/O bound so the linker run tended to be the same speed as cat’ing all the input files together. Now that spinning disks are less common, I/O may no longer be the bottleneck.
* How does the link time compare to the Windows system linker?
* Where exactly in the Resolving phase is the bottleneck? Is it the symbol table or the other work? (Is much time spent growing (and rehashing) the DenseMap? If so, would setting a larger initial bucket size help? See the sketch after this list.)
* Regarding Rafael's performance comparison with gold (where lld is about half the speed): the old binutils ld was a cross-platform linker that supported multiple formats. The goal of gold was to get better performance by supporting only ELF. So I would expect gold to be faster than the general-purpose lld.
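A minimal sketch of the pre-sizing idea from the third bullet, assuming the resolver keeps its symbols in a StringRef-keyed DenseMap; the key/value types and the size estimate are illustrative, not lld's actual data structures:

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/StringRef.h"

    // Hypothetical resolver-side symbol table.
    static llvm::DenseMap<llvm::StringRef, const void *> SymbolTable;

    // If we can estimate the total symbol count up front (say, by summing the
    // symbol counts recorded in each input file's header), reserving once
    // avoids the repeated grow-and-rehash cycles a default-sized DenseMap
    // goes through during resolution.
    void presizeSymbolTable(unsigned EstimatedSymbols) {
      SymbolTable.reserve(EstimatedSymbols);
    }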
-Nick
On Mar 11, 2015, at 3:02 PM, Rui Ueyama <ruiu at google.com> wrote:
> I spent a week optimizing LLD performance and just wanted to share what I found. Also, if anyone has a good idea on how to make it faster, I'd like to hear it.
>
> My focus is mainly on Windows, but my optimizations are generally platform-neutral. I'm aiming at both single-thread and multi-thread performance.
>
> r231434 is the change with the largest impact. It greatly reduced the time to link large binaries, as it changed the order (in the big-O sense) of the number of symbols we need to process for files within --start-group/--end-group. If you have many library files in a group, you will see a significant performance gain. That's probably not often the case on Unix. On the other hand, on Windows (and IIRC on Darwin) we move all library files to the end of the input file list and group them together, so it's effective there. It improves single-thread performance.
>
> r231454 applies relocations in parallel in the writer using parallel_for. Because the number of relocations is usually pretty large and each application is independent, you get a basically linear speedup from using threads. For example, applying 10M relocations previously took about 2 seconds on my machine (the number in the commit message is wrong), but it now takes 300 milliseconds. This technique should be applicable to other ports.
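A generic sketch of the technique, using plain std::thread chunking as a stand-in for lld's parallel_for; the Reloc struct and the fixup it applies are placeholders, not the real COFF relocation logic:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <thread>
    #include <vector>

    struct Reloc { uint32_t Offset; uint32_t Addend; }; // simplified stand-in

    // Each relocation only touches its own target bytes, so the chunks can be
    // processed concurrently with no synchronization.
    void applyRelocsParallel(const std::vector<Reloc> &Relocs, uint8_t *Buf) {
      unsigned N = std::max(1u, std::thread::hardware_concurrency());
      size_t Chunk = (Relocs.size() + N - 1) / N;
      std::vector<std::thread> Workers;
      for (unsigned T = 0; T < N; ++T) {
        size_t Begin = T * Chunk, End = std::min(Relocs.size(), Begin + Chunk);
        if (Begin >= End)
          break;
        Workers.emplace_back([&, Begin, End] {
          for (size_t I = Begin; I < End; ++I)
            // Placeholder fixup; a real port would dispatch on relocation type.
            *reinterpret_cast<uint32_t *>(Buf + Relocs[I].Offset) += Relocs[I].Addend;
        });
      }
      for (std::thread &W : Workers)
        W.join();
    }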
>
> r231585 changes the algorithm that creates base relocations so that we can use parallel_sort. Unfortunately, base relocations are Windows-specific, so I don't think this technique is applicable to other ports.
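A rough sketch of the shape of that computation, assuming the base relocation RVAs are sorted and then grouped by 4K page; std::sort stands in here for the parallel sort, and the block layout is simplified:

    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <vector>

    // After sorting, RVAs that share a 4K page are contiguous, so they can be
    // emitted as one .reloc block per page.
    std::map<uint32_t, std::vector<uint16_t>>
    groupBaseRelocs(std::vector<uint32_t> RVAs) {
      std::sort(RVAs.begin(), RVAs.end()); // a parallel sort in lld
      std::map<uint32_t, std::vector<uint16_t>> Blocks; // page RVA -> offsets
      for (uint32_t RVA : RVAs)
        Blocks[RVA & ~0xfffu].push_back(static_cast<uint16_t>(RVA & 0xfffu));
      return Blocks;
    }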
>
> r231584 and r231432 are effective but minor improvements.
>
> At this point, the resolver is the bottleneck. The readers introduce surprisingly little delay when linking large binaries, probably thanks to parallel file loading and archive member preloading, or maybe just because file parsing is an easy task. The preloading hit rate is >99%, so when you need a symbol from an archive file, its member is almost always already parsed and ready to be used, and ArchiveFile::find() returns immediately with a result. The writers and other post-resolver passes seem reasonably fast. The dominant factor is the resolver.
>
> What the resolver does is, roughly speaking, read files until all symbols are resolved, putting every symbol received from a file into a hash table. That's a fairly tight loop. In r231549 I cut the number of hash table lookups, but it looks hard to optimize it much further than that.
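A minimal sketch of the single-probe pattern that avoids a separate find() followed by an insert() on the same name; Atom and the conflict policy here are placeholders, not lld's actual resolver code:

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/StringRef.h"

    struct Atom {};

    // Placeholder policy; a real resolver considers weak vs. strong
    // definitions, undefined references, and so on.
    static const Atom *resolveConflict(const Atom *Existing, const Atom *New) {
      return Existing ? Existing : New;
    }

    // DenseMap::insert both looks the name up and inserts it, so one hash
    // probe handles both the "new symbol" and the "already defined" cases.
    void addSymbol(llvm::DenseMap<llvm::StringRef, const Atom *> &Table,
                   llvm::StringRef Name, const Atom *A) {
      auto Res = Table.insert({Name, A});
      if (!Res.second) // name already present: pick a winner in place
        Res.first->second = resolveConflict(Res.first->second, A);
    }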
>
> An idea to make the resolver faster would be to use a concurrent hash map to insert new symbols in parallel. Assuming symbols from the same file don't conflict with each other (I think that's a valid assumption), this can be parallelized. I wanted a single-thread performance gain, though. (Also, concurrent hash maps are not currently available in LLVM.)
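One possible work-around, given that LLVM has no concurrent hash map, is sharding: give each shard its own lock and map, and pick the shard from the name's hash so inserts into different shards proceed in parallel. A minimal sketch with illustrative types:

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/ADT/Hashing.h"
    #include "llvm/ADT/StringRef.h"
    #include <mutex>

    struct Atom;

    class ShardedSymbolTable {
      static const unsigned NumShards = 16;
      struct Shard {
        std::mutex Mu;
        llvm::DenseMap<llvm::StringRef, const Atom *> Map;
      };
      Shard Shards[NumShards];

    public:
      // Returns false if the name was already present in its shard.
      bool insert(llvm::StringRef Name, const Atom *A) {
        Shard &S = Shards[llvm::hash_value(Name) % NumShards];
        std::lock_guard<std::mutex> Lock(S.Mu);
        return S.Map.insert({Name, A}).second;
      }
    };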
>
> Another idea is to eliminate the preprocessing pass that creates reverse edges in the atom graph. Some edges in the graph need to be treated as bi-directional, so that the atoms they connect become live or dead as a group regardless of the direction of the edges (so that symbols something else depends on are not reclaimed by the garbage collector). We should probably just add two edges for each bi-directional edge in the first place, which would eliminate the need for the preprocessing entirely. It's definitely doable.
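A minimal sketch of that approach with a toy graph representation (not lld's Atom/Reference classes): record both directions when the edge is created, so no separate reverse-edge pass is needed before dead stripping:

    #include <vector>

    struct Node {
      std::vector<Node *> Edges; // edges followed by the liveness marker
    };

    // For an edge whose endpoints must stay live or be reclaimed together,
    // record it in both directions at construction time.
    void addBidirectionalEdge(Node &A, Node &B) {
      A.Edges.push_back(&B);
      B.Edges.push_back(&A);
    }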
>
> An interesting idea that unfortunately didn't work is interning symbol names. I thought that computing the hash value of a symbol name was the bottleneck in the hash table, as C++ symbol names can be very long. So I wrote a thread-safe string pool to intern (uniquify) symbol strings, so that symbol equality can be checked with a pointer comparison. The interning was done in the reader, which is parallelized, so I expected it to improve the single-thread performance of the resolver. But I didn't observe a meaningful difference in performance.
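A minimal thread-safe interner in the spirit of that experiment (not lld's actual string pool): every distinct name gets one stable string, so equality checks become pointer comparisons:

    #include <mutex>
    #include <string>
    #include <unordered_set>

    class StringPool {
      std::mutex Mu;
      std::unordered_set<std::string> Pool;

    public:
      // The returned pointer is stable for the lifetime of the pool, so two
      // interned names are equal iff their pointers are equal.
      const std::string *intern(const std::string &S) {
        std::lock_guard<std::mutex> Lock(Mu);
        return &*Pool.insert(S).first;
      }
    };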
>
> As always, the bottlenecks were not where I expected them (I've learned that lesson again and again, but I'm surprised every time). Maybe we need to revisit the "optimized" code in LLD to see whether it is actually premature optimization. If it is, we should rewrite it as simple code.
>
> I'm currently trying the two ideas above. This mail is just FYI, but if you have any recommendations, ideas, or whatever, hit "reply".