[LLVMdev] On LLD performance

Sun Mar 15 00:36:59 PDT 2015

On Sat, Mar 14, 2015 at 2:52 AM, Nick Kledzik <kledzik at apple.com> wrote:
> Rui,
>
> Thanks for this write up and the work you are doing to improve performance.
> A  couple of thoughts:
>
> * In the old days, linking was I/O bound so the linker run tended to be the
> same speed as cat’ing all the input files together.  Now that spinning disks
> are less common, I/O may no longer be the bottleneck.
>
> * How does the link time compare to the Windows system linker?
>
> * Where exactly in the Resolving phase is the bottleneck?  Is it the symbol
> table or the other work?  (Is much time spent growing (and rehashing) the
> DenseMap?  If so, would setting a larger initial bucket size help?)
>
> * Regarding the Rafeal’s performance comparison with gold (where lld is
> about half the speed).  The old binutil’s ld was a cross platform linker
> that supported multiple formats.  The goal of gold was to have better
> performance by only supporting ELF.  So I would expect gold to be faster
> than the general purpose lld.
>
>
> -Nick
>

After my wild and wrong guess from a couple of days ago, I realized it
was better spending some time analyzing code and profiling, rather
than speculating.
The machine where I tried to link clang using lld is an 8-core Xeon,
but I hardly was able to use more than two cores.
I ran my tests on UFS + 7200rpm disk ( I can provide details, if
needed). The OS is FreeBSD-11 from a month ago or such.

Flamegraph results (now zoomable):
https://people.freebsd.org/~davide/llvm/lld.svg

Nick, to answer your question DenseMap ops are a generally an hotspot
for this kind of workload, although grow() operations aren't
particularly a big problem (at least in this case). I see them showing
up only for addReferencetoSymbol() and they account for large part of
the overall samples. In order to convince myself of my theory I
started tinkering and modified the initial size of the map but I
wasn't able to notice any non-negligible improvement.

What instead seems to be a bigger problem is symbolTable::replacement,
DenseMap operations are about 10% of the overall samples. It seems
that in order to find the replacement for a given atom we need to
traverse a chain of atoms until the last one is found. This results in
O(n) calls to find() where n is the length of the chain. I will try to
investigate and see if we can make something differently.

Please let me know if there's something else that shows up from the
graph, and/or you have any idea on how to improve. Sorry for the
output of symbols being a little bit verbose, the collector wasn't
written with C++ in mind.

--
Davide