[LLVMdev] On LLD performance
davide at freebsd.org
Sun Mar 15 00:36:59 PDT 2015
On Sat, Mar 14, 2015 at 2:52 AM, Nick Kledzik <kledzik at apple.com> wrote:
> Thanks for this write up and the work you are doing to improve performance.
> A couple of thoughts:
> * In the old days, linking was I/O bound so the linker run tended to be the
> same speed as cat’ing all the input files together. Now that spinning disks
> are less common, I/O may no longer be the bottleneck.
> * How does the link time compare to the Windows system linker?
> * Where exactly in the Resolving phase is the bottleneck? Is it the symbol
> table or the other work? (Is much time spent growing (and rehashing) the
> DenseMap? If so, would setting a larger initial bucket size help?)
> * Regarding the Rafeal’s performance comparison with gold (where lld is
> about half the speed). The old binutil’s ld was a cross platform linker
> that supported multiple formats. The goal of gold was to have better
> performance by only supporting ELF. So I would expect gold to be faster
> than the general purpose lld.
After my wild and wrong guess from a couple of days ago, I realized it
was better spending some time analyzing code and profiling, rather
The machine where I tried to link clang using lld is an 8-core Xeon,
but I hardly was able to use more than two cores.
I ran my tests on UFS + 7200rpm disk ( I can provide details, if
needed). The OS is FreeBSD-11 from a month ago or such.
Flamegraph results (now zoomable):
Nick, to answer your question DenseMap ops are a generally an hotspot
for this kind of workload, although grow() operations aren't
particularly a big problem (at least in this case). I see them showing
up only for addReferencetoSymbol() and they account for large part of
the overall samples. In order to convince myself of my theory I
started tinkering and modified the initial size of the map but I
wasn't able to notice any non-negligible improvement.
What instead seems to be a bigger problem is symbolTable::replacement,
DenseMap operations are about 10% of the overall samples. It seems
that in order to find the replacement for a given atom we need to
traverse a chain of atoms until the last one is found. This results in
O(n) calls to find() where n is the length of the chain. I will try to
investigate and see if we can make something differently.
Please let me know if there's something else that shows up from the
graph, and/or you have any idea on how to improve. Sorry for the
output of symbols being a little bit verbose, the collector wasn't
written with C++ in mind.
More information about the llvm-dev