[LLVMdev] On LLD performance
davide at freebsd.org
Fri Mar 13 10:53:14 PDT 2015
On Fri, Mar 13, 2015 at 10:15 AM, Rafael Espíndola
<rafael.espindola at gmail.com> wrote:
>>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
>>> it is not included in .got.
>> I have a fix for this. Will merge it.
>>> .rodata: 9 449 278 bytes bigger
>>> .eh_frame: 438 376 bytes bigger
>>> .comment: 77 797 bytes bigger
>>> .data.rel.ro: 48 056 bytes bigger
>> Did you try --merge-strings with lld ? --gc-sections
> I got
> warning: ignoring unknown argument: --gc-sections
> I will do a run with --merge-strings. This should probably be the
> default, to match other ELF linkers.
Unfortunately, --gc-sections isn't implemented in the GNU driver. I
tried to enable it, but I hit quite a few issues that I'm slowly
fixing. At the time of writing, the Resolver reclaims live atoms.
>>> The biggest difference that shows up is that lld has 1,152 context
>>> switches, but the cpu utilization is still < 1. Maybe there is just a
>>> threading bug somewhere?
>> lld apparently is highly multithreaded, but I see your point. May be trying
>> to do this exercise on /dev/shm can show more cpu utilization ?
> Yes, the number just under 1 cpu utilized is very suspicious. As Rui
> points out, there is probably some issue in the threading
> implementation on linux. One interesting experiment would be timing
> gold and lld linking ELF on Windows (but I have only a Windows VM and
> no idea what the "perf" equivalent is on Windows).
> I forgot to mention, the tests were run on tmpfs already.
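(For what it's worth, on Linux the context-switch count can be broken
out with perf's software events; a hedged sketch, with the actual link
invocation substituted for the placeholder:)

```shell
# Count scheduling events for a single link; "<link command>" is a
# placeholder for whatever invocation is being timed.
perf stat -e context-switches,cpu-migrations,task-clock -- <link command>
```

Comparing task-clock against the elapsed time gives the same < 1 CPU
utilization figure quoted above.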
I think we can make an effort to reduce the number of context
switches. In particular, we might try to switch to a model where a
task is the basic unit of computation and a thread pool of workers is
responsible for executing those tasks. That way we can tune the number
of threads competing for the CPU, with a reasonable default that the
user can override on the command line.
That said, since this would require some substantial changes, I
wouldn't go down that path until we have strong evidence that the
change will improve performance significantly. I feel that while
context switches may have some impact on the final numbers, they are
unlikely to account for a large part of the performance loss.
Another thing that comes to mind is that the relatively high number of
context switches might be an effect of lock contention. If somebody
has access to a VTune license and can run a 'lock analysis' on it,
that would be greatly appreciated. I don't have a Linux laptop/setup,
but I'll try to collect some numbers on FreeBSD and investigate
further over the weekend.
"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincaré