[LLVMdev] On LLD performance

Fri Mar 13 11:01:55 PDT 2015

On Fri, Mar 13, 2015 at 10:53 AM, Davide Italiano <davide at freebsd.org>
wrote:

> On Fri, Mar 13, 2015 at 10:15 AM, Rafael Espíndola
> <rafael.espindola at gmail.com> wrote:
> >>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
> >>> it is not included in .got.
> >>
> >> I have a fix for this. Will merge it.
> >
> > Thanks.
> >
> >>> .rodata: 9 449 278 bytes bigger
> >>> .eh_frame: 438 376 bytes bigger
> >>> .comment: 77 797 bytes bigger
> >>> .data.rel.ro: 48 056 bytes bigger
> >>
> >> Did you try --merge-strings with lld ? --gc-sections
> >
> >
> > I got
> >
> > warning: ignoring unknown argument: --gc-sections
> >
> > I will do a run with --merge-strings. This should probably be the
> > default to match other ELF linkers.
> >
>
> Unfortunately, --gc-sections isn't implemented on the GNU driver. I
> tried to enable it but I hit quite a few issues I'm slowly fixing. At
> the time of writing the Resolver reclaims live atoms.
>
>
> >>> The biggest difference that shows up is that lld has 1,152 context
> >>> switches, but the cpu utilization is still < 1. Maybe there is just a
> >>> threading bug somewhere?
> >>
> >> lld apparently is highly multithreaded, but I see your point. Maybe trying
> >> to do this exercise on /dev/shm can show more cpu utilization?
> >
> > Yes, the number just under 1 cpu utilized is very suspicious. As Rui
> > points out, there is probably some issue in the threading
> > implementation on Linux. One interesting experiment would be timing
> > gold and lld linking ELF on Windows (but I have only a Windows VM and
> > no idea what the "perf" equivalent is on Windows).
> >
> > I forgot to mention, the tests were run on tmpfs already.
> >
>
> I think we can make an effort to reduce the number of context
> switches. In particular, we might try to switch to a model where a task
> is the basic unit of computation and a thread pool of workers is
> responsible for executing these tasks.
> This way we can tune the number of threads fighting at the same time
> for the CPU, maybe with a reasonable default that can be overridden by
> the user using cmdline options.
>

We already split work that way. Please take a look at
include/lld/Core/Parallel.h. ThreadExecutor is a class that executes tasks,
which you submit by calling its add() method. A task can be any callable
object. Each ThreadExecutor spawns std::thread::hardware_concurrency()
threads, and we only instantiate one ThreadExecutor, so the worker threads
shouldn't compete against each other for processor time slots (unless
there's a bug).
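
For readers unfamiliar with the pattern, here is a minimal sketch of an
executor of that general shape: a fixed set of worker threads pulling
callables from a shared queue. It is not the code in Parallel.h. The names
SketchExecutor and countWithPool are hypothetical, and the queue, locking,
and shutdown behavior are assumptions made only for illustration.

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of a task executor; NOT lld's actual ThreadExecutor.
class SketchExecutor {
public:
  // Assumption: one worker per hardware thread unless the caller overrides it.
  explicit SketchExecutor(unsigned n = std::thread::hardware_concurrency()) {
    if (n == 0)
      n = 1; // hardware_concurrency() may return 0 when the count is unknown
    for (unsigned i = 0; i < n; ++i)
      workers_.emplace_back([this] { work(); });
  }

  // Lets the workers drain the queue, then joins them.
  ~SketchExecutor() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cond_.notify_all();
    for (std::thread &t : workers_)
      t.join();
  }

  // Submit any callable that takes no arguments.
  void add(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cond_.notify_one();
  }

private:
  void work() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty())
          return; // stop requested and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task(); // run outside the lock so other workers can dequeue
    }
  }

  std::mutex mutex_;
  std::condition_variable cond_;
  std::queue<std::function<void()>> tasks_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

// Usage example: submit n increments and return how many actually ran.
int countWithPool(int n) {
  std::atomic<int> counter{0};
  {
    SketchExecutor pool;
    for (int i = 0; i < n; ++i)
      pool.add([&counter] { ++counter; });
  } // the destructor waits for all queued tasks to finish
  return counter.load();
}
```

In a design like this, every dequeue takes the same mutex, so many very
small tasks can contend on the queue lock and cause workers to block and
wake repeatedly. Whether anything similar happens in lld would need to be
measured.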

> That said, as long as this would require some substantial changes I
> wouldn't go down that path until we have some strong evidence that the
> change is going to improve performance significantly. I feel that
> while context switches may have some impact on the final numbers,
> they will hardly account for a large part of the performance loss.
>
> Another thing that comes to my mind is that the relatively high number
> of context switches might be the effect of lock contention.
> If somebody has access to a VTune license and can run 'lock analysis'
> on it that would be greatly appreciated. I don't have a Linux
> laptop/setup but I'll try to collect some numbers on FreeBSD and
> investigate further over the weekend.
>
> Thanks,
>
>
> --
> Davide
>
> "There are no solved problems; there are only problems that are more
> or less solved" -- Henri Poincare