[llvm-dev] LLD: time to enable --threads by default

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 23 16:53:14 PST 2016


On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <
rafael.espindola at gmail.com> wrote:

> Interesting. Might be worth giving a try again to the idea of creating
> the file in anonymous memory and using a write to output it.
>

I'm not sure that will help. Even the kernel can't escape some of these
costs: in modern 64-bit operating systems, a syscall doesn't actually
change the mappings (a TLB flush would be expensive), so the cost of
populating the page tables in order to read the pages is still there (and
hence the serialization point remains). One alternative is to use multiple
processes instead of multiple threads, which would remove the
serialization point by definition (it also seems like it might be a less
invasive change, at least for the copying+relocating step).
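Something along these lines (a rough POSIX sketch, not LLD code; the
function name, slice count, and buffer size are all made up) shows why
fixed partitioning across forked processes needs no IPC: the children
inherit a MAP_SHARED mapping, so each one faults and fills only its own
slice under its own page tables, and the parent sees all the writes.

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstring>

// Fill a shared buffer in NProc fixed slices, one child process per
// slice (a stand-in for the copy+relocate step). MAP_SHARED |
// MAP_ANONYMOUS makes the forked children write into the same physical
// pages, so no explicit IPC is needed.
bool parallel_fill(size_t Size, int NProc) {
  char *Out = (char *)mmap(nullptr, Size, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (Out == MAP_FAILED)
    return false;
  size_t Chunk = Size / NProc;
  for (int I = 0; I < NProc; ++I) {
    if (fork() == 0) {
      // Each child touches (and hence page-faults) only its own slice,
      // in its own address space.
      memset(Out + I * Chunk, 'A' + I, Chunk);
      _exit(0);
    }
  }
  for (int I = 0; I < NProc; ++I)
    wait(nullptr);
  // The parent observes every child's writes through the shared mapping.
  bool Ok = Out[0] == 'A' && Out[Size - 1] == 'A' + NProc - 1;
  munmap(Out, Size);
  return Ok;
}
```

With threads, all of those first-touch faults would contend on one VA
map; with processes, each child pays only for its own slice.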

One experiment might be to add a hack to pre-fault all the files that are
used, so that you can isolate that cost from the rest of the link. That
will give you an upper bound on the speedup that there is to get from
optimizing this.
Pre-faulting the allocations removes the serialization bottleneck on the
kernel VA, since after the page tables are fully populated, they become a
read-only data structure and each core's hardware TLB walker can read it
independently.
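To make the pre-fault hack concrete, here is a rough sketch (the
mapAndPrefault helper is hypothetical, not existing LLD code).
MAP_POPULATE asks a Linux kernel to populate the page tables eagerly;
the touch loop is a portable fallback that forces each page in by
reading one byte from it.

```cpp
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Map an input file read-only and pre-fault every page, so that later
// accesses during the link take no minor faults (and hence hit no
// VA-map serialization). Returns nullptr on failure.
const char *mapAndPrefault(const char *Path, size_t &SizeOut) {
  int FD = open(Path, O_RDONLY);
  if (FD < 0)
    return nullptr;
  struct stat St;
  fstat(FD, &St);
  SizeOut = St.st_size;
  int Flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
  Flags |= MAP_POPULATE; // Linux: populate page tables up front
#endif
  const char *Buf =
      (const char *)mmap(nullptr, SizeOut, PROT_READ, Flags, FD, 0);
  close(FD);
  if (Buf == MAP_FAILED)
    return nullptr;
  // Portable fallback: touch one byte per page. The volatile sink
  // keeps the compiler from optimizing the loads away.
  long Page = sysconf(_SC_PAGESIZE);
  volatile char Sink = 0;
  for (size_t Off = 0; Off < SizeOut; Off += Page)
    Sink += Buf[Off];
  (void)Sink;
  return Buf;
}
```

Timing elf::link with inputs mapped this way, versus cold, isolates the
page-fault cost from the rest of the link.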

For example, you could change elf::link to optionally take a map from file
paths to buffers, which will override the native filesystem. Then in main()
(before calling elf::link) you can map and pre-fault all the input files
(which can be found from a trace of a previous run or whatever). By timing
the subsequent call to elf::link you can get the desired measurement.
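Sketching what that override might look like (the BufferMap type and
lookupInput helper are hypothetical names, not an existing LLD
interface): the linker consults the map before touching the real
filesystem, so main() can hand it buffers it has already mapped and
pre-faulted.

```cpp
#include <map>
#include <string>
#include <string_view>
#include <optional>

// Hypothetical override layer: a map from file paths to pre-mapped
// buffers that the link consults before falling back to the native
// filesystem.
using BufferMap = std::map<std::string, std::string_view>;

std::optional<std::string_view> lookupInput(const BufferMap &Overrides,
                                            const std::string &Path) {
  auto It = Overrides.find(Path);
  if (It != Overrides.end())
    return It->second; // pre-faulted buffer supplied by the caller
  return std::nullopt; // caller falls back to opening the file itself
}
```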

The ability to pass in a map of buffers would allow other experiments that
would be interesting. For example, the experiment above could be repeated
with all the input buffers copied into a handful of 1GB pages. This would
allow entirely eliminating the hardware TLB walking overhead for input
buffers.
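The allocation side of that experiment might look like the sketch below
(allocMaybeHuge is a made-up helper, shown with generic huge pages
rather than 1GB pages specifically). A single 2 MiB page covers 512
4 KiB pages, so huge pages cut the TLB-entry count and page-walk depth
by a large factor; MAP_HUGETLB fails unless the system has a reserved
hugepage pool, so this falls back to normal pages when none is
configured.

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try to allocate a buffer backed by huge pages; fall back to normal
// pages if the kernel has no hugepage pool configured. GotHuge reports
// which path succeeded.
void *allocMaybeHuge(size_t Size, bool &GotHuge) {
  GotHuge = true;
#ifdef MAP_HUGETLB
  void *P = mmap(nullptr, Size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (P != MAP_FAILED)
    return P;
#endif
  GotHuge = false;
  return mmap(nullptr, Size, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```

Copying the input buffers into such an allocation before the link would
show how much of the remaining cost is hardware TLB walking rather than
fault serialization.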

-- Sean Silva


>
> Cheers,
> Rafael
>
> On 23 November 2016 at 02:41, Sean Silva via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> >
> >
> > On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> >>
> >> LLD supports multi-threading, and it seems to be working well as you can
> >> see in a recent result. In short, LLD runs 30% faster with the --threads
> >> option and more than 50% faster if you are using --build-id (your
> >> mileage may vary depending on your computer). However, I don't think
> >> most users even know about it, because --threads is not a default option.
> >>
> >> I'm thinking to enable --threads by default. We now have real users, and
> >> they'll be happy about the performance boost.
> >>
> >> Any concerns?
> >>
> >> I can't think of problems with that, but I want to write a few notes
> >> about that:
> >>
> >>  - We still need to focus on single-thread performance rather than
> >> multi-threaded performance, because it is hard to make a slow program
> >> faster just by using more threads.
> >>
> >>  - We shouldn't do "too clever" things with threads. Currently, we are
> >> using multi-threads only at two places where they are highly
> parallelizable
> >> by nature (namely, copying and applying relocations for each input
> section,
> >> and computing build-id hash). We are using parallel_for_each, and that
> is
> >> very simple and easy to understand. I believe this was a right design
> >> choice, and I don't think we want to have something like
> workqueues/tasks in
> >> GNU gold, for example.
> >
> >
> > Sorry for the late response.
> >
> > Copying and applying relocations are actually not as parallelizable as
> > you would imagine in current LLD. The reason is that there is an implicit
> > serialization when mutating the kernel's VA map (which happens any time
> > there is a minor page fault, i.e. the first time you touch a page of an
> > mmap'd input). Since threads share the same VA, there is an implicit
> > serialization across them. Separate processes are needed to avoid this
> > overhead (note that the separate processes would still have the same
> > output file mapped, so (at least with fixed partitioning) there is no
> > need for complex IPC).
> >
> > For `ld.lld -O0` on a Mac host, I measured <1GB/s copying speed, even
> > though the machine I was running on had something like 50 GB/s of DRAM
> > bandwidth; so the VA overhead amounts to roughly a 50x slowdown for this
> > copying operation in this extreme case, and Amdahl's law indicates that
> > there will be practically no speedup for this copy operation from adding
> > multiple threads. I've also DTrace'd this and seen massive contention on
> > the VA lock. Linux will be better, but no matter how good, it is still a
> > serialization point, and Amdahl's law will limit your speedup
> > significantly.
> >
> > -- Sean Silva
> >
> >>
> >>
> >>  - Run benchmarks with --no-threads if you are not focusing on
> >> multi-thread performance.
> >>
> >>
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
>