[llvm-dev] LLD: time to enable --threads by default

Wed Nov 23 17:00:26 PST 2016

On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <
> rafael.espindola at gmail.com> wrote:
>
>> Interesting. Might be worth giving a try again to the idea of creating
>> the file in anonymous memory and using a write to output it.
>>
>
> I'm not sure that will help. Even the kernel can't escape some of these
> costs; in modern 64-bit operating systems when you do a syscall you don't
> actually change the mappings (TLB flush would be expensive), so the cost of
> populating the page tables in order to read the pages is still there (and
> hence the serialization point remains). One alternative is to use multiple
> processes instead of multiple threads which would remove serialization
> point by definition (it also seems like it might be less invasive of a
> change, at least for the copying+relocating step).
>
> One experiment might be to add a hack to pre-fault all the files that are
> used, so that you can isolate that cost from the rest of the link. That
> will give you an upper bound on the speedup that there is to get from
> optimizing this.
>

I experimented to add MAP_POPULATE to LLVM's mmap in the hope that it would
do what you, but it made LLD 10% slower, and I cannot explain why.

> Pre-faulting the allocations removes the serialization bottleneck on the
> kernel VA, since after the page tables are fully populated, they become a
> read-only data structure and each core's hardware TLB walker can read it
> independently.
>
> For example, you could change elf::link to optionally take a map from file
> paths to buffers, which will override the native filesystem. Then in main()
> (before calling elf::link) you can map and pre-fault all the input files
> (which can be found from a trace of a previous run or whatever). By timing
> the subsequent call to elf::link you can get the desired measurement.
>
> The ability to pass in a map of buffers would allow other experiments that
> would be interesting. For example, the experiment above could be repeated
> with all the input buffers copied into a handful of 1GB pages. This would
> allow entirely eliminating the hardware TLB walking overhead for input
> buffers.
>
> -- Sean Silva
>
>
>>
>> Cheers,
>> Rafael
>>
>> On 23 November 2016 at 02:41, Sean Silva via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> >
>> >
>> > On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev
>> > <llvm-dev at lists.llvm.org> wrote:
>> >>
>> >> LLD supports multi-threading, and it seems to be working well as you
>> can
>> >> see in a recent result. In short, LLD runs 30% faster with --threads
>> option
>> >> and more than 50% faster if you are using --build-id (your mileage may
>> vary
>> >> depending on your computer). However, I don't think most users even
>> don't
>> >> know about that because --threads is not a default option.
>> >>
>> >> I'm thinking to enable --threads by default. We now have real users,
>> and
>> >> they'll be happy about the performance boost.
>> >>
>> >> Any concerns?
>> >>
>> >> I can't think of problems with that, but I want to write a few notes
>> about
>> >> that:
>> >>
>> >>  - We still need to focus on single-thread performance rather than
>> >> multi-threaded one because it is hard to make a slow program faster
>> just by
>> >> using more threads.
>> >>
>> >>  - We shouldn't do "too clever" things with threads. Currently, we are
>> >> using multi-threads only at two places where they are highly
>> parallelizable
>> >> by nature (namely, copying and applying relocations for each input
>> section,
>> >> and computing build-id hash). We are using parallel_for_each, and that
>> is
>> >> very simple and easy to understand. I believe this was a right design
>> >> choice, and I don't think we want to have something like
>> workqueues/tasks in
>> >> GNU gold, for example.
>> >
>> >
>> > Sorry for the late response.
>> >
>> > Copying and applying relocations is actually are not as parallelizable
>> as
>> > you would imagine in current LLD. The reason is that there is an
>> implicit
>> > serialization when mutating the kernel's VA map (which happens any time
>> > there is a minor page fault, i.e. the first time you touch a page of an
>> > mmap'd input). Since threads share the same VA, there is an implicit
>> > serialization across them. Separate processes are needed to avoid this
>> > overhead (note that the separate processes would still have the same
>> output
>> > file mapped; so (at least with fixed partitioning) there is no need for
>> > complex IPC).
>> >
>> > For `ld.lld -O0` on Mac host, I measured <1GB/s copying speed, even
>> though
>> > the machine I was running on had like 50 GB/s DRAM bandwidth; so the VA
>> > overhead is on the order of a 50x slowdown for this copying operation in
>> > this extreme case, so Amdahl's law indicates that there will be
>> practically
>> > no speedup for this copy operation by adding multiple threads. I've also
>> > DTrace'd this to see massive contention on the VA lock. LInux will be
>> better
>> > but no matter how good, it is still a serialization point and Amdahl's
>> law
>> > will limit your speedup significantly.
>> >
>> > -- Sean Silva
>> >
>> >>
>> >>
>> >>  - Run benchmarks with --no-threads if you are not focusing on
>> >> multi-thread performance.
>> >>
>> >>
>> >> _______________________________________________
>> >> LLVM Developers mailing list
>> >> llvm-dev at lists.llvm.org
>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >>
>> >
>> >
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161123/147f86ce/attachment.html>