[llvm-dev] LLD: time to enable --threads by default

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 23 17:29:50 PST 2016


On Wed, Nov 23, 2016 at 5:00 PM, Rui Ueyama <ruiu at google.com> wrote:

> On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>>> On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <
>>> rafael.espindola at gmail.com> wrote:
>>
>>> Interesting. Might be worth giving a try again to the idea of creating
>>> the file in anonymous memory and using a write to output it.
>>>
>>
>> I'm not sure that will help. Even the kernel can't escape some of these
>> costs; in modern 64-bit operating systems when you do a syscall you don't
>> actually change the mappings (TLB flush would be expensive), so the cost of
>> populating the page tables in order to read the pages is still there (and
>> hence the serialization point remains). One alternative is to use multiple
>> processes instead of multiple threads, which would remove the serialization
>> point by definition (it also seems like it might be a less invasive change,
>> at least for the copying+relocating step).
>>
>> One experiment might be to add a hack to pre-fault all the files that are
>> used, so that you can isolate that cost from the rest of the link. That
>> will give you an upper bound on the speedup that there is to get from
>> optimizing this.
>>
>
> I experimented with adding MAP_POPULATE to LLVM's mmap in the hope that it
> would do what you described, but it made LLD 10% slower, and I cannot
> explain why.
>

It may be that LLD does not touch every page of the input files, so
MAP_POPULATE is causing the kernel to do unnecessary work faulting in
things that LLD will never look at.

The purpose of the experiment I described, however, is not for those changes
to make LLD faster per se, but to measure how much faster LLD could be made
by optimizing this; with something like ftrace or dtrace you could measure
the kernel time spent in MAP_POPULATE and subtract it to get a similar
measurement.

-- Sean Silva


>
>
>> Pre-faulting the allocations removes the serialization bottleneck on the
>> kernel VA, since after the page tables are fully populated, they become a
>> read-only data structure and each core's hardware TLB walker can read it
>> independently.
>>
>> For example, you could change elf::link to optionally take a map from
>> file paths to buffers, which will override the native filesystem. Then in
>> main() (before calling elf::link) you can map and pre-fault all the input
>> files (which can be found from a trace of a previous run or whatever). By
>> timing the subsequent call to elf::link you can get the desired measurement.
>>
>> The ability to pass in a map of buffers would allow other experiments
>> that would be interesting. For example, the experiment above could be
>> repeated with all the input buffers copied into a handful of 1GB pages.
>> This would allow entirely eliminating the hardware TLB walking overhead for
>> input buffers.
>>
>> -- Sean Silva
>>
>>
>>>
>>> Cheers,
>>> Rafael
>>>
>>> On 23 November 2016 at 02:41, Sean Silva via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>> >
>>> >
>>> > On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev
>>> > <llvm-dev at lists.llvm.org> wrote:
>>> >>
>>> >> LLD supports multi-threading, and it seems to be working well, as you
>>> >> can see in a recent result. In short, LLD runs 30% faster with the
>>> >> --threads option and more than 50% faster if you are using --build-id
>>> >> (your mileage may vary depending on your computer). However, I don't
>>> >> think most users even know about it, because --threads is not a
>>> >> default option.
>>> >>
>>> >> I'm thinking of enabling --threads by default. We now have real users,
>>> >> and they'll be happy about the performance boost.
>>> >>
>>> >> Any concerns?
>>> >>
>>> >> I can't think of any problems with that, but I want to write a few
>>> >> notes about it:
>>> >>
>>> >>  - We still need to focus on single-thread performance rather than
>>> >> multi-threaded performance, because it is hard to make a slow program
>>> >> faster just by using more threads.
>>> >>
>>> >>  - We shouldn't do "too clever" things with threads. Currently, we are
>>> >> using multiple threads only in two places where the work is highly
>>> >> parallelizable by nature (namely, copying and applying relocations for
>>> >> each input section, and computing the build-id hash). We are using
>>> >> parallel_for_each, and that is very simple and easy to understand. I
>>> >> believe this was the right design choice, and I don't think we want to
>>> >> have something like the workqueues/tasks in GNU gold, for example.
>>> >
>>> >
>>> > Sorry for the late response.
>>> >
>>> > Copying and applying relocations is actually not as parallelizable as
>>> > you would imagine in current LLD. The reason is that there is an
>>> > implicit serialization when mutating the kernel's VA map (which happens
>>> > any time there is a minor page fault, i.e. the first time you touch a
>>> > page of an mmap'd input). Since threads share the same VA, there is an
>>> > implicit serialization across them. Separate processes are needed to
>>> > avoid this overhead (note that the separate processes would still have
>>> > the same output file mapped, so (at least with fixed partitioning)
>>> > there is no need for complex IPC).
>>> >
>>> > For `ld.lld -O0` on a Mac host, I measured <1GB/s copying speed, even
>>> > though the machine I was running on had something like 50 GB/s of DRAM
>>> > bandwidth; so the VA overhead amounts to roughly a 50x slowdown for
>>> > this copying operation in this extreme case, and Amdahl's law indicates
>>> > that there will be practically no speedup for this copy operation from
>>> > adding multiple threads. I've also DTrace'd this and seen massive
>>> > contention on the VA lock. Linux will be better, but no matter how
>>> > good, it is still a serialization point, and Amdahl's law will limit
>>> > your speedup significantly.
>>> >
>>> > -- Sean Silva
>>> >
>>> >>
>>> >>
>>> >>  - Run benchmarks with --no-threads if you are not focusing on
>>> >> multi-thread performance.
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> LLVM Developers mailing list
>>> >> llvm-dev at lists.llvm.org
>>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> >>
>>> >
>>> >
>>> > _______________________________________________
>>> > LLVM Developers mailing list
>>> > llvm-dev at lists.llvm.org
>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> >
>>>
>>
>>
>

