[llvm-dev] LLD: time to enable --threads by default

Wed Nov 23 19:13:14 PST 2016

On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <
> rafael.espindola at gmail.com> wrote:
>
>> Interesting. Might be worth giving a try again to the idea of creating
>> the file in anonymous memory and using a write to output it.
>>
>
> I'm not sure that will help. Even the kernel can't escape some of these
> costs; in modern 64-bit operating systems when you do a syscall you don't
> actually change the mappings (TLB flush would be expensive), so the cost of
> populating the page tables in order to read the pages is still there (and
> hence the serialization point remains).
>

I looked a bit more closely at this. In current linux, the actual faulting
doesn't seem like it will have contention unless it needs to do memory
allocation for the page table structures (but even then, linux's memory
allocation is hopefully pretty optimized and scalable). The serialization
point is when the kernel VM data structures need to be updated (NOT the
hardware page tables which are usually populated lazily), which generally
will only happen inside the mmap call.

This should mean that the primary serialization points in the kernel will
be:
1. during the mmap call on any file
2. depending on where the kernel maps each input file, during the first
accesses to a file to allocate page table structures (but that should be
pretty scalable)
3. allocating the output files (this should be pretty scalable too since it
is just memory allocation)

A simple test on my machine at home (i7-6700HQ CPU @ 2.60GHz) shows about
90ms to fill a 1GB buffer when it is freshly allocated (i.e., while the
buffer is unfaulted and the kernel has to fault the memory during the
memset) and about 45ms subsequent times (when the buffer is already
faulted). Linux is using transparent huge pages though so that is with 2M
pages. Turning off transparent hugepages so that 4k pages are used, the
initial memset time goes up to about 280ms and subsequent times are still
about 45ms. So with 4k pages, the slowdown is more than 6x for the initial
faulting. Adding threads helps a bit, but cannot hide the fundamental work
that the kernel has to do to allocate the memory.

4kpages-1thread-unfaulted 280ms
4kpages-2thread-unfaulted 160ms
4kpages-3thread-unfaulted 110ms
4kpages-4thread-unfaulted 100ms

2Mpages-1thread-unfaulted 90ms
2Mpages-2thread-unfaulted 60ms
2Mpages-3thread-unfaulted 65ms
2Mpages-4thread-unfaulted 70ms

4kpages-1thread-faulted 45ms
2Mpages-1thread-faulted 45ms

(note: these are pretty rough numbers; they are just to get a feel for the
scaling)

Also, adding threads for this test case *after the pages are faulted in*
only helps a little bit since it is mostly bottlenecked on DRAM bandwidth
(it is just a simple memset test though, so that is expected).

This was on Linux. The OS overhead will probably be larger (potentially
much larger) for Mac and Windows.

-- Sean Silva

> One alternative is to use multiple processes instead of multiple threads
> which would remove serialization point by definition (it also seems like it
> might be less invasive of a change, at least for the copying+relocating
> step).
>
> One experiment might be to add a hack to pre-fault all the files that are
> used, so that you can isolate that cost from the rest of the link. That
> will give you an upper bound on the speedup that there is to get from
> optimizing this.
> Pre-faulting the allocations removes the serialization bottleneck on the
> kernel VA, since after the page tables are fully populated, they become a
> read-only data structure and each core's hardware TLB walker can read it
> independently.
>
> For example, you could change elf::link to optionally take a map from file
> paths to buffers, which will override the native filesystem. Then in main()
> (before calling elf::link) you can map and pre-fault all the input files
> (which can be found from a trace of a previous run or whatever). By timing
> the subsequent call to elf::link you can get the desired measurement.
>
> The ability to pass in a map of buffers would allow other experiments that
> would be interesting. For example, the experiment above could be repeated
> with all the input buffers copied into a handful of 1GB pages. This would
> allow entirely eliminating the hardware TLB walking overhead for input
> buffers.
>
> -- Sean Silva
>
>
>>
>> Cheers,
>> Rafael
>>
>> On 23 November 2016 at 02:41, Sean Silva via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> >
>> >
>> > On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev
>> > <llvm-dev at lists.llvm.org> wrote:
>> >>
>> >> LLD supports multi-threading, and it seems to be working well as you
>> can
>> >> see in a recent result. In short, LLD runs 30% faster with --threads
>> option
>> >> and more than 50% faster if you are using --build-id (your mileage may
>> vary
>> >> depending on your computer). However, I don't think most users even
>> don't
>> >> know about that because --threads is not a default option.
>> >>
>> >> I'm thinking to enable --threads by default. We now have real users,
>> and
>> >> they'll be happy about the performance boost.
>> >>
>> >> Any concerns?
>> >>
>> >> I can't think of problems with that, but I want to write a few notes
>> about
>> >> that:
>> >>
>> >>  - We still need to focus on single-thread performance rather than
>> >> multi-threaded one because it is hard to make a slow program faster
>> just by
>> >> using more threads.
>> >>
>> >>  - We shouldn't do "too clever" things with threads. Currently, we are
>> >> using multi-threads only at two places where they are highly
>> parallelizable
>> >> by nature (namely, copying and applying relocations for each input
>> section,
>> >> and computing build-id hash). We are using parallel_for_each, and that
>> is
>> >> very simple and easy to understand. I believe this was a right design
>> >> choice, and I don't think we want to have something like
>> workqueues/tasks in
>> >> GNU gold, for example.
>> >
>> >
>> > Sorry for the late response.
>> >
>> > Copying and applying relocations is actually are not as parallelizable
>> as
>> > you would imagine in current LLD. The reason is that there is an
>> implicit
>> > serialization when mutating the kernel's VA map (which happens any time
>> > there is a minor page fault, i.e. the first time you touch a page of an
>> > mmap'd input). Since threads share the same VA, there is an implicit
>> > serialization across them. Separate processes are needed to avoid this
>> > overhead (note that the separate processes would still have the same
>> output
>> > file mapped; so (at least with fixed partitioning) there is no need for
>> > complex IPC).
>> >
>> > For `ld.lld -O0` on Mac host, I measured <1GB/s copying speed, even
>> though
>> > the machine I was running on had like 50 GB/s DRAM bandwidth; so the VA
>> > overhead is on the order of a 50x slowdown for this copying operation in
>> > this extreme case, so Amdahl's law indicates that there will be
>> practically
>> > no speedup for this copy operation by adding multiple threads. I've also
>> > DTrace'd this to see massive contention on the VA lock. LInux will be
>> better
>> > but no matter how good, it is still a serialization point and Amdahl's
>> law
>> > will limit your speedup significantly.
>> >
>> > -- Sean Silva
>> >
>> >>
>> >>
>> >>  - Run benchmarks with --no-threads if you are not focusing on
>> >> multi-thread performance.
>> >>
>> >>
>> >> _______________________________________________
>> >> LLVM Developers mailing list
>> >> llvm-dev at lists.llvm.org
>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >>
>> >
>> >
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161123/55ced382/attachment-0001.html>