<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 23, 2016 at 5:00 PM, Rui Ueyama <span dir="ltr"><<a href="mailto:ruiu@google.com" target="_blank">ruiu@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><span class="">On Wed, Nov 23, 2016 at 4:53 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><span>On Wed, Nov 23, 2016 at 6:31 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Interesting. Might be worth giving a try again to the idea of creating<br>
the file in anonymous memory and using a write to output it.<br></blockquote><div><br></div></span><div>I'm not sure that will help. Even the kernel can't escape some of these costs; in modern 64-bit operating systems when you do a syscall you don't actually change the mappings (a TLB flush would be expensive), so the cost of populating the page tables in order to read the pages is still there (and hence the serialization point remains). One alternative is to use multiple processes instead of multiple threads, which would remove the serialization point by definition (it also seems like it might be a less invasive change, at least for the copying+relocating step).</div><div><br></div><div>One experiment might be to add a hack to pre-fault all the files that are used, so that you can isolate that cost from the rest of the link. That will give you an upper bound on the speedup there is to get from optimizing this.</div></div></div></div></blockquote><div><br></div></span><div>I experimented with adding MAP_POPULATE to LLVM's mmap in the hope that it would do what you described, but it made LLD 10% slower, and I cannot explain why.</div></div></div></div></blockquote><div><br></div><div>It may be that LLD does not touch every page of the input files, so MAP_POPULATE is causing the kernel to do unnecessary work faulting in things that LLD will never look at.</div><div><br></div><div>The purpose of the experiment I described, however, is not for those changes to make LLD faster per se, but to allow measuring how much faster LLD could be by optimizing this; with something like ftrace or dtrace you could measure the kernel time spent in MAP_POPULATE and subtract it to get a similar measurement.<br></div><div><br></div><div>-- Sean Silva</div><div> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div>Pre-faulting the allocations removes the serialization bottleneck on the kernel VA, since after the page tables are fully populated, they become a read-only data structure and each core's hardware TLB walker can read it independently.<br></div><div><br></div><div>For example, you could change elf::link to optionally take a map from file paths to buffers, which would override the native filesystem. Then in main() (before calling elf::link) you can map and pre-fault all the input files (which can be found from a trace of a previous run or whatever). By timing the subsequent call to elf::link you can get the desired measurement.</div><div><br></div><div>The ability to pass in a map of buffers would allow other interesting experiments. For example, the experiment above could be repeated with all the input buffers copied into a handful of 1GB pages. This would entirely eliminate the hardware TLB walking overhead for input buffers.</div><span class="m_-2003654782425564098HOEnZb"><font color="#888888"><div><br></div><div>-- Sean Silva</div></font></span><div><div class="m_-2003654782425564098h5"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
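<div><br></div><div>A minimal sketch (not LLD code) of the pre-faulting experiment described above, assuming POSIX mmap; the helper name mapAndPrefault is made up for illustration. It asks for MAP_POPULATE where the platform provides it and otherwise touches one byte per page, so the minor faults are taken before the timed call to elf::link:</div><pre>
// Illustrative sketch only (not LLD's code). Map one input file and force
// its pages into the page tables before the link itself is timed.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper: returns the read-only mapped buffer and its size.
static void *mapAndPrefault(const char *Path, size_t &Size) {
  int FD = open(Path, O_RDONLY);
  if (FD < 0)
    return nullptr;
  struct stat St;
  if (fstat(FD, &St) != 0 || St.st_size == 0) {
    close(FD);
    return nullptr;
  }
  Size = St.st_size;
  int Flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
  Flags |= MAP_POPULATE; // Linux: populate the page tables eagerly
#endif
  void *Buf = mmap(nullptr, Size, PROT_READ, Flags, FD, 0);
  close(FD);
  if (Buf == MAP_FAILED)
    return nullptr;
#ifndef MAP_POPULATE
  // Portable fallback: touch one byte per page so every minor fault is
  // taken here rather than inside the timed portion of the link.
  long PageSize = sysconf(_SC_PAGESIZE);
  volatile const uint8_t *P = static_cast<const uint8_t *>(Buf);
  for (size_t Off = 0; Off < Size; Off += PageSize)
    (void)P[Off];
#endif
  return Buf;
}
</pre><div>Buffers produced this way could then be handed to a hypothetical elf::link overload that accepts the path-to-buffer map described above, and timing that call by itself gives the measurement; substituting a hugetlb mapping and copying the file contents into it would be one way to try the 1GB-page variant.</div>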
<br>
Cheers,<br>
Rafael<br>
<br>
On 23 November 2016 at 02:41, Sean Silva via llvm-dev<br>
<div class="m_-2003654782425564098m_4891626280382965010gmail-HOEnZb"><div class="m_-2003654782425564098m_4891626280382965010gmail-h5"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>
><br>
><br>
> On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev<br>
> <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>
>><br>
>> LLD supports multi-threading, and it seems to be working well, as you can<br>
>> see in a recent result. In short, LLD runs 30% faster with the --threads option<br>
>> and more than 50% faster if you are using --build-id (your mileage may vary<br>
>> depending on your computer). However, I don't think most users even know<br>
>> about it, because --threads is not enabled by default.<br>
>><br>
>> I'm thinking of enabling --threads by default. We now have real users, and<br>
>> they'll be happy about the performance boost.<br>
>><br>
>> Any concerns?<br>
>><br>
>> I can't think of any problems with that, but I want to write a few notes<br>
>> about it:<br>
>><br>
>> - We still need to focus on single-thread performance rather than<br>
>> multi-threaded performance, because it is hard to make a slow program<br>
>> faster just by using more threads.<br>
>><br>
>> - We shouldn't do "too clever" things with threads. Currently, we use<br>
>> multiple threads only in two places where the work is highly parallelizable<br>
>> by nature (namely, copying and applying relocations for each input section,<br>
>> and computing the build-id hash). We use parallel_for_each, which is<br>
>> very simple and easy to understand. I believe this was the right design<br>
>> choice, and I don't think we want something like the workqueues/tasks in<br>
>> GNU gold, for example.<br>
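<div><br></div><div>For illustration, a sketch of the flat parallel_for_each pattern being described, assuming LLVM's parallel algorithms (the header and the exact spelling of the primitive have moved around between releases); the Chunk type is a stand-in, not LLD's real InputSection:</div><pre>
// Illustrative only. Each chunk owns a disjoint slice of the output file,
// so the iterations are independent and need no locking.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#include "llvm/Support/Parallel.h" // llvm::parallelForEach

struct Chunk {
  const uint8_t *Data; // bytes of one input section
  size_t Size;
  size_t OutOffset;    // fixed output offset, decided before this loop runs
};

void copyAndRelocate(const std::vector<Chunk> &Chunks, uint8_t *OutBuf) {
  llvm::parallelForEach(Chunks, [&](const Chunk &C) {
    std::memcpy(OutBuf + C.OutOffset, C.Data, C.Size);
    // ...apply this chunk's relocations in place here...
  });
}
</pre><div>The appeal of this shape is exactly what is said above: there is no shared mutable state between iterations, so the loop body can be read and reasoned about as ordinary sequential code.</div>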
><br>
><br>
> Sorry for the late response.<br>
><br>
> Copying and applying relocations are actually not as parallelizable as<br>
> you would imagine in current LLD. The reason is that there is an implicit<br>
> serialization when mutating the kernel's VA map (which happens any time<br>
> there is a minor page fault, i.e. the first time you touch a page of an<br>
> mmap'd input). Since threads share the same VA, that serialization applies<br>
> across all of them. Separate processes are needed to avoid this overhead<br>
> (note that the separate processes would still have the same output file<br>
> mapped, so, at least with fixed partitioning, there is no need for<br>
> complex IPC).<br>
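<div><br></div><div>A sketch of the separate-processes idea just described (POSIX, illustrative only; error handling is elided and the partition is simply rounded to page-size boundaries): each worker forks, maps its own fixed slice of the shared output file, and fills it in, so the only coordination needed is fork and wait:</div><pre>
// Illustrative sketch only. Each forked worker has its own address space,
// so its page faults do not contend with the other workers' faults on a
// shared VA lock, yet all of them write into the same output file.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdint>

void writePartitioned(const char *OutPath, size_t FileSize, int NumWorkers) {
  int FD = open(OutPath, O_RDWR | O_CREAT, 0666);
  if (FD < 0 || ftruncate(FD, FileSize) != 0)
    return; // error handling elided in this sketch
  long Page = sysconf(_SC_PAGESIZE);
  // Round the slice size up to a page multiple so every mmap offset is aligned.
  size_t Slice = ((FileSize / NumWorkers + Page - 1) / Page) * Page;
  for (int W = 0; W < NumWorkers; ++W) {
    size_t Begin = W * Slice;
    if (Begin >= FileSize)
      break;
    if (fork() == 0) {
      size_t Len = (Begin + Slice > FileSize) ? FileSize - Begin : Slice;
      uint8_t *Out = static_cast<uint8_t *>(
          mmap(nullptr, Len, PROT_READ | PROT_WRITE, MAP_SHARED, FD, Begin));
      if (Out != MAP_FAILED) {
        // ...copy this worker's input sections into Out[0, Len) and apply
        // their relocations here...
        munmap(Out, Len);
      }
      _exit(0);
    }
  }
  while (wait(nullptr) > 0)
    ; // reap all the workers
  close(FD);
}
</pre><div>With fixed partitioning like this, the workers never need to talk to each other; the parent simply waits for them all to exit.</div>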
><br>
> For `ld.lld -O0` on a Mac host, I measured under 1 GB/s copying speed, even<br>
> though the machine I was running on had about 50 GB/s of DRAM bandwidth; the<br>
> VA overhead is therefore on the order of a 50x slowdown for this copying<br>
> operation in this extreme case, and Amdahl's law indicates that there will be<br>
> practically no speedup for this copy operation from adding more threads. I've<br>
> also DTrace'd this and seen massive contention on the VA lock. Linux will be<br>
> better, but no matter how good it is, this is still a serialization point and<br>
> Amdahl's law will limit your speedup significantly.<br>
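<div><br></div><div>To make the Amdahl's-law point concrete (the numbers are illustrative, derived from the rough 50x figure above, not new measurements): if the serialized fault path accounts for about 98% of the copy time, the parallelizable remainder is about 2%, so the best possible speedup with N threads is 1 / (0.98 + 0.02/N), which stays at roughly 1.02x no matter how many threads are added.</div>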
><br>
> -- Sean Silva<br>
><br>
>><br>
>><br>
>> - Run benchmarks with --no-threads if you are not focusing on<br>
>> multi-threaded performance.<br>
>><br>
>><br>
><br>
><br>
</div></div></blockquote></div></div></div><br></div></div>
</blockquote></div></div></div><br></div></div>
</blockquote></div><br></div></div>