<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, May 25, 2016 at 4:07 PM, Sean Silva <span dir="ltr"><<a href="mailto:chisophugis@gmail.com" target="_blank">chisophugis@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">silvas added a comment.<br>

<span><br>

In <a href="http://reviews.llvm.org/D20645#440099" rel="noreferrer" target="_blank">http://reviews.llvm.org/D20645#440099</a>, @ruiu wrote:<br>

<br>

> I was surprised you said 5% overall improvement isn't worth it. The number seems really significant to me. It is not a microbenchmark<br>

<br>

<br>

</span>It is. It is basically a microbenchmark of an ad-hoc std::upper_bound-based associative container. This patch adds a hash table in front and makes this be only 20% of the total link time instead of 25% on one data set that you looked at. It's a band-aid and adds complexity which makes it more difficult to actually optimize the linker long-term.<br></blockquote><div><br></div><div>It is not. Microbenchmarks are benchmarks that exercise only the code that we are interested in. I didn't make object files with lots of mergeable sections to make up the number. This is a real patch that cuts the overall link time by 5% for a general case by adding 41 lines of simple code.</div><div><br></div><div>The cost of calculating offsets cannot be zero. The function is hot because it is one of the most frequently called function in LLD. Obtaining an output offset is needed to apply relocations to get locations in output to know where they need to be applied to. For relocations that refer merged strings, it needs one more function call for each relocation.</div><div><br></div><div>Did you know the number of relocations? It's huge. Anything that are called for each relocation is very expensive. <a href="http://lld.llvm.org/NewLLD.html#numbers-you-want-to-know" target="_blank">http://lld.llvm.org/NewLLD.html#numbers-you-want-to-know</a><br></div><div><br></div><div>Offset computation for mergeable sections cannot be an addition unlike normal sections. It is inherently more expensive because of the fact that section contents is not contiguous.</div><div><br></div><div>So 5% improvement is a pretty good one. And this patch is not complex nor tricky. You misunderstood that point.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

Also, it punts on threads, so it doesn't stack with any gains we get there (hence not a long-term solution).<br></blockquote><div><br></div><div>It doesn't punt on threads -- you misunderstood it. It works fine with and without multi-threading. The code you are looking at is a minor path which is not executed for most function calls, and in the minor path, we do caching "just in case" if there's no other threads. It is a minor optimization so it doesn't affect the number that much, but it wouldn't hurt either, so it's still nice to have.</div><div><br></div><div>Speaking of threads, this patch opens up an opportunity to utilize more threads. Now that we compute offsets for section pieces all at once instead of computing them as-needed basis (they will be needed eventually.) Previously, we cannot parallelize offset computation, but now we can do it per-input-section basis just by rewriting the for loop with a parallel_for.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

Have you tried just completely eliminating this code? E.g. creating separate "input sections" for each of the elements in the mergeable input section? Think about it like this: this code is currently 25% of the link time. Are there other solutions that are less than 25% of the link time? I think it is reasonable to expect to find a solution that only costs us 5%.<br>

<span><br>

> but is a real result of linking clang. Just +41 lines improves the overall link time by 5%? To me, it is between "excellent" and "it's too good to be true so I'll apply this and verify it myself."<br>

<br>

><br>

<br>

> You can see the drop of number of branch mispredictions in the perf numbers.<br>

<br>

<br>

</span>This kind of thinking will get you stuck in a local optimum. Think of the future and how much faster LLD will be than the current LLD. Is this patch part of that future? How much faster do you want LLD to be? How much faster can it be?<br>

<br>

Note: on windows by default (and also trivially on tmpfs; relevant to google's internal build system) LLD does not have to wait on writing to a disk. LLD has a long way to go until it is even heavily bottlenecked by disk IO; let alone being limited by DRAM bandwidth (e.g. 50GB/s on a typical low-end machine).<br></blockquote><div><br></div><div>If you come up with an alternative algorithm to make it faster, please suggest. But in any case this doesn't hurt any future improvements.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

<br>

<a href="http://reviews.llvm.org/D20645" rel="noreferrer" target="_blank">http://reviews.llvm.org/D20645</a><br>

<br>

<br>

<br>

</blockquote></div><br></div></div>