[PATCH] D20645: Avoid doing binary search.

Thu May 26 11:51:06 PDT 2016

Thank you for doing this. These numbers are interesting, although some
numbers seem to be different from what I saw. At least the number I saw for
this patch is positive.

Anyways, even though there might be room to improve it further, this patch
should be fine. It doesn't hurt other improvements. If there's a better
way, we can simply do it (that would probably be replacing the code in this
patch, but that's fine). I don't see any reason why this change is
particularly controversial. I'd say that the patch before this one (which
shrinks the size of SectionPiece) would be more controversial as it is
microscopic optimization.

On Thu, May 26, 2016 at 3:11 AM, Sean Silva <chisophugis at gmail.com> wrote:

> silvas added a comment.
>
> I tried out this patch on a build of one of our games with debug info
> (output binary is around 300M; getOffset is called around 11.5M times), and
> I can't reproduce the significant performance difference you are seeing
> with this patch: http://reviews.llvm.org/F1986081 (vertical axis is run
> time, horizontal axis is run number)
> F1986081: Screen Shot 2016-05-25 at 10.07.36 PM.png <
> http://reviews.llvm.org/F1986081>
> This was on a mac. I think the extra memory cost of the OffsetMap might be
> hurting (and it may be worse on Windows since the virtual memory system
> there is even poorer usually). OffsetMap adds 2 * sizeof(uintX_t) * #pieces
> * (1/hashTableLoad) total memory, which for ELF64 is over 16 bytes per
> piece; for some string tables this densemap may be on the order of the size
> of the entire section.
> On the profile, I see MergeInputSection::getOffset go from 20.5% to 10.4%
> with this patch (good), but I now see a new entry for
> MergeInputSection::finalizePieces at 10.3% of total link time,
> counteracting that. MergeInputSection::finalizePieces is composed of 4.3%
> (of total link time) in DenseMap's InsertIntoBucketImpl and 3.4% is
> DenseMap::grow (the rest is 2.6% self-time of finalizePieces).
>

finalizePieces can be parallelized so it's going to be overall win.

> As a point of comparison, consider this simple patch:
> http://reviews.llvm.org/P4309
> P4309 <http://reviews.llvm.org/P4309P4309> (An Untitled Masterwork) <
> http://reviews.llvm.org/P4309>
> Compared to the control and your current patch (blue and yellow), this
> patch (green) gives a pretty substantial speedup on this test case with
> this compiler (apple clang) on this os (mac):
> http://reviews.llvm.org/F1986091
> F1986091 <http://reviews.llvm.org/F1986091F1986091>: Screen Shot
> 2016-05-25 at 10.17.51 PM.png <http://reviews.llvm.org/F1986091>
> Zoomed in view, normalized to median of "a":
> http://reviews.llvm.org/F1986092
> F1986092 <http://reviews.llvm.org/F1986092F1986092>: Screen Shot
> 2016-05-25 at 10.19.01 PM.png <http://reviews.llvm.org/F1986092>
> Just the `CachedLastPieceIndex + 1` case gets about 5% speedup on the test
> case I was looking at. Adding the `CachedLastPieceIndex` case brings it up
> to about 8%.
>

Funny you should mention that; I tried the same thing myself before this
patch. I didn't see that much speedup, but that was a positive result. I
think I can describe why this can improve the performance. When you are
applying relocations to a mergeable section, it needs to get locations to
which relocations are applied. That access pattern is sequential increasing
order.

Obviously this approach is thread-hostile and probably not worth
> implementing, but hopefully it shows a couple things:
>

You can make the cache thread-local variable.

> 1. When so much time is spent on so few lines of code, even relatively
> trivial changes can cause large performance swings.
> 2. Performance improvements on one test case/platform/compiler (I mean the
> compiler used to build LLD itself) may be neutral (or even pessimizations)
> on other test cases/platforms/compilers.
>
> To be sure I reran those measurements again, this time with the calls to
> the 3 different versions interleaved (reference, this patch ("rui"), and
> the silly alternative I posted ("alt")) which confirms the results:
> http://reviews.llvm.org/F1986202
> F1986202 <http://reviews.llvm.org/F1986202F1986202>: Screen Shot
> 2016-05-26 at 12.03.31 AM.png <http://reviews.llvm.org/F1986202>
>
> Since I have a profiler open again (this time OSX Instruments), here are
> some more notes that might be useful:
>
> For this testcase, about 200k of the 11.5M calls to getOffset are for
> fixed-size (non-SHF_STRINGS) sections where we can directly index instead
> of do binary search. If we wanted we could do a special case check and
> speed up those cases we could do that (but I don't think that would have a
> huge effect on the overall performance for this kind of test case so I
> didn't bother trying).
>
> Also, one thing to note is that if we already have a sorted list of
> relocations referencing the merge section, then resolving their offset to
> their piece index can be done almost for free in
> InputSection.cpp:splitStrings since we are linearly scanning it anyway.
> InputSection.cpp:splitStrings is actually 11% of the profile so avoiding
> it, optimizing it, or at least doing other useful work (e.g. hashing all
> the strings or resolving offsets to pieces) while we do that is probably
> worth it. About 4.4% of total runtime seems to be in std::vector's
> __emplace_back_slow_path as called by InputSection.cpp:splitStrings
> suggesting that we may want to look into tuning our memory allocation.
>
> We seem to spend about 15% of the link as self-time in
> InputSectionBase<ELFT>::relocateNonAlloc which can probably be dramatically
> reduced by manually pipelining the loop (or adding some prefetches to
> mitigate the serial dependencies through memory there). Also hoisting out
> some hot cases from the various switches and appropriate __builtin_expect
> will likely help a bit too.
> There is about 5% self time in SymbolBody::getVA (it is about 25% of the
> total time, but 20% of this is MergeInputSection::getOffset) which may be
> the serial dependence through memory in the expression SC->OutSec->getVA();
> there also might be some dispatching overhead through some switches and
> could probably benefit from `__builtin_expect`  or similar (I haven't
> looked in detail).
>
> We spend about 12% in memcpy/memmove when writing to the output, likely
> bottlenecked on the kernel virtual memory subsystem (besides servicing the
> page fault, the kernel has to zerofill every page we touch there; also the
> kernel has locking contention while servicing the page fault, explaining
> the poor scalability with threading). This is 332ms time which for a 300MB
> output works out to 0.9GB/s, at least an order of magnitude below DRAM
> bandwidth (which memcpy/memmove should easily max out), so by tuning that
> (e.g. ensuring we are using 2MB pages when available, spawning threads
> early on to start faulting it all in, maybe use pwrite instead of mmap,
> etc.) that cost can be reduced to almost nothing. Note that in an -O0 link
> the output is 421MB and 418ms in memcpy/memmove, which is about 1GB/s (i.e.
> not maxing out DRAM); but in the -O0 link about 30% of time is spent in
> memcpy/memmove for this testcase which means that the speedup for -O0 to be
> expected from tuning this is dramatic.
> Mac's VM subsystem is pretty bad though; I expect Linux to do better here
> (e.g. automatically use 2MB pages, etc.). Worth measuring.
>
> Overall, merge sections seem to account for about 50% of the total runtime
> in this test case. I see on the profile:
> MergeInputSection::getOffset - 20.5%
> MergeOutputSection::addSection - 16.8%
> MergeInputSection::splitIntoPieces - 11.1%
> At -O0, we manage to recover most of this of course (we are about 40%
> faster on this test case based on a quick measurement).
>

Yes, mergeable sections are really costly for the linker if it actually do
merging, and as we optimize the linker for the other parts, it stands out
even more. I imagine that it needs some drastic design change and new
algorithms (and probably a new on-disk section format?) if we aim to down
it to close to zero.

One other interesting thing I measured (by just adding a printf): about 35%
> of the calls to MergeInputSection::getOffset are duplicated. I.e. 35% of
> all calls to getOffset on a particular MergeInputSection are for offsets
> that have already had getOffset called on them. (this is easy to test by
> printing a line for each call to MergeInputSection::getOffset with the pair
> ((uintptr_t)this,Offset) and then sort|uniq them to see how many are
> actually distinct).
>
> The idea for caching the last piece index came from a quick observation of
> the pattern of offsets passed to MergeInputSection::getOffset for each
> MergeInputSection. The access pattern is pretty structured (and somewhat
> pretty; this only shows the pattern for 100 MergeInputSection's (out of
> over 7000)): http://reviews.llvm.org/F1986332
> F1986332 <http://reviews.llvm.org/F1986332F1986332>: Screen Shot
> 2016-05-26 at 2.18.20 AM.png <http://reviews.llvm.org/F1986332>
> Drilling down I found that the differences between successive piece
> indexes was 1 (i.e. sequential) in 50% of cases, and 0 in 7% of cases
> http://reviews.llvm.org/F1986337.
> F1986337 <http://reviews.llvm.org/F1986337.F1986337>: Screen Shot
> 2016-05-26 at 2.22.01 AM.png <http://reviews.llvm.org/F1986337>
> Besides those, there is a very long tail and adding more cases (besides
> `CachedLastPieceIndex + 1` and `CachedLastPieceIndex`) didn't speed things
> up in my empirical testing. Also, using CachedLastPieceIndex as an initial
> "seed" to partition the binary search in the case of "cache miss" didn't
> help much in my measurements (in fact, it actually hurt; likely because it
> harmed locality (i.e. highly different sequences of elements were searched
> each time instead of always starting at the beginning and end of the pieces
> array, testing the same middle every time, etc.)).
>
> Also, at least for this test case, there is an interesting distribution of
> calls to MergeInputSection::getOffset made per input section:
> http://reviews.llvm.org/F1986325
> F1986325 <http://reviews.llvm.org/F1986325F1986325>: Screen Shot
> 2016-05-26 at 2.07.23 AM.png <http://reviews.llvm.org/F1986325>
> The area under the histogram is the total number of calls to
> MergeInputSection::getOffset. The horizontal axis is the number of calls to
> MergeInputSection::getOffset for a given MergeInputSection (e.g. there is
> an outlier where we called MergeInputSection::getOffset 30,000 times on the
> MergeInputSection). The vertical axis is the total number of calls to
> MergeInputSection::getOffset within each bucket of the histogram.
> The thing that surprised me is that sections with a smaller number of
> calls to getOffset made into them become increasingly abundant so that the
> graph is relatively flat (varying only 2-3x) up to about 15,000 on the
> horizontal axis.
>
>
> http://reviews.llvm.org/D20645
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160526/dae2f326/attachment.html>