[PATCH] D20645: Avoid doing binary search.

Thu May 26 03:11:57 PDT 2016

silvas added a comment.

I tried out this patch on a build of one of our games with debug info (output binary is around 300M; getOffset is called around 11.5M times), and I can't reproduce the significant performance difference you are seeing with this patch: http://reviews.llvm.org/F1986081 (vertical axis is run time, horizontal axis is run number)
F1986081: Screen Shot 2016-05-25 at 10.07.36 PM.png <http://reviews.llvm.org/F1986081>
This was on a mac. I think the extra memory cost of the OffsetMap might be hurting (and it may be worse on Windows since the virtual memory system there is even poorer usually). OffsetMap adds 2 * sizeof(uintX_t) * #pieces * (1/hashTableLoad) total memory, which for ELF64 is over 16 bytes per piece; for some string tables this densemap may be on the order of the size of the entire section.
On the profile, I see MergeInputSection::getOffset go from 20.5% to 10.4% with this patch (good), but I now see a new entry for MergeInputSection::finalizePieces at 10.3% of total link time, counteracting that. MergeInputSection::finalizePieces is composed of 4.3% (of total link time) in DenseMap's InsertIntoBucketImpl and 3.4% is DenseMap::grow (the rest is 2.6% self-time of finalizePieces).

As a point of comparison, consider this simple patch: http://reviews.llvm.org/P4309
P4309 (An Untitled Masterwork) <http://reviews.llvm.org/P4309>
Compared to the control and your current patch (blue and yellow), this patch (green) gives a pretty substantial speedup on this test case with this compiler (apple clang) on this os (mac): http://reviews.llvm.org/F1986091
F1986091: Screen Shot 2016-05-25 at 10.17.51 PM.png <http://reviews.llvm.org/F1986091>
Zoomed in view, normalized to median of "a": http://reviews.llvm.org/F1986092
F1986092: Screen Shot 2016-05-25 at 10.19.01 PM.png <http://reviews.llvm.org/F1986092>
Just the `CachedLastPieceIndex + 1` case gets about 5% speedup on the test case I was looking at. Adding the `CachedLastPieceIndex` case brings it up to about 8%.
Obviously this approach is thread-hostile and probably not worth implementing, but hopefully it shows a couple things:

1. When so much time is spent on so few lines of code, even relatively trivial changes can cause large performance swings.
2. Performance improvements on one test case/platform/compiler (I mean the compiler used to build LLD itself) may be neutral (or even pessimizations) on other test cases/platforms/compilers.

To be sure I reran those measurements again, this time with the calls to the 3 different versions interleaved (reference, this patch ("rui"), and the silly alternative I posted ("alt")) which confirms the results: http://reviews.llvm.org/F1986202
F1986202: Screen Shot 2016-05-26 at 12.03.31 AM.png <http://reviews.llvm.org/F1986202>

Since I have a profiler open again (this time OSX Instruments), here are some more notes that might be useful:

For this testcase, about 200k of the 11.5M calls to getOffset are for fixed-size (non-SHF_STRINGS) sections where we can directly index instead of do binary search. If we wanted we could do a special case check and speed up those cases we could do that (but I don't think that would have a huge effect on the overall performance for this kind of test case so I didn't bother trying).

Also, one thing to note is that if we already have a sorted list of relocations referencing the merge section, then resolving their offset to their piece index can be done almost for free in InputSection.cpp:splitStrings since we are linearly scanning it anyway. InputSection.cpp:splitStrings is actually 11% of the profile so avoiding it, optimizing it, or at least doing other useful work (e.g. hashing all the strings or resolving offsets to pieces) while we do that is probably worth it. About 4.4% of total runtime seems to be in std::vector's __emplace_back_slow_path as called by InputSection.cpp:splitStrings suggesting that we may want to look into tuning our memory allocation.

We seem to spend about 15% of the link as self-time in InputSectionBase<ELFT>::relocateNonAlloc which can probably be dramatically reduced by manually pipelining the loop (or adding some prefetches to mitigate the serial dependencies through memory there). Also hoisting out some hot cases from the various switches and appropriate __builtin_expect will likely help a bit too.
There is about 5% self time in SymbolBody::getVA (it is about 25% of the total time, but 20% of this is MergeInputSection::getOffset) which may be the serial dependence through memory in the expression SC->OutSec->getVA(); there also might be some dispatching overhead through some switches and could probably benefit from `__builtin_expect`  or similar (I haven't looked in detail).

We spend about 12% in memcpy/memmove when writing to the output, likely bottlenecked on the kernel virtual memory subsystem (besides servicing the page fault, the kernel has to zerofill every page we touch there; also the kernel has locking contention while servicing the page fault, explaining the poor scalability with threading). This is 332ms time which for a 300MB output works out to 0.9GB/s, at least an order of magnitude below DRAM bandwidth (which memcpy/memmove should easily max out), so by tuning that (e.g. ensuring we are using 2MB pages when available, spawning threads early on to start faulting it all in, maybe use pwrite instead of mmap, etc.) that cost can be reduced to almost nothing. Note that in an -O0 link the output is 421MB and 418ms in memcpy/memmove, which is about 1GB/s (i.e. not maxing out DRAM); but in the -O0 link about 30% of time is spent in memcpy/memmove for this testcase which means that the speedup for -O0 to be expected from tuning this is dramatic.
Mac's VM subsystem is pretty bad though; I expect Linux to do better here (e.g. automatically use 2MB pages, etc.). Worth measuring.

Overall, merge sections seem to account for about 50% of the total runtime in this test case. I see on the profile:
MergeInputSection::getOffset - 20.5%
MergeOutputSection::addSection - 16.8%
MergeInputSection::splitIntoPieces - 11.1%
At -O0, we manage to recover most of this of course (we are about 40% faster on this test case based on a quick measurement).

One other interesting thing I measured (by just adding a printf): about 35% of the calls to MergeInputSection::getOffset are duplicated. I.e. 35% of all calls to getOffset on a particular MergeInputSection are for offsets that have already had getOffset called on them. (this is easy to test by printing a line for each call to MergeInputSection::getOffset with the pair ((uintptr_t)this,Offset) and then sort|uniq them to see how many are actually distinct).

The idea for caching the last piece index came from a quick observation of the pattern of offsets passed to MergeInputSection::getOffset for each MergeInputSection. The access pattern is pretty structured (and somewhat pretty; this only shows the pattern for 100 MergeInputSection's (out of over 7000)): http://reviews.llvm.org/F1986332
F1986332: Screen Shot 2016-05-26 at 2.18.20 AM.png <http://reviews.llvm.org/F1986332>
Drilling down I found that the differences between successive piece indexes was 1 (i.e. sequential) in 50% of cases, and 0 in 7% of cases http://reviews.llvm.org/F1986337.
F1986337: Screen Shot 2016-05-26 at 2.22.01 AM.png <http://reviews.llvm.org/F1986337>
Besides those, there is a very long tail and adding more cases (besides `CachedLastPieceIndex + 1` and `CachedLastPieceIndex`) didn't speed things up in my empirical testing. Also, using CachedLastPieceIndex as an initial "seed" to partition the binary search in the case of "cache miss" didn't help much in my measurements (in fact, it actually hurt; likely because it harmed locality (i.e. highly different sequences of elements were searched each time instead of always starting at the beginning and end of the pieces array, testing the same middle every time, etc.)).

Also, at least for this test case, there is an interesting distribution of calls to MergeInputSection::getOffset made per input section: http://reviews.llvm.org/F1986325
F1986325: Screen Shot 2016-05-26 at 2.07.23 AM.png <http://reviews.llvm.org/F1986325>
The area under the histogram is the total number of calls to MergeInputSection::getOffset. The horizontal axis is the number of calls to MergeInputSection::getOffset for a given MergeInputSection (e.g. there is an outlier where we called MergeInputSection::getOffset 30,000 times on the MergeInputSection). The vertical axis is the total number of calls to MergeInputSection::getOffset within each bucket of the histogram.
The thing that surprised me is that sections with a smaller number of calls to getOffset made into them become increasingly abundant so that the graph is relatively flat (varying only 2-3x) up to about 15,000 on the horizontal axis.

http://reviews.llvm.org/D20645