[PATCH] D27146: Merge strings using a probabilistic algorithm to reduce latency.

Rui Ueyama via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 28 09:38:09 PST 2016


Sorry, I mean https://reviews.llvm.org/D27155

On Mon, Nov 28, 2016 at 9:37 AM, Rui Ueyama <ruiu at google.com> wrote:

> I eventually wrote three patches for this, and
> https://reviews.llvm.org/D27146 is the most promising. (In case you are
> not aware of it / haven't reached the top of your mail inbox yet.)
>
> On Mon, Nov 28, 2016 at 9:33 AM, David Blaikie <dblaikie at gmail.com> wrote:
>
>> (not much to add except that I kind of love this - really neat
>> idea/direction to pursue/play with possibilities)
>>
>> As for making this stable though probabilistic: any chance of seeding the
>> RNG with a known value to get stability? (possibly using some of the input
>> contents as the seed, if that's helpful) - still risks pathological cases,
>> I suppose, but should be OK?
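>>
>> For example, something like this (only a sketch; the helper name and
>> exactly what gets hashed are arbitrary):
>>
>>   #include "llvm/ADT/ArrayRef.h"
>>   #include "llvm/ADT/Hashing.h"
>>   #include "llvm/ADT/StringRef.h"
>>   #include <cstdint>
>>   #include <random>
>>
>>   // Derive a deterministic seed from the inputs so that identical
>>   // inputs always sample the same strings.
>>   std::mt19937_64 makeSeededRNG(llvm::ArrayRef<llvm::StringRef> Inputs) {
>>     uint64_t Seed = 0;
>>     for (llvm::StringRef S : Inputs)
>>       Seed = llvm::hash_combine(Seed, S);
>>     return std::mt19937_64(Seed);
>>   }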
>>
>> On Sat, Nov 26, 2016 at 9:11 PM Rui Ueyama via Phabricator via
>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>
>>> ruiu created this revision.
>>> ruiu added reviewers: rafael, silvas.
>>> ruiu added a subscriber: llvm-commits.
>>>
>>> I'm sending this patch to get feedback. I haven't convinced even myself
>>> that this is the right thing to do, but it should be interesting
>>> to those who want to see what we can do to improve the linker's latency.
>>>
>>> String merging is one of the slowest passes in LLD because of the
>>> sheer number of mergeable strings. For example, Clang with debug info
>>> contains 30 million mergeable strings (the average length is about 50
>>> bytes). They need to be uniquified, and the uniquified strings need to
>>> get consecutive offsets in the resulting string table.
>>>
>>> Currently, we are using a (single-threaded, regular) dense map for
>>> string unification. Merging the 30 million strings takes about 2
>>> seconds on my machine.
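>>>
>>> (For reference, the current scheme boils down to something like the
>>> following; this is only an illustrative sketch, not the actual LLD code:
>>>
>>>   #include "llvm/ADT/ArrayRef.h"
>>>   #include "llvm/ADT/DenseMap.h"
>>>   #include "llvm/ADT/StringRef.h"
>>>   #include <cstdint>
>>>
>>>   // Uniquify strings and give each unique string a consecutive offset
>>>   // in the output string table. Returns the final table size.
>>>   uint64_t assignOffsets(llvm::ArrayRef<llvm::StringRef> Strings,
>>>                          llvm::DenseMap<llvm::StringRef, uint64_t> &Offsets) {
>>>     uint64_t Size = 0;
>>>     for (llvm::StringRef S : Strings) {
>>>       auto P = Offsets.insert({S, Size});
>>>       if (P.second)             // first occurrence of this string
>>>         Size += S.size() + 1;   // +1 for the NUL terminator
>>>     }
>>>     return Size;
>>>   }
>>>
>>> Every insertion goes through that one shared map, which is why the
>>> pass does not parallelize as-is.)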
>>>
>>> This patch implements one of my ideas about how to reduce latency by
>>> parallelizing it. The algorithm is probabilistic, meaning that
>>> even though duplicated strings are likely to be merged, that is not
>>> guaranteed. As a result, it produces a larger string table quickly.
>>> (If you need to optimize for size, you can still pass -O2, which
>>> does tail merging.)
>>>
>>> Here's how it works.
>>>
>>> In the first step, we take 10% of the input string set to create a small
>>> string table. The resulting string table is very unlikely to contain
>>> all strings of the entire set, but it is likely to contain most of the
>>> duplicated strings, because duplicated strings are repeated many times.
>>>
>>> The second step processes the remaining 90% in parallel. In this step,
>>> we do not merge strings, so if a string is not in the small string
>>> table we created in the first step, it is simply appended to the end
>>> of the string table. This step completes the string table.
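>>>
>>> In pseudo-C++, the two steps look roughly like this (only a sketch;
>>> Shard, buildStringTable and the other names are made up for
>>> illustration, not the code in the patch):
>>>
>>>   #include "llvm/ADT/ArrayRef.h"
>>>   #include "llvm/ADT/DenseMap.h"
>>>   #include "llvm/ADT/StringRef.h"
>>>   #include <cstdint>
>>>   #include <string>
>>>   #include <vector>
>>>
>>>   // One shard of the remaining 90%. Each shard would be handled by
>>>   // one thread and grows its own tail of unmerged strings.
>>>   struct Shard {
>>>     std::vector<llvm::StringRef> Strings;
>>>     std::vector<uint64_t> Offsets; // offset assigned to each string
>>>     std::string Tail;              // strings appended without merging
>>>   };
>>>
>>>   void buildStringTable(llvm::ArrayRef<llvm::StringRef> Sample,
>>>                         llvm::MutableArrayRef<Shard> Shards) {
>>>     // Step 1: build a small table from the ~10% sample (serial).
>>>     llvm::DenseMap<llvm::StringRef, uint64_t> Map;
>>>     uint64_t Off = 0;
>>>     for (llvm::StringRef S : Sample)
>>>       if (Map.insert({S, Off}).second)
>>>         Off += S.size() + 1;
>>>
>>>     // Step 2: handle the remaining strings. The small table is only
>>>     // read here, so shards can run in parallel without locking (shown
>>>     // serially for brevity). Strings that miss the table go to the
>>>     // shard's tail without any deduplication.
>>>     for (Shard &Sh : Shards) {
>>>       for (llvm::StringRef S : Sh.Strings) {
>>>         auto It = Map.find(S);
>>>         if (It != Map.end()) {
>>>           Sh.Offsets.push_back(It->second);
>>>         } else {
>>>           // Offset within this shard's tail; which entries are
>>>           // tail-relative is tracked separately (bookkeeping omitted).
>>>           Sh.Offsets.push_back(Sh.Tail.size());
>>>           Sh.Tail.append(S.data(), S.size());
>>>           Sh.Tail.push_back('\0');
>>>         }
>>>       }
>>>     }
>>>     // A final pass (not shown) would concatenate the tails after the
>>>     // small table and rebase the shard-relative offsets.
>>>   }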
>>>
>>> Here are some numbers for the resulting clang executables:
>>>
>>>   Size of .debug_str section:
>>>   Current            108,049,822   (+0%)
>>>   Probabilistic      154,089,550   (+42.6%)
>>>   No string merging  1,591,388,940 (+1472.8%)
>>>
>>>   Size of resulting file:
>>>   Current            1,440,453,528 (+0%)
>>>   Probabilistic      1,490,597,448 (+3.5%)
>>>   No string merging  2,945,020,808 (+204.5%)
>>>
>>> The probabilistic algorithm produces a larger string table, but it is
>>> still much smaller than the one produced without string merging.
>>> Compared to the entire executable size, the loss is only 3.5%.
>>>
>>> Here is a speedup in latency:
>>>
>>>   Before:
>>>
>>>      36098.025468 task-clock (msec)         #    5.256 CPUs utilized            ( +-  0.95% )
>>>           190,770 context-switches          #    0.005 M/sec                    ( +-  0.25% )
>>>             7,609 cpu-migrations            #    0.211 K/sec                    ( +- 11.40% )
>>>         2,378,416 page-faults               #    0.066 M/sec                    ( +-  0.07% )
>>>    99,645,202,279 cycles                    #    2.760 GHz                      ( +-  0.94% )
>>>    81,128,226,367 stalled-cycles-frontend   #   81.42% frontend cycles idle     ( +-  1.10% )
>>>   <not supported> stalled-cycles-backend
>>>    45,662,681,567 instructions              #    0.46  insns per cycle
>>>                                             #    1.78  stalled cycles per insn  ( +-  0.14% )
>>>     8,864,616,311 branches                  #  245.571 M/sec                    ( +-  0.22% )
>>>       146,360,227 branch-misses             #    1.65% of all branches          ( +-  0.06% )
>>>
>>>       6.868559257 seconds time elapsed                                          ( +-  0.50% )
>>>
>>>   After:
>>>
>>>      36905.733802 task-clock (msec)         #    7.061 CPUs utilized            ( +-  0.84% )
>>>           159,813 context-switches          #    0.004 M/sec                    ( +-  0.24% )
>>>             8,079 cpu-migrations            #    0.219 K/sec                    ( +- 12.67% )
>>>         2,296,298 page-faults               #    0.062 M/sec                    ( +-  0.21% )
>>>   102,178,380,224 cycles                    #    2.769 GHz                      ( +-  0.83% )
>>>    83,846,653,367 stalled-cycles-frontend   #   82.06% frontend cycles idle     ( +-  0.96% )
>>>   <not supported> stalled-cycles-backend
>>>    46,138,345,206 instructions              #    0.45  insns per cycle
>>>                                             #    1.82  stalled cycles per insn  ( +-  0.15% )
>>>     8,824,763,690 branches                  #  239.116 M/sec                    ( +-  0.24% )
>>>       142,482,338 branch-misses             #    1.61% of all branches          ( +-  0.05% )
>>>
>>>       5.227024403 seconds time elapsed                                          ( +-  0.43% )
>>>
>>> In terms of latency, this algorithm is a clear win.
>>>
>>> With these results, I have a feeling that this algorithm could be
>>> a reasonable addition to LLD. For only a few percent loss in size,
>>> it reduces latency by about 25%, so it might be a good option for
>>> daily edit-build-test cycles. (On the other hand, disabling string
>>> merging with -O0 creates 2x larger executables, which is sometimes
>>> inconvenient even for daily development cycles.) You can still pass
>>> -O2 to produce production binaries.
>>>
>>> I have another idea to reduce string merging latency, so I'll
>>> implement that later for comparison.
>>>
>>>
>>> https://reviews.llvm.org/D27146
>>>
>>> Files:
>>>   ELF/InputSection.h
>>>   ELF/OutputSections.cpp
>>>   ELF/OutputSections.h
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>
>