[PATCH] D27146: Merge strings using a probabilistic algorithm to reduce latency.
David Blaikie via llvm-commits
llvm-commits at lists.llvm.org
Mon Nov 28 09:44:22 PST 2016
Ah, awesome - looks great. Thanks for the reference!
On Mon, Nov 28, 2016 at 9:38 AM Rui Ueyama <ruiu at google.com> wrote:
> Sorry, I mean https://reviews.llvm.org/D27155
>
> On Mon, Nov 28, 2016 at 9:37 AM, Rui Ueyama <ruiu at google.com> wrote:
>
> I eventually wrote three patches for this, and
> https://reviews.llvm.org/D27146 is the most promising. (If you are not aware
> of that / haven't reached the top of your mail inbox yet.)
>
> On Mon, Nov 28, 2016 at 9:33 AM, David Blaikie <dblaikie at gmail.com> wrote:
>
> (not much to add except that I kind of love this - really neat
> idea/direction to pursue/play with possibilities)
>
> As for making this stable though probabilistic: any chance of seeding the
> RNG with a known value to get stability? (possibly using some of the input
> contents as the seed, if that's helpful) - still risks pathological cases,
> I suppose, but should be OK?
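One way to read the seeding suggestion above, as a minimal sketch rather than anything taken from the patch: derive the seed from the input bytes and feed it to a private RNG, so the same input always yields the same sample. The function name `pick_sample` and the 10% default are assumptions made for illustration, and the thread does not say whether the patch uses an RNG at all.

```python
import hashlib
import random

def pick_sample(strings, fraction=0.1):
    """Pick a reproducible sample of `strings` (hypothetical helper)."""
    # Hash the input contents to get the seed. hashlib is used because
    # Python's built-in hash() of bytes/str is randomized per process.
    digest = hashlib.sha256()
    for s in strings:
        digest.update(s)
        digest.update(b"\x00")  # separator so b"ab", b"c" != b"a", b"bc"
    seed = int.from_bytes(digest.digest()[:8], "little")

    # A private Random instance keeps the global RNG state untouched.
    rng = random.Random(seed)
    k = min(len(strings), max(1, int(len(strings) * fraction)))
    return rng.sample(strings, k)

data = [b"str%d" % i for i in range(100)]
first = pick_sample(data)
second = pick_sample(data)
```

Both calls return the same 10 strings. Hashing the whole input costs time of its own, so a real implementation would probably seed from a cheaper summary of the input; that trade-off is not explored here.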
>
> On Sat, Nov 26, 2016 at 9:11 PM Rui Ueyama via Phabricator via
> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>
> ruiu created this revision.
> ruiu added reviewers: rafael, silvas.
> ruiu added a subscriber: llvm-commits.
>
> I'm sending this patch to get feedback. I haven't convinced even myself
> that this is the right thing to do. But this should be interesting
> to those who want to see what we can do to improve the linker's latency.
>
> String merging is one of the slowest passes in LLD because of the
> sheer number of mergeable strings. For example, Clang with debug info
> contains 30 million mergeable strings (average length is about 50
> bytes). They need to be uniquified, and uniquified strings need to
> get consecutive offsets in the resulting string table.
>
> Currently, we are using a (single-threaded, regular) dense map for
> string uniquification. Merging the 30 million strings takes about 2
> seconds on my machine.
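As a rough illustration of the single-threaded baseline described above (not LLD's actual C++ code, which uses a dense map), the sketch below uses a Python dict to uniquify strings and hand out consecutive offsets. The name `merge_strings` is hypothetical.

```python
def merge_strings(strings):
    """Deduplicate `strings` and give each unique one a consecutive offset."""
    offsets = {}         # string -> offset of its first byte in the table
    table = bytearray()  # the output string table, NUL-terminated entries
    for s in strings:
        if s not in offsets:
            offsets[s] = len(table)
            table += s + b"\x00"
    return offsets, bytes(table)

offsets, table = merge_strings([b"foo", b"bar", b"foo"])
```

Here the second `b"foo"` reuses offset 0, so the table holds only two entries. Every lookup goes through the one shared map, which is why this form is hard to parallelize directly.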
>
> This patch implements one of my ideas about how to reduce latency by
> parallelizing it. This algorithm is probabilistic, meaning that
> even though duplicated strings are likely to be merged, that's not
> guaranteed. As a result, it produces a larger string table quickly.
> (If you need to optimize for size, you could still pass -O2 which
> does tail-merging.)
>
> Here's how it works.
>
> In the first step, we take 10% of the input string set to create a small
> string table. The resulting string table is very unlikely to contain
> all strings of the entire set, but it is likely to contain most of
> the duplicated strings, because duplicated strings are repeated many times.
>
> The second step processes the remaining 90% in parallel. In this step,
> we do not merge strings. So, if a string is not in the small string
> table we created in the first step, it will just be appended to the end
> of the string table. This step completes the string table.
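The two steps above can be sketched as follows. This is a sketch under stated assumptions, not the code in D27146: taking every 10th string as the 10% sample, the sharding scheme, and the name `probabilistic_merge` are all made up for illustration, and Python threads only show the structure (the GIL prevents a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor

def probabilistic_merge(strings, sample_every=10, shards=4):
    """Return (offsets, table_size); offsets[i] is the offset of strings[i]."""
    offsets = [None] * len(strings)

    # Step 1: deduplicate roughly 10% of the input (every 10th string),
    # single-threaded, to build the small table.
    small = {}  # string -> offset
    size = 0
    for i in range(0, len(strings), sample_every):
        s = strings[i]
        if s not in small:
            small[s] = size
            size += len(s) + 1  # +1 for the NUL terminator
        offsets[i] = small[s]

    # Step 2: handle the remaining ~90% in shards. `small` is only read
    # here, so workers need no locking. A miss is appended, never merged.
    rest = [i for i in range(len(strings)) if i % sample_every != 0]

    def process(shard):
        hits, misses, local = [], [], 0
        for i in shard:
            s = strings[i]
            if s in small:
                hits.append((i, small[s]))
            else:
                misses.append((i, local))  # offset relative to this shard
                local += len(s) + 1
        return hits, misses, local

    chunks = [rest[k::shards] for k in range(shards)]
    with ThreadPoolExecutor(max_workers=shards) as pool:
        results = list(pool.map(process, chunks))

    # Each shard's misses get a contiguous region after the small table.
    base = size
    for hits, misses, local in results:
        for i, off in hits:
            offsets[i] = off
        for i, off in misses:
            offsets[i] = base + off
        base += local
    return offsets, base

strings = [b"dup"] * 20 + [b"x", b"y", b"y"]
offsets, table_size = probabilistic_merge(strings)
```

In this input, `b"dup"` is sampled and all 20 copies share offset 0, while the two `b"y"` strings miss the sample and are each appended, giving a 10-byte table where exact merging would need 8. That is the size-for-latency trade the description makes.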
>
> Here are some numbers for the resulting clang executables:
>
> Size of .debug_str section:
>   Current              108,049,822  (+0%)
>   Probabilistic        154,089,550  (+42.6%)
>   No string merging  1,591,388,940  (+1372.8%)
>
> Size of resulting file:
>   Current            1,440,453,528  (+0%)
>   Probabilistic      1,490,597,448  (+3.5%)
>   No string merging  2,945,020,808  (+104.5%)
>
> The probabilistic algorithm produces a larger string table, but it is
> still much smaller than the one produced without string merging. Compared
> to the entire executable size, the loss is only 3.5%.
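For readers who want to check the size figures, the growth relative to the current (exact-merging) output can be recomputed from the byte counts quoted above. The helper name `growth_pct` is made up for this sketch; the sizes are copied from the message.

```python
# Sizes in bytes, as quoted in the message.
str_current, str_prob, str_none = 108_049_822, 154_089_550, 1_591_388_940
file_current, file_prob, file_none = 1_440_453_528, 1_490_597_448, 2_945_020_808

def growth_pct(size, baseline):
    """Percentage growth of `size` over `baseline`."""
    return (size / baseline - 1) * 100

str_prob_growth = round(growth_pct(str_prob, str_current), 1)
str_none_growth = round(growth_pct(str_none, str_current), 1)
file_prob_growth = round(growth_pct(file_prob, file_current), 1)
file_none_growth = round(growth_pct(file_none, file_current), 1)
```

Computed this way, the probabilistic .debug_str is about 42.6% larger and the file about 3.5% larger, while disabling merging gives a .debug_str roughly 14.7x the size (about +1372.8%) and a file roughly 2x the size (about +104.5%).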
>
> Here is the speedup in latency:
>
> Before:
>
>     36098.025468  task-clock (msec)        # 5.256 CPUs utilized           ( +- 0.95% )
>          190,770  context-switches         # 0.005 M/sec                   ( +- 0.25% )
>            7,609  cpu-migrations           # 0.211 K/sec                   ( +- 11.40% )
>        2,378,416  page-faults              # 0.066 M/sec                   ( +- 0.07% )
>   99,645,202,279  cycles                   # 2.760 GHz                     ( +- 0.94% )
>   81,128,226,367  stalled-cycles-frontend  # 81.42% frontend cycles idle   ( +- 1.10% )
>  <not supported>  stalled-cycles-backend
>   45,662,681,567  instructions             # 0.46 insns per cycle
>                                            # 1.78 stalled cycles per insn  ( +- 0.14% )
>    8,864,616,311  branches                 # 245.571 M/sec                 ( +- 0.22% )
>      146,360,227  branch-misses            # 1.65% of all branches         ( +- 0.06% )
>
>      6.868559257 seconds time elapsed                                      ( +- 0.50% )
>
> After:
>
>     36905.733802  task-clock (msec)        # 7.061 CPUs utilized           ( +- 0.84% )
>          159,813  context-switches         # 0.004 M/sec                   ( +- 0.24% )
>            8,079  cpu-migrations           # 0.219 K/sec                   ( +- 12.67% )
>        2,296,298  page-faults              # 0.062 M/sec                   ( +- 0.21% )
>  102,178,380,224  cycles                   # 2.769 GHz                     ( +- 0.83% )
>   83,846,653,367  stalled-cycles-frontend  # 82.06% frontend cycles idle   ( +- 0.96% )
>  <not supported>  stalled-cycles-backend
>   46,138,345,206  instructions             # 0.45 insns per cycle
>                                            # 1.82 stalled cycles per insn  ( +- 0.15% )
>    8,824,763,690  branches                 # 239.116 M/sec                 ( +- 0.24% )
>      142,482,338  branch-misses            # 1.61% of all branches         ( +- 0.05% )
>
>      5.227024403 seconds time elapsed                                      ( +- 0.43% )
>
> In terms of latency, this algorithm is a clear win.
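The two perf runs can be compared with a little arithmetic. The sketch below only reuses the task-clock and elapsed-time figures from the output above; the dictionary layout and function name are illustrative.

```python
before = {"task_clock_ms": 36098.025468, "elapsed_s": 6.868559257}
after = {"task_clock_ms": 36905.733802, "elapsed_s": 5.227024403}

def cpus_utilized(run):
    """Average number of busy CPUs: CPU time divided by wall-clock time."""
    return run["task_clock_ms"] / 1000 / run["elapsed_s"]

# Wall-clock reduction and extra total CPU time, both in percent.
latency_cut_pct = (1 - after["elapsed_s"] / before["elapsed_s"]) * 100
extra_cpu_pct = (after["task_clock_ms"] / before["task_clock_ms"] - 1) * 100
```

Wall-clock time drops by about 23.9% while total CPU time rises by only about 2.2%, and the average number of busy CPUs goes from about 5.26 to about 7.06. In other words, the gain comes from spreading roughly the same work over more cores, not from doing less work.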
>
> With these results, I have a feeling that this algorithm could be
> a reasonable addition to LLD. For a loss of only a few percent in size,
> it reduces latency by about 24% (6.87 s to 5.23 s), so it might be a good
> option for daily edit-build-test cycles. (On the other hand, disabling
> string merging with -O0 creates 2x larger executables, which is sometimes
> inconvenient even for a daily development cycle.) You can still pass
> -O2 to produce production binaries.
>
> I have another idea to reduce string merging latency, so I'll
> implement that later for comparison.
>
>
> https://reviews.llvm.org/D27146
>
> Files:
> ELF/InputSection.h
> ELF/OutputSections.cpp
> ELF/OutputSections.h
>