[PATCH] D36351: [lld][ELF] Add profile guided section layout

Rafael Avila de Espindola via llvm-commits llvm-commits at lists.llvm.org
Thu Feb 8 16:04:50 PST 2018


Looking a bit more into why I might not be measuring a performance
improvement, I noticed that the call graph I was using was missing
conditional calls. The attached script fixes that.
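For reference, here is a minimal sketch of the kind of extraction the
script does. The trace format, symbol-map format, and names below are
assumptions for illustration only, not the contents of the attached
get-call-graph.py:

#!/usr/bin/env python3
# Hypothetical sketch in the spirit of get-call-graph.py; the trace and
# symbol-map formats assumed here are illustrative, not the real ones.
import collections
import sys

def load_symbol_map(path):
    # Assumed "start-address size name" lines, one per function, hex fields.
    syms = []
    with open(path) as f:
        for line in f:
            addr, size, name = line.split()[:3]
            start = int(addr, 16)
            syms.append((start, start + int(size, 16), name))
    return sorted(syms)

def find_symbol(syms, addr):
    # Binary search for the function containing addr, or None.
    lo, hi = 0, len(syms)
    while lo < hi:
        mid = (lo + hi) // 2
        start, end, name = syms[mid]
        if addr < start:
            hi = mid
        elif addr >= end:
            lo = mid + 1
        else:
            return name
    return None

def main():
    syms = load_symbol_map(sys.argv[1])
    edges = collections.Counter()
    # The trace on stdin is assumed to hold one "<from-addr> <to-addr>" pair
    # per executed call-type branch, conditional calls included (those were
    # the ones the earlier version of the script dropped).
    for line in sys.stdin:
        src, dst = line.split()[:2]
        caller = find_symbol(syms, int(src, 16))
        callee = find_symbol(syms, int(dst, 16))
        if caller and callee:
            edges[(caller, callee)] += 1
    # One "count caller callee" line per edge, hottest first.
    for (caller, callee), count in edges.most_common():
        print(count, caller, callee)

if __name__ == '__main__':
    main()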

I have uploaded a new version of the test with the complete call graph
to https://s3-us-west-2.amazonaws.com/linker-tests/t2.tar.xz.

I also noticed that we were not considering the case of multiple symbols
in the same section. The attached patch fixes that.
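The idea behind that fix, sketched in Python rather than the actual lld
C++ of the attached t.diff (names here are illustrative):

import collections

def build_section_graph(symbol_edges, symbol_to_section):
    # symbol_edges: {(caller_sym, callee_sym): count}
    # symbol_to_section: {sym: section}; several symbols may map to the
    # same input section, so their counts must be folded together.
    section_edges = collections.Counter()
    for (caller, callee), count in symbol_edges.items():
        from_sec = symbol_to_section.get(caller)
        to_sec = symbol_to_section.get(callee)
        # Edges within one section do not affect the section order, so
        # drop them along with symbols we cannot map.
        if from_sec is None or to_sec is None or from_sec == to_sec:
            continue
        section_edges[(from_sec, to_sec)] += count
    return section_edges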

Even with these changes I still get an iTLB regression.
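For scale, the relative changes in the run quoted below work out as
follows (plain arithmetic on the numbers already in this thread):

default = {'iTLB': 498_771, 'L1i': 224_751_360, 'seconds': 2.339864606}
sorted_ = {'iTLB': 556_999, 'L1i': 216_788_838, 'seconds': 2.326596163}
for key in default:
    change = (sorted_[key] - default[key]) / default[key] * 100
    print(f'{key}: {change:+.1f}%')
# Prints roughly: iTLB: +11.7%, L1i: -3.5%, seconds: -0.6%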

I am now going to try building hfsort and compare its results.
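For context, the heuristic hfsort implements (call-chain clustering) is
roughly: walk functions from hottest to coldest and append each one's
cluster to the cluster of its most frequent caller, unless the combined
cluster would outgrow a page. A rough Python paraphrase of my reading of
the published description, not the hfsort sources:

PAGE_SIZE = 4096

def c3_order(funcs, sizes, samples, edges):
    # funcs: function names; sizes/samples: dicts keyed by name;
    # edges: {(caller, callee): call count}.  Returns a layout order.
    cluster_of = {f: [f] for f in funcs}
    cluster_size = {f: sizes[f] for f in funcs}

    for callee in sorted(funcs, key=lambda f: samples[f], reverse=True):
        callers = [(c, w) for (c, d), w in edges.items()
                   if d == callee and c != callee]
        if not callers:
            continue
        caller = max(callers, key=lambda cw: cw[1])[0]
        a, b = cluster_of[caller], cluster_of[callee]
        if a is b:
            continue
        # Keep hot clusters no bigger than a page.
        if cluster_size[a[0]] + cluster_size[b[0]] > PAGE_SIZE:
            continue
        # The caller's cluster precedes the callee's.
        a.extend(b)
        cluster_size[a[0]] += cluster_size[b[0]]
        for f in b:
            cluster_of[f] = a

    order, seen = [], set()
    for f in sorted(funcs, key=lambda f: samples[f], reverse=True):
        cluster = cluster_of[f]
        if id(cluster) not in seen:
            seen.add(id(cluster))
            order.extend(cluster)
    return order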

Please upload a new patch rebased on top of the current tree and include the attached fixes.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: get-call-graph.py
Type: application/octet-stream
Size: 1347 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180208/da95f323/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: t.diff
Type: text/x-patch
Size: 637 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20180208/da95f323/attachment.bin>
-------------- next part --------------

Thanks,
Rafael

Michael Spencer <bigcheesegs at gmail.com> writes:

> On Thu, Feb 8, 2018 at 10:41 AM, Rafael Avila de Espindola <
> rafael.espindola at gmail.com> wrote:
>
>> Michael Spencer <bigcheesegs at gmail.com> writes:
>>
>> > On Tue, Feb 6, 2018 at 6:53 PM, Rafael Avila de Espindola <
>> > rafael.espindola at gmail.com> wrote:
>> >
>> >> I have benchmarked this by timing lld LTOing FileCheck. The working set
>> >> is much larger this time: the old call graph had 4079 calls, this one has
>> >> 30616.
>> >>
>> >> The results are somewhat similar:
>> >>
>> >>  Performance counter stats for '../default-ld.lld @response.txt' (10 runs):
>> >>
>> >>            498,771      iTLB-load-misses          ( +-  0.10% )
>> >>        224,751,360      L1-icache-load-misses     ( +-  0.00% )
>> >>
>> >>        2.339864606 seconds time elapsed           ( +-  0.06% )
>> >>
>> >>  Performance counter stats for '../sorted-ld.lld @response.txt' (10 runs):
>> >>
>> >>            556,999      iTLB-load-misses          ( +-  0.17% )
>> >>        216,788,838      L1-icache-load-misses     ( +-  0.01% )
>> >>
>> >>        2.326596163 seconds time elapsed           ( +-  0.04% )
>> >>
>> >> As with the previous test, iTLB gets worse and L1 gets better. The net
>> >> result is a very small speedup.
>> >>
>> >> Do you know how big the chromium call graph is?
>> >>
>> >
> Not sure, but the call graph for a high profile internal game I tested is
> about 10k functions and 17 MiB of .text, and I got a 2%-4% speedup. Given
> that it's a game, it runs a decent portion of that 17 MiB 60 times a second,
> while llvm is heavily pass based, so I don't expect the instruction working
> set over a small period of time to be that high.
>>
>> One difference between the paper and the script I am using to create the
>> call graph is that my script records the exact number of times each call
>> happens. The script is attached.
>>
>> With sampling, a single call foo->long_running_bar would be sampled
>> multiple times and show up as multiple calls.
>>
>> The exact count seems better, but I wonder if sampling somehow produces a
>> better result.
>>
>> With instrumentation (which I assume is what you used in the game), you
>> also get an exact callgraph, no?
>>
>
> You get an exact call graph minus indirect calls, as those currently aren't
> captured.
>
> - Michael Spencer
>
>
>>
>> > I am however surprised by the 10% increase in iTLB misses.
>>
>> Cheers,
>> Rafael

