[PATCH] D36351: [lld][ELF] Add profile guided section layout

Thu Feb 8 16:35:22 PST 2018

On Thu, Feb 8, 2018 at 4:04 PM, Rafael Avila de Espindola <
rafael.espindola at gmail.com> wrote:

> Looking a bit more into why I might not be measuring a performance
> improvement I noticed that the callgraph I was using was missing
> conditional calls. The attached script fixes that.
>
> I have uploaded a new version of the test with the complete call graph
> to https://s3-us-west-2.amazonaws.com/linker-tests/t2.tar.xz.
>
> I also noticed that we were not considering the case of multiple symbols
> in the same section. The attached patch fixes that.
>

This is already handled in `CallGraphSort::CallGraphSort()` here:

    NodeIndex From = GetOrCreateNode(FromSB);
    NodeIndex To = GetOrCreateNode(ToSB);

    Nodes[To].Weight = SaturatingAdd(Nodes[To].Weight, Weight);

    if (From == To)
      continue;

The `From == To` check is deliberately placed after the node weight
adjustment, so that the density calculation later on still takes calls
within the same section into account.
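
To make the ordering concrete, here is a minimal standalone sketch of that
loop. It is not the code from the patch: `Call`, `Graph`, `buildGraph` and
`weightOf` are hypothetical stand-ins, sections are identified by plain
strings, and merging repeated edges is my assumption for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one call graph profile entry.
struct Call {
  std::string From, To;
  uint64_t Weight;
};

// Hypothetical stand-in for the graph built by the sorter.
struct Graph {
  std::map<std::string, size_t> Index;                      // section -> node id
  std::vector<uint64_t> NodeWeight;                         // weight per node
  std::map<std::pair<size_t, size_t>, uint64_t> EdgeWeight; // (from, to) -> weight

  size_t getOrCreateNode(const std::string &Sec) {
    auto It = Index.find(Sec);
    if (It != Index.end())
      return It->second;
    size_t Id = NodeWeight.size();
    Index.emplace(Sec, Id);
    NodeWeight.push_back(0);
    return Id;
  }

  uint64_t weightOf(const std::string &Sec) const {
    auto It = Index.find(Sec);
    assert(It != Index.end() && "unknown section");
    return NodeWeight[It->second];
  }
};

// Adds two weights, clamping at the maximum instead of wrapping around.
static uint64_t saturatingAdd(uint64_t A, uint64_t B) {
  uint64_t R = A + B;
  return R < A ? std::numeric_limits<uint64_t>::max() : R;
}

Graph buildGraph(const std::vector<Call> &Calls) {
  Graph G;
  for (const Call &C : Calls) {
    size_t From = G.getOrCreateNode(C.From);
    size_t To = G.getOrCreateNode(C.To);

    // Credit the callee first, so a call within one section still
    // contributes to that section's weight.
    G.NodeWeight[To] = saturatingAdd(G.NodeWeight[To], C.Weight);

    // Only now skip self-calls: they add weight but no edge.
    if (From == To)
      continue;

    uint64_t &E = G.EdgeWeight[std::make_pair(From, To)];
    E = saturatingAdd(E, C.Weight);
  }
  return G;
}
```

With the entries `a -> a` (10), `b -> a` (5) and `a -> b` (3), node `a` ends
up with weight 15 and only two edges are created. If the `From == To` check
came before the weight update, `a` would get only 5.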

- Michael Spencer

>
> Even with these changes I still get an iTLB regression.
>
> I am now going to try building hfsort and compare its results.
>
> Please upload a new patch on top of tree and include the attached fixes.
>
>
> diff --git a/ELF/Driver.cpp b/ELF/Driver.cpp
> index 1fc9a0ad5..a8d28de4c 100644
> --- a/ELF/Driver.cpp
> +++ b/ELF/Driver.cpp
> @@ -590,8 +590,12 @@ static void readCallGraph(MemoryBufferRef MB) {
>        fatal("parse error");
>      InputSectionBase *FromSec = SymbolSection.lookup(Fields[0]);
>      InputSectionBase *ToSec = SymbolSection.lookup(Fields[1]);
> -    if (FromSec && ToSec)
> -      Config->CallGraphProfile[std::make_pair(FromSec, ToSec)] = Count;
> +    if (FromSec == ToSec)
> +      continue;
> +    if (FromSec && ToSec) {
> +      uint64_t &V = Config->CallGraphProfile[std::make_pair(FromSec, ToSec)];
> +      V += Count;
> +    }
>    }
>  }
>
>
>
> Thanks,
> Rafael
>
> Michael Spencer <bigcheesegs at gmail.com> writes:
>
> > On Thu, Feb 8, 2018 at 10:41 AM, Rafael Avila de Espindola <
> > rafael.espindola at gmail.com> wrote:
> >
> >> Michael Spencer <bigcheesegs at gmail.com> writes:
> >>
> >> > On Tue, Feb 6, 2018 at 6:53 PM, Rafael Avila de Espindola <
> >> > rafael.espindola at gmail.com> wrote:
> >> >
> >> >> I have benchmarked this by timing lld ltoing FileCheck. The working set
> >> >> is much larger this time. The old callgraph had 4079 calls, this one has
> >> >> 30616.
> >> >>
> >> >> The results are somewhat similar:
> >> >>
> >> >>  Performance counter stats for '../default-ld.lld @response.txt' (10 runs):
> >> >>
> >> >>            498,771      iTLB-load-misses             ( +-  0.10% )
> >> >>        224,751,360      L1-icache-load-misses        ( +-  0.00% )
> >> >>
> >> >>        2.339864606 seconds time elapsed              ( +-  0.06% )
> >> >>
> >> >>  Performance counter stats for '../sorted-ld.lld @response.txt' (10 runs):
> >> >>
> >> >>            556,999      iTLB-load-misses             ( +-  0.17% )
> >> >>        216,788,838      L1-icache-load-misses        ( +-  0.01% )
> >> >>
> >> >>        2.326596163 seconds time elapsed              ( +-  0.04% )
> >> >>
> >> >> As with the previous test iTLB gets worse and L1 gets better. The net
> >> >> result is a very small speedup.
> >> >>
> >> >> Do you know how big the chromium call graph is?
> >> >>
> >> >
> >> > Not sure, but the call graph for a high profile internal game I tested is
> >> > about 10k functions and 17 MiB of .text, and I got a 2%-4% speedup. Given
> >> > that it's a game, it runs a decent portion of that 17 MiB 60 times a second,
> >> > while llvm is heavily pass based, so I don't expect the instruction working
> >> > set over a small period of time to be that high.
> >>
> >> One difference between the paper and the script I am using to create the
> >> call graph is that my script records every call the exact number of
> >> times it occurs. The script is attached.
> >>
> >> With sampling, a call foo->long_running_bar would be recorded multiple
> >> times and show up as multiple calls.
> >>
> >> The first seems better, but I wonder if sampling somehow produces a
> >> better result.
> >>
> >> With instrumentation (which I assume is what you used in the game), you
> >> also get an exact callgraph, no?
> >>
> >
> > You get an exact callgraph minus indirect calls as those currently aren't
> > captured.
> >
> > - Michael Spencer
> >
> >
> >>
> >> >
> >> > I am however surprised by the 10% increase in iTLB misses.
> >>
> >>
> >>
> >>
> >> Cheers,
> >> Rafael
> >>
> >>
>
>