[LLVMdev] On LLD performance
chisophugis at gmail.com
Tue Mar 17 12:39:47 PDT 2015
On Mon, Mar 16, 2015 at 11:39 PM, Davide Italiano <davide at freebsd.org> wrote:
> On Tue, Mar 17, 2015 at 7:17 AM, Sean Silva <chisophugis at gmail.com> wrote:
> > On Mon, Mar 16, 2015 at 10:52 PM, Davide Italiano <davide at freebsd.org>
> > wrote:
> >> On Mon, Mar 16, 2015 at 1:54 AM, Davide Italiano <davide at freebsd.org>
> >> wrote:
> >> >
> >> > Shankar's parallel-for change did not, by itself, bring any performance
> >> > benefit (or regression).
> >> > If the change I propose is safe, I would like to see Shankar's change
> >> > in (and this on top of it).
> >> > I have other related changes coming next, but I would like to tackle
> >> > them one at a time.
> >> >
> >> Here's an update.
> >> After http://reviews.llvm.org/D8372 , I updated the profiling data.
> >> https://people.freebsd.org/~davide/llvm/lld-03162015.svg
> >> It now seems that 85% of CPU time is spent inside
> >> FileArchive::buildTableOfContents().
> > I'm rather amazed that that patch changed the total CPU time. Just doing
> > work in parallel shouldn't reduce the total CPU time spent on the task. A
> > reduction in CPU time would happen though if parallelizing it increased
> > single-threaded performance of the tasks being done in parallel. Perhaps
> > using multiple cores means we are using multiple caches, so each thread
> > gets much better single-threaded performance because of reduced memory
> > bottlenecks?
> > -- Sean Silva
> >> In particular, 35% of the samples fall in insertions into
> >> unordered_map, so there may be something we can do differently there
> >> (e.g., Rui's proposal of a concurrent map doesn't seem that bad).
> >> Thanks,
> >> --
> >> Davide
> >> "There are no solved problems; there are only problems that are more
> >> or less solved" -- Henri Poincare
> David, Thanks for the input. I'll try DenseMap tomorrow and report results.
> Sean, I personally was amazed by that too. I cannot exclude some
> errors in the sampling for hwpmc,
If you just count cache misses with hardware counters (counting mode, no
sampling), the overhead should be close to zero. Just comparing the total
cache-miss counts of the two runs should give some insight.
> I'll try to repeat the profiling
> and/or use another profiler to see if I can confirm the results.
> About your other answer, I guess that would require a more
> fine-grained analysis which includes memory bandwidth, cache misses,
> etc. I'll try to get to it later this week or over the weekend. For
> now, I'm just focusing on CPU profiling.
A simple non-fine-grained way to sanity-check the hypothesis is to
enable/disable hyperthreading and/or restrict LLD to run on cores that
share/don't share hardware cache resources. The hypothesis is that the
total CPU time should be relatively insensitive to adding/removing extra
execution resources that don't also add cache resources, while it should be
relatively sensitive to adding/removing cache resources that don't change
execution resources (e.g. pin LLD to 2 cores that share a cache vs pin LLD
to two cores that don't share that cache; or pin LLD to 8 threads, one on
each core vs. pinning LLD to 8 threads, two per core (hyperthreading)).
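
A minimal sketch of the pinning experiment described above, written in Python and assuming a Linux host (os.sched_setaffinity is not available everywhere; on FreeBSD a tool such as cpuset(1) would have to take its place). The helper name cpu_time_pinned and the CPU ids in the usage comment are hypothetical; which ids actually share a cache is machine-specific and has to be looked up separately.

```python
import os
import resource
import subprocess


def cpu_time_pinned(cmd, cpus):
    """Run cmd restricted to the CPU ids in cpus.

    Returns the user + system CPU seconds the command consumed,
    summed over all of its threads (not wall-clock time).
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    # The affinity mask is set in the child just before exec, so every
    # thread the command spawns inherits the restriction.
    subprocess.run(
        cmd,
        check=True,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)


# Hypothetical usage: link_cmd is the full LLD command line as a list,
# and the CPU ids are placeholders for "share a cache" vs. "don't".
#   shared   = cpu_time_pinned(link_cmd, {0, 1})
#   separate = cpu_time_pinned(link_cmd, {0, 4})
```

If the hypothesis holds, the total CPU time for the two runs should differ noticeably even though both use the same number of cores. Repeating each run a few times helps separate the effect from noise.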
-- Sean Silva