<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Mar 16, 2015 at 11:39 PM, Davide Italiano <span dir="ltr"><<a href="mailto:davide@freebsd.org" target="_blank">davide@freebsd.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Tue, Mar 17, 2015 at 7:17 AM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>

><br>

><br>

> On Mon, Mar 16, 2015 at 10:52 PM, Davide Italiano <<a href="mailto:davide@freebsd.org">davide@freebsd.org</a>><br>

> wrote:<br>

>><br>

>> On Mon, Mar 16, 2015 at 1:54 AM, Davide Italiano <<a href="mailto:davide@freebsd.org">davide@freebsd.org</a>><br>

>> wrote:<br>

>> ><br>

>> > Shankar's parallel for per-se didn't introduce any performance benefit<br>

>> > (or regression).<br>

>> > If the change I propose is safe, I would like to see Shankar's change<br>

>> > in (and this on top of it).<br>

>> > I have other related changes coming next, but I would like to tackle<br>

>> > them one at a time.<br>

>> ><br>

>><br>

>> Here's an update.<br>

>><br>

>> After <a href="http://reviews.llvm.org/D8372" target="_blank">http://reviews.llvm.org/D8372</a> , I updated the profiling data.<br>

>><br>

>> <a href="https://people.freebsd.org/~davide/llvm/lld-03162015.svg" target="_blank">https://people.freebsd.org/~davide/llvm/lld-03162015.svg</a><br>

>> It seems now 85% of CPU time is spent inside<br>

>> FileArchive::buildTableOfContents().<br>

><br>

><br>

> I'm rather amazed that that patch changed the total CPU time. Just doing the<br>

> work in parallel shouldn't reduce the total CPU time spent on the task. A<br>

> reduction in CPU time would happen though if parallelizing it increased the<br>

> single-threaded performance of the tasks being done in parallel. Perhaps<br>

> using multiple cores means we are using multiple caches, so each thread is<br>

> getting much better single-threaded performance due to reduced memory<br>

> bottlenecking?<br>

><br>

> -- Sean Silva<br>

><br>

>><br>

>> In particular, 35% of the samples are spent inserting into<br>

>> unordered_map, so there's maybe something we can do differently there<br>

>> (e.g., Rui's proposal of a concurrent map doesn't seem that bad).<br>

>><br>

>> Thanks,<br>

>><br>

>> --<br>

>> Davide<br>

>><br>

>> "There are no solved problems; there are only problems that are more<br>

>> or less solved" -- Henri Poincare<br>

>> _______________________________________________<br>

>> LLVM Developers mailing list<br>

>> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

>> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

><br>

><br>

<br>

</div></div>David, Thanks for the input. I'll try DenseMap tomorrow and report results.<br>

Sean, I personally was amazed by that too. I cannot exclude some<br>

errors in the sampling for hwpmc,</blockquote><div><br></div><div>If you just count cache misses with a hardware counter (no sampling), it should be zero-overhead. Even just comparing total cache-miss counts across runs should give some insight.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> I'll try to repeat the profiling<br>

and/or use another profiler to see if I can confirm the results.<br>

About your other answer, I guess that would require a more<br>

fine-grained analysis which includes memory bandwidth, cache misses<br>

etc. I'll try to get to it later this week or over the weekend. For<br>

now, I'm just focusing on CPU profiling.<br></blockquote><div><br></div><div>A simple, coarse-grained way to sanity-check the hypothesis is to enable/disable hyperthreading and/or restrict LLD to cores that do or don't share hardware cache resources. The hypothesis predicts that total CPU time should be relatively insensitive to adding/removing execution resources that don't also add cache resources, and relatively sensitive to adding/removing cache resources without changing execution resources. For example, pin LLD to two cores that share a cache vs. two cores that don't share that cache; or pin LLD to 8 threads, one per core, vs. 8 threads, two per core (hyperthreading).</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5"><br>

Thanks,<br>

<br>

--<br>

Davide<br>

<br>

"There are no solved problems; there are only problems that are more<br>

or less solved" -- Henri Poincare<br>

</div></div></blockquote></div><br></div></div>
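<div>The pinning experiment described above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper name <code>child_cpu_seconds</code> is made up for illustration, <code>os.sched_setaffinity</code> is Linux-only (on FreeBSD the cpuset(1) utility plays a similar role), and which CPU ids share a cache is machine-specific and has to be looked up separately.</div>

```python
import os
import resource
import subprocess
import sys


def child_cpu_seconds(cmd, cpus):
    """Run cmd pinned to the CPU ids in `cpus`; return its user+system CPU seconds.

    Hypothetical helper, not part of LLD or any existing tool.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    # The affinity call runs in the child just before exec (Linux-only).
    subprocess.run(
        cmd,
        check=True,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
    )
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # RUSAGE_CHILDREN accumulates over all waited-for children, so take a delta.
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return user + system
```

<div>One would call it twice with the same link command, once with a CPU set whose cores share a cache and once with a set whose cores don't, and compare the two totals. If the hypothesis holds, the totals should differ noticeably even though the number of cores is the same.</div>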