<div dir="ltr">Hi,<div><br></div><div>Out of interest, given the major churn going on in the test-suite at the moment, is now the right time to discuss how best to replace the utterly archaic and incomprehensible makefile system?</div>

<div><br></div><div>Cheers,</div><div><br></div><div>James</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 20 May 2014 14:54, Yi Kong <span dir="ltr"><<a href="mailto:kongy.dev@gmail.com" target="_blank">kongy.dev@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">It's a really strange test case. The spikes disappear when I measure<br>

cache/TLB misses using perf.<br>

<div class="HOEnZb"><div class="h5"><br>

On 20 May 2014 13:47, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>

> ----- Original Message -----<br>

>> From: "Yi Kong" <<a href="mailto:kongy.dev@gmail.com">kongy.dev@gmail.com</a>><br>

>> To: "Tobias Grosser" <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>><br>

>> Cc: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>>, "Eric Christopher" <<a href="mailto:echristo@gmail.com">echristo@gmail.com</a>>, "llvm-commits"<br>


>> <<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>><br>

>> Sent: Tuesday, May 20, 2014 7:11:27 AM<br>

>> Subject: Re: [PATCH] Add benchmarking-only mode to the test suite<br>

>><br>

>> Tobias, I can't reproduce your findings on my machine. Even if I<br>

>> disabled output (removing -DPOLYBENCH_DUMP_ARRAYS) and piped to<br>

>> /dev/null, I still get lots of spikes. I think we need to exclude<br>

>> those tests until we find out how to stabilize those results.<br>

><br>

> Okay, I'll also exclude them for now. How large is the working set? Could you be seeing TLB misses?<br>

><br>

>  -Hal<br>

><br>

>><br>

>> On 18 May 2014 12:08, Yi Kong <<a href="mailto:kongy.dev@gmail.com">kongy.dev@gmail.com</a>> wrote:<br>

>> > I think that's due to the vast amount of output it produces. Maybe<br>

>> > replacing the output with an accumulator will give a more stable<br>

>> > result?<br>

>> ><br>

>> > On 17 May 2014 22:34, Tobias Grosser <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>> wrote:<br>

>> >> On 17/05/2014 14:08, Yi Kong wrote:<br>

>> >>><br>

>> >>> On 16 May 2014 15:25, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>

>> >>>><br>

>> >>>> ----- Original Message -----<br>

>> >>>>><br>

>> >>>>> From: "Yi Kong" <<a href="mailto:kongy.dev@gmail.com">kongy.dev@gmail.com</a>><br>

>> >>>>> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

>> >>>>> Cc: "Eric Christopher" <<a href="mailto:echristo@gmail.com">echristo@gmail.com</a>>, "llvm-commits"<br>

>> >>>>> <<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>>, "Tobias Grosser"<br>

>> >>>>> <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>><br>

>> >>>>> Sent: Thursday, May 15, 2014 5:41:04 PM<br>

>> >>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test<br>

>> >>>>> suite<br>

>> >>>>><br>

>> >>>>> On 15 May 2014 13:59, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>

>> >>>>>><br>

>> >>>>>> ----- Original Message -----<br>

>> >>>>>>><br>

>> >>>>>>> From: "Yi Kong" <<a href="mailto:kongy.dev@gmail.com">kongy.dev@gmail.com</a>><br>

>> >>>>>>> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

>> >>>>>>> Cc: "Eric Christopher" <<a href="mailto:echristo@gmail.com">echristo@gmail.com</a>>, "llvm-commits"<br>

>> >>>>>>> <<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>>, "Tobias Grosser"<br>

>> >>>>>>> <<a href="mailto:tobias@grosser.es">tobias@grosser.es</a>><br>

>> >>>>>>> Sent: Thursday, May 15, 2014 5:26:54 AM<br>

>> >>>>>>> Subject: Re: [PATCH] Add benchmarking-only mode to the test<br>

>> >>>>>>> suite<br>

>> >>>>>>><br>

>> >>>>>>> Hi Hal Finkel,<br>

>> >>>>>>><br>

>> >>>>>>> What criteria do you use to decide which benchmarks are useful?<br>

>> >>>>>><br>

>> >>>>>><br>

>> >>>>>> Please refer to the LLVMDev thread "[RFC] Benchmarking subset<br>

>> >>>>>> of<br>

>> >>>>>> the test suite" in which I explain my methodology in detail.<br>

>> >>>>><br>

>> >>>>><br>

>> >>>>> I think the approach you've taken is indeed sensible. However I<br>

>> >>>>> don't<br>

>> >>>>> really agree with your make -j6 option. The Xeon chip you are<br>

>> >>>>> testing<br>

>> >>>>> on only has 4 cores, which means a lot of context switching<br>

>> >>>>> happens. The<br>

>> >>>><br>

>> >>>><br>

>> >>>> It is a dual-socket machine.<br>

>> >>>><br>

>> >>>>> noise produced by that would be far too great for a "normal"<br>

>> >>>>> environment. Also I believe that the testing machine should be<br>

>> >>>>> as<br>

>> >>>>> quiet as possible, otherwise we are actually measuring the<br>

>> >>>>> noise!<br>

>> >>>><br>

>> >>>><br>

>> >>>> This is obviously ideal, but rarely possible in practice. More<br>

>> >>>> to the<br>

>> >>>> point, the buildbots are not quiet, but we still want to be able<br>

>> >>>> to extract<br>

>> >>>> execution-time changes from them without a large number of false<br>

>> >>>> positives.<br>

>> >>>> Some tests are just too sensitive to I/O time, or are too short,<br>

>> >>>> for this to<br>

>> >>>> be possible (because you really are just seeing the noise), and<br>

>> >>>> this<br>

>> >>>> exclusion list is meant to exclude such tests. Given a sufficient<br>

>> >>>> number of<br>

>> >>>> samples (10, for example), I've confirmed that it is possible to<br>

>> >>>> extract<br>

>> >>>> meaningful timing differences from the others at high<br>

>> >>>> confidence.<br>

>> >>>><br>

>> >>>>><br>

>> >>>>> I've been investigating the timeit tool in the test suite. It turns<br>

>> >>>>> out<br>

>> >>>>> to<br>

>> >>>>> be really inaccurate, and sometimes it's the main source of<br>

>> >>>>> noise we<br>

>> >>>>> are seeing. I've implemented time measurement using the Linux perf<br>

>> >>>>> tool.<br>

>> >>>>> So<br>

>> >>>>> far it seems to produce much better results. I will publish the<br>

>> >>>>> findings with the patch in a separate thread once I've gathered<br>

>> >>>>> enough<br>

>> >>>>> data points. Maybe with the more accurate timing tool, we might<br>

>> >>>>> get a<br>

>> >>>>> different picture.<br>

>> >>>><br>

>> >>>><br>

>> >>>> That's great!<br>

>> >>>><br>

>> >>>>><br>

>> >>>>>><br>

>> >>>>>>><br>

>> >>>>>>> I suggest you also have a look at the standard deviation<br>

>> >>>>>>> or<br>

>> >>>>>>> MAD.<br>

>> >>>>>><br>

>> >>>>>><br>

>> >>>>>> Of course this has already been considered and taken into<br>

>> >>>>>> account<br>

>> >>>>>> ;)<br>

>> >>>>>><br>

>> >>>>>>> Some of the tests have really large variance that we may not<br>

>> >>>>>>> want<br>

>> >>>>>>> to<br>

>> >>>>>>> include when benchmarking, e.g.<br>

>> >>>>>>> Polybench/linear-algebra/kernels/3mm/3mm. I've attached a<br>

>> >>>>>>> patch<br>

>> >>>>>>> which<br>

>> >>>>>>> makes tables sortable so that it is easier to investigate.<br>

>> >>>>>><br>

>> >>>>>><br>

>> >>>>>> If you feel that there is a test or tests that have too large<br>

>> >>>>>> of a<br>

>> >>>>>> variance for useful benchmarking, please compose a list,<br>

>> >>>>>> explain<br>

>> >>>>>> your criteria, and we'll merge in some useful way.<br>

>> >>>>><br>

>> >>>>><br>

>> >>>>> Mainly Polybench/linear-algebra, but I can't give you the list<br>

>> >>>>> right<br>

>> >>>>> now as the LLVM LNT site is down again.<br>

>> >>><br>

>> >>><br>

>> >>> These 5 tests have really large MAD on various testing machines,<br>

>> >>> even<br>

>> >>> with perf tools. Please add them to the exclusion list.<br>

>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm<br>

>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm<br>

>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm<br>

>> >>> SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm<br>

>> >><br>

>> >><br>

>> >> This is interesting. Those benchmarks should in fact give reliable<br>

>> >> performance numbers (and they do so when I execute them). I just<br>

>> >> very<br>

>> >> briefly looked into this and my observation was that, if I pipe<br>

>> >> the output<br>

>> >> to a file or /dev/null, the gemm performance is always at the<br>

>> >> lower bound.<br>

>> >> Only if I run 'timeit' I see these spikes. I see similar spikes if<br>

>> >> I just<br>

>> >> print the output to the console.<br>

>> >><br>

>> >> It would be great if we could understand where those spikes come<br>

>> >> from.<br>

>> >><br>

>> >> Tobias<br>

>><br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

_______________________________________________<br>

llvm-commits mailing list<br>

<a href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a><br>

</div></div></blockquote></div><br></div>
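For reference, the MAD-based criterion discussed in the quoted thread could be sketched roughly as below. This is only an illustration, not the script anyone in the thread actually used: the function names, the sample data, and the 5% relative-MAD threshold are all assumptions made for the example.

```python
from statistics import median


def mad(samples):
    """Median absolute deviation of a list of timings."""
    centre = median(samples)
    return median([abs(x - centre) for x in samples])


def noisy_tests(timings, threshold=0.05):
    """Return the names of tests whose MAD relative to the median
    exceeds `threshold`. The 5% default is an arbitrary assumption."""
    flagged = []
    for name, samples in timings.items():
        centre = median(samples)
        if centre > 0 and mad(samples) / centre > threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical execution times in seconds, 5 samples per test.
    example = {
        "stable_test": [1.00, 1.01, 0.99, 1.00, 1.02],
        "spiky_test": [1.0, 1.5, 0.7, 1.9, 1.0],
    }
    print(noisy_tests(example))
```

MAD is used here rather than the standard deviation because a single spike moves it far less, so a test is flagged only when the spread is persistent across samples.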