I should note here that although SPEC provided us with a sufficiently
large sample for our spill-count experiment, I don't think that SPEC has
enough hot functions with spills to make our execution-time results
statistically significant. That's because SPEC has many benchmarks with
peaky profiles, where one or two functions dominate the execution time.
So, if one heuristic gets very lucky (or unlucky) on a few hot
functions, it may get a deceivingly high (or low) score.
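To make that concrete, here is a rough sketch (not something from our study; the input format, percentages and function names below are invented) of how one could measure that peakiness from a flat profile:

    # Illustrative sketch only: estimate how "peaky" a benchmark's profile is
    # by summing the flat-profile percentages of its hottest functions.
    # Assumes the profile has already been reduced to "percent function-name"
    # lines, e.g. from a gprof/perf flat profile.

    def top_share(profile_lines, top_n=2):
        """Return the fraction of profiled time spent in the top_n functions."""
        percents = sorted(
            (float(line.split()[0]) for line in profile_lines if line.strip()),
            reverse=True,
        )
        total = sum(percents)
        return sum(percents[:top_n]) / total if total else 0.0

    # A benchmark where two functions take ~80% of the time is "peaky", so a
    # heuristic that happens to win or lose on those two functions decides
    # the whole score.
    example = ["58.3 hot_fn_a", "21.9 hot_fn_b", "4.1 warm_fn_c"]
    print(top_share(example))  # ~0.95 of the profiled time is in the top two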
That's why I think that if someone runs the same kind of test on a
different benchmark suite of comparable size, they may get different
execution-time results, but they will most likely get the same
spill-count results that we got (the relative results, of course).

-Ghassan

________________________________
From: Renato Golin <renato.golin@linaro.org>
To: Ghassan Shobaki <ghassan_shobaki@yahoo.com>
Cc: Andrew Trick <atrick@apple.com>; "llvmdev@cs.uiuc.edu" <llvmdev@cs.uiuc.edu>
Sent: Thursday, September 19, 2013 8:27 PM
Subject: Re: [LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
On 19 September 2013 17:25, Ghassan Shobaki <ghassan_shobaki@yahoo.com> wrote:

> Ghassan: You have made me so curious to try other benchmarks in our future work. Most academic publications on CPU performance, though, use SPEC. You can even find some recent publications that are still using SPEC CPU2000! When I was at AMD in 2009, performance optimization and benchmarking was all about SPEC CPU2006. Have things changed so much in the past 4 years?
Unfortunately, no. Most manufacturers still use SPEC (and others) to design, test and certify their products.

This is not a problem per se, as SPEC is very good and reasonably generic, but no single benchmark can cover the wide range of applications a CPU is likely to run over its life. So my grudge is that there isn't much effort put into understanding how to benchmark the different uses of a CPU; it's not a grudge against SPEC specifically. I think SPEC is a good match for your project.
> Ghassan: And the more important question is: what specific features do these non-SPEC benchmarks have that are likely to affect the scheduler's register-pressure reduction behavior?
No idea. ;) Mind you, I don't know of any decent benchmark that will give you the "general user" case, but there are a number of domain-specific benchmarks (browsers, codecs, databases and web servers all have benchmark modes).

Also, for your project, you're only interested in a very specific behaviour of a very specific part of the compiler (spills), so any benchmark will give you a way to test it, but every one will have some form of bias.

What I recommend is not to spend much time running a plethora of benchmarks, only to find out that they all tell you the same story, but to find one benchmark that is completely different from SPEC (say, Browsermark or the MySQL benchmark suite) and see if the spill correlation is similar.

If it is, ignore it. If not, just mention that this correlation may not be seen with other benchmarks. ;)
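Roughly, the check I have in mind is something like this (the heuristic names, the data layout and the spill counts are all made up, just to show the shape of the comparison):

    # Sketch only: for each suite, total the per-function spill counts of each
    # heuristic and express them relative to the best heuristic; then see
    # whether a non-SPEC suite gives the same relative ordering.

    def relative_spills(per_function_spills):
        """per_function_spills: {heuristic: [spill count per function]} ->
        {heuristic: total spills as a ratio of the best heuristic's total}."""
        totals = {h: sum(counts) for h, counts in per_function_spills.items()}
        best = min(totals.values()) or 1
        return {h: t / best for h, t in totals.items()}

    spec_like   = {"source": [12, 0, 7, 30], "heuristic_b": [10, 0, 6, 25]}
    other_suite = {"source": [3, 44, 9],     "heuristic_b": [2, 40, 8]}

    print(relative_spills(spec_like))    # e.g. {'source': 1.19..., 'heuristic_b': 1.0}
    print(relative_spills(other_suite))  # same ordering -> the finding likely generalises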
> Ghassan: Can you please give more specific features in these modern benchmarks that affect spill-code reduction? Note that our study included over ten thousand functions with spills. Such a large sample is expected to cover many different kinds of behavior, and that's why I am calling it a "statistically significant" sample.
I was being a bit pedantic in pointing out that 10K data points are only statistically relevant if they're independent, which they might not be if each individual test was created/crafted with the same intent in mind (similar function size, number of functions, number of temporaries, etc.).

Most programmers don't pay that much attention to good code and end up writing horrible code that stresses specific parts of the compiler. If you have access to the PlumHall suite, I encourage you to compile the chapter 7.22 test as an example.

Also, related to register pressure, different kinds of bad code will stress different algorithms, so you also have to be careful about stating that one algorithm is much better than the others based only on one badly written program.
> Ghassan: Sorry if I did not include a clear enough description of what the numbers mean. Let me explain more precisely:
>
> First of all, the "source" scheduler was indeed run for 9 iterations (which took about 2 days), and that was our baseline. All the numbers in the execution-time table are percentage differences relative to "source". Of course, there were random variations in the numbers, but we followed the standard SPEC practice of taking the median. For most benchmarks, the random variation was not significant.
I see, my mistake.
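Just to check that I now read the table correctly, the arithmetic would be something like this (the run times below are invented, and "misched" is just a stand-in for whichever scheduler is being compared against "source"):

    # Sketch of the arithmetic: take the median of the runs for each scheduler,
    # then report the percentage difference relative to the "source" baseline.
    from statistics import median

    def percent_vs_baseline(runs_by_sched, baseline="source"):
        meds = {s: median(times) for s, times in runs_by_sched.items()}
        base = meds[baseline]
        return {s: 100.0 * (m - base) / base for s, m in meds.items()}

    runs = {
        "source":  [412.0, 409.5, 410.8, 411.2, 410.1, 409.9, 410.5, 411.0, 410.3],
        "misched": [405.2, 406.1, 404.8, 405.5, 405.9, 405.0, 406.3, 404.9, 405.4],
    }
    print(percent_vs_baseline(runs))  # {'source': 0.0, 'misched': ~-1.24}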
> There was one particular benchmark, though (libquantum), on which we thought the random variation was too large to make a meaningful comparison, and therefore we decided to exclude it.
Quite amusing, having libquantum behave erratically. ;)

cheers,
--renato