[LLVMdev] Experimental Evaluation of the Schedulers in LLVM 3.3
renato.golin at linaro.org
Thu Sep 19 07:30:16 PDT 2013
On 17 September 2013 19:04, Ghassan Shobaki <ghassan_shobaki at yahoo.com> wrote:
> We have done some experimental evaluation of the different schedulers in
> LLVM 3.3 (source, BURR, ILP, fast, MI). The evaluation was done on x86-64
> using SPEC CPU2006. We have measured both the amount of spill code as well
> as the execution time as detailed below.
This is an amazing piece of work, thanks for doing this. We need more
benchmarks like yours, and more often, too.
> 3. The source scheduler is the second best scheduler in terms of spill code
> and execution time, and its performance is very close to that of BURR in
> both metrics. This result is surprising for me, because, as far as I
> understand, this scheduler is a conservative scheduler that tries to
> preserve the original program order, isn't it? Does this result surprise
Well, SPEC is an old benchmark, from a time when code was written to
accommodate the hardware, so preserving the code order might not matter as
much on SPEC as it does on other types of code. So far, I haven't found
SPEC being too good to judge overall compilers' performance, but specific
Besides, hardware and software are designed nowadays based on some version
of Dhrystone, EEMBC, SPEC or CoreMark, so it's not impossible to see a 50%
increase in performance with small changes in either.
> 4. The ILP scheduler has the worst execution times on FP2006 and the second
> worst spill counts, although it is the default on x86-64. Is this
> surprising? BTW, Dragon Egg sets the scheduler to source. On Line 368 in
> Backend.cpp, we find:
> if (!flag_schedule_insns)
This looks like someone ran a similar test and did the sensible thing. How
that translates to Clang, or how important it is for it to be the default, I don't
know. This is the same discussion as the optimization levels, and what
passes should be included in what. It also depends on which scheduler will
evolve faster or further in time, and what kind of code you're compiling...
> This is not a perfectly accurate metric, but, given the large sample size
> (> 10K functions), the total number of spills across such a statistically
> significant sample is believed to give a very strong indication about each
> scheduler's performance at reducing register pressure.
I agree this is a good enough metric, but I'd be cautious in stating that
there is a "very strong indication about each scheduler's performance".
SPEC is, after all, a special case in the compiler/hardware world, and anything
you see here might not happen anywhere else.
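
For reference, the aggregate being discussed (total spills per scheduler,
summed over all compiled functions) could be computed along these lines. This
is only a sketch: the record layout, function names and counts below are
made-up placeholders, not data from the evaluation.

```python
from collections import defaultdict


def total_spills(records):
    """Sum per-function spill counts for each scheduler.

    records: iterable of (scheduler, function_name, spill_count) tuples.
    This layout is an assumption made for illustration.
    """
    totals = defaultdict(int)
    for scheduler, _function, spills in records:
        totals[scheduler] += spills
    return dict(totals)


# Hypothetical per-function spill counts (illustrative only).
records = [
    ("source", "foo", 3), ("source", "bar", 0),
    ("ilp", "foo", 5), ("ilp", "bar", 1),
]

if __name__ == "__main__":
    for scheduler, spills in sorted(total_spills(records).items()):
        print(scheduler, spills)
```

A single total like this hides which functions contribute the spills, so it
says little about whether the spills land in hot code.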
Real-world, modern code (such as the LAMP stack, browsers, office suites, etc.)
is written expecting the compiler to do magic, while old-school benchmarks
weren't, and they were optimized for decades by both compiler and hardware
> The %Diff Max (Min) is the maximum (minimum) percentage difference on a
> single benchmark between each scheduler and the source scheduler. These
> numbers show the differences on individual FP benchmarks can be quite
I'm surprised that you didn't run "source" 5/9 times, too. Did you get
exactly the same performance numbers across multiple runs? Would be good to have a more
realistic geo-mean for source as well, so we could estimate how much the
other geo-means vary in comparison to source's.
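
To make the suggestion concrete, here is a rough sketch of how the geo-mean
and the %Diff Max (Min) against source could be computed once every
scheduler, including source, has several runs per benchmark. Taking the median
of the runs, treating a higher score as better, and the `scores` layout are
all my assumptions, not necessarily what was done in the evaluation.

```python
import math
from statistics import median


def geomean(values):
    """Geometric mean of a sequence of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def run_spread_pct(runs):
    """Run-to-run spread of one benchmark: (max - min) / median, in percent."""
    return (max(runs) - min(runs)) / median(runs) * 100.0


def pct_diff_vs_baseline(scores, baseline="source"):
    """Compare each scheduler against the baseline scheduler.

    scores: {scheduler: {benchmark: [score_run1, score_run2, ...]}}
    Returns {scheduler: (geomean %diff, max %diff, min %diff)}.
    """
    med = {s: {b: median(r) for b, r in per.items()} for s, per in scores.items()}
    base = med[baseline]
    result = {}
    for sched, per in med.items():
        if sched == baseline:
            continue
        ratios = [per[b] / base[b] for b in base]
        result[sched] = (
            (geomean(ratios) - 1.0) * 100.0,
            (max(ratios) - 1.0) * 100.0,
            (min(ratios) - 1.0) * 100.0,
        )
    return result


# Made-up scores for two hypothetical benchmarks "a" and "b".
scores = {
    "source": {"a": [10.0, 10.1, 9.9], "b": [20.0, 20.0, 20.0]},
    "ilp": {"a": [9.0, 9.0, 9.0], "b": [22.0, 22.0, 22.0]},
}

if __name__ == "__main__":
    print(pct_diff_vs_baseline(scores))
    print(run_spread_pct(scores["source"]["a"]))
```

If the run-to-run spread of source is of the same order as a scheduler's
geo-mean difference, that difference is probably noise.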
> Most of the above performance differences have been correlated with
> significant changes in spill counts in hot functions.
Which is a beautiful correlation between spill-rate and performance,
showing that your metrics are at least reasonably accurate, for all
> We should probably report this as a performance bug if ILP stays the
> default scheduler on x86-64.
You should, regardless of what's the default choice.