[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Arnold Schwaighofer aschwaighofer at apple.com
Sat Feb 1 19:16:26 PST 2014


Hi Chandler,

Thanks for benchmarking this. I have tested on our ARM targets and confirmed that we should enable the new register heuristic, the load/store heuristic, and the conditional predication of stores (-enable-loadstore-runtime-unroll=1 -vectorize-num-stores-pred=1 -enable-ind-var-reg-heur=1).
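
For anyone who wants to try these on their own targets: they are internal LLVM options, so through the clang driver each one has to be passed behind -mllvm, roughly like this (an illustrative invocation only; the input file is a placeholder and you would add your usual target flags):

    clang -O3 -c quantum.c \
      -mllvm -enable-loadstore-runtime-unroll=1 \
      -mllvm -vectorize-num-stores-pred=1 \
      -mllvm -enable-ind-var-reg-heur=1

When driving opt directly, the same option names are passed without the -mllvm prefix.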

I have not measured the impact of turning on vectorization of predicated stores (-enable-cond-stores-vec) on our ARM targets. The cost model does not account for the scalarization when we “vectorize” the stores, so we could easily make things worse. I would rather leave this disabled for now.
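
To make the concern concrete, the loops in question look roughly like this (a simplified sketch in the spirit of libquantum’s hot loops, not the actual source):

    /* Only some iterations store, depending on a per-element test. */
    void flip_states(unsigned long long *state, unsigned long long mask,
                     unsigned long long flip, int n) {
      for (int i = 0; i < n; i++) {
        if (state[i] & mask)   /* conditional (predicated) store */
          state[i] ^= flip;
      }
    }

If we vectorize this and keep the conditional store, the store cannot be emitted as a plain vector store; it gets scalarized, i.e. each lane is extracted and written behind its own branch. That per-lane overhead is exactly what the cost model does not charge for today.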

I am going to enable the above-mentioned options by default.

Thanks,
Arnold


On Feb 1, 2014, at 4:02 AM, Chandler Carruth <chandlerc at google.com> wrote:

> On Fri, Jan 31, 2014 at 1:28 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Hey Arnold,
> 
> I've completed some pretty thorough benchmarking and wanted to share the results.
> 
> On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> Furthermore, I added a heuristic to unroll until load/store ports are saturated (“-mllvm -enable-loadstore-runtime-unroll”) instead of the purely size-based heuristic.
> 
> With those two, together with a patch that slightly changes the register heuristic, libquantum’s three hot loops will unroll and goodness will ensue (at least for libquantum).
> 
> Both the load/store runtime unrolling and the register heuristic (enabled with -enable-ind-var-reg-heur) show no interesting regressions (way below the noise) and a few nice benefits across all of the applications I measure. I'd support enabling them right away and getting more feedback from others. I've measured on both westmere and sandybridge, with -march=x86-64 and -march=corei7-avx.
> 
> I've now also measured -vectorize-num-stores-pred={1,2,4}, both with and without -enable-cond-stores-vec.
> 
> There are some crashers when using these currently. I may get a chance to reduce them soon, but I may not. However, enough built and ran that I can give some rough numbers on our end. With all permutations of these options I see a small improvement on a wide range of benchmarks running on westmere (-march pinned at essentially SSE3). I can't measure any real change between 1, 2, and 4; it's lost in the noise. But all are a definite improvement. The improvement is smaller on sandybridge for me, but still there, and still consistent across 1, 2, and 4. No binary size impact of note (under 0.01% for *everything* discussed here).
> 
> When I target -march=corei7-avx, I get no real performance change from these flags. No regressions, no improvements. And still no code size changes.
> 
> Note that for this last round, I started with the baseline of -enable-ind-var-reg-heur and -enable-loadstore-runtime-unroll, and added the -vectorize-num-stores-pred and -enable-cond-stores-vec to them.
> 
> So unless you (or others) chime in with worrisome evidence, I think we should probably turn all four of these on, with whatever value for -vectorize-num-stores-pred looks good in your benchmarking.




