[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Chandler Carruth chandlerc at google.com
Sat Feb 1 04:02:47 PST 2014

On Fri, Jan 31, 2014 at 1:28 PM, Chandler Carruth <chandlerc at google.com> wrote:

> Hey Arnold,
> I've completed some pretty thorough benchmarking and wanted to share the
> results.
> On Mon, Jan 27, 2014 at 5:22 PM, Arnold Schwaighofer <
> aschwaighofer at apple.com> wrote:
>> Furthermore, I added a heuristic to unroll until load/store ports are
>> saturated “-mllvm enable-loadstore-runtime-unroll” instead of the pure size
>> based heuristic.
>> Those two together with a patch that slightly changes the register
>> heuristic and libquantum’s three hot loops will unroll and goodness will
>> ensue (at least for libquantum).
> Both enabling loadstore runtime unrolling and the register heuristic
> (enabled with -enable-ind-var-reg-heur) show no interesting regressions
> (way below the noise) and a few nice benefits across all of the
> applications I measure. I'd support enabling them right away and getting
> more feedback from others. I've measured on both westmere and sandybridge,
> with -march=x86-64 and -march=corei7-avx.

I've now also measured -vectorize-num-stores-pred={1,2,4} both with and
without -enable-cond-stores-vec.

There are currently some crashers when using these flags. I may get a chance
to reduce them soon, but I may not. However, enough built and ran that I can
give some rough numbers on our end. With all permutations of these options
I see a small improvement on a wide range of benchmarks running on westmere
(-march essentially pinned at SSE3). I can't measure any real change between
1, 2, and 4; it's lost in the noise. But all are a definite improvement.
The improvement is smaller on sandybridge for me, but still there, and still
consistent across 1, 2, and 4. No binary size impact of note (under 0.01%
for *everything* discussed here).

When I target -march=corei7-avx, I get no real performance change for these
flags. No regressions, no improvements. And still no code size changes.

Note that for this last round, I started with the baseline of
-enable-ind-var-reg-heur and -enable-loadstore-runtime-unroll, and added
the -vectorize-num-stores-pred and -enable-cond-stores-vec to them.
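
For anyone who wants to reproduce the sweep, the flag combinations described
above can be enumerated with a short Python sketch. The helper name
flag_sets is hypothetical, the flags are assumed to be plain on/off switches
as written in this thread, and how each list is handed to the compiler (for
example via -mllvm) is left out because it depends on the build setup.

```python
from itertools import product

# Baseline flags used for this round, as named in this thread.
BASELINE = ["-enable-ind-var-reg-heur", "-enable-loadstore-runtime-unroll"]


def flag_sets():
    """Return one flag list per measured configuration (hypothetical helper).

    Covers -vectorize-num-stores-pred in {1, 2, 4}, each both without and
    with -enable-cond-stores-vec, layered on top of the baseline flags.
    """
    configs = []
    for num_stores, cond_stores in product((1, 2, 4), (False, True)):
        flags = BASELINE + [f"-vectorize-num-stores-pred={num_stores}"]
        if cond_stores:
            flags.append("-enable-cond-stores-vec")
        configs.append(flags)
    return configs


if __name__ == "__main__":
    for flags in flag_sets():
        print(" ".join(flags))
```

That gives six configurations in total, each compared against the two-flag
baseline.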

So unless you (or others) chime in with worrisome evidence, I think we
should probably turn all four of these on, with whatever value for
-vectorize-num-stores-pred looks good in your benchmarking.
