[PATCH] Unrolling improvements (target indep. and for x86)

Hal Finkel hfinkel at anl.gov
Fri Feb 21 22:45:45 PST 2014


Chandler, Nadav, et al.,

I've attached a set of three patches related to unrolling. The first patch (which I've had sitting around for a while) adjusts the generic loop unroller and how it is used in the standard optimization pipeline. Specifically, the existing invocation is replaced with one that *only* does full unrolling, and a second invocation is scheduled after loop vectorization that does not only full unrolling but also partial unrolling at the target's discretion. This has the direct benefit, on all targets, of allowing full unrolling of loops that, because of vectorization, have become small enough to be fully unrolled.
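To make the placement concrete, here's a rough sketch of where the two invocations would sit, using the legacy pass manager and assuming the 2014-era createLoopUnrollPass(Threshold, Count, AllowPartial, Runtime) signature; the arguments below are illustrative and not taken verbatim from the patch:

  #include "llvm/IR/LegacyPassManager.h"
  #include "llvm/Transforms/Scalar.h"
  #include "llvm/Transforms/Vectorize.h"

  // Sketch only: an early unroller restricted to full unrolling, and a late
  // one (after the vectorizer) that may also do partial/runtime unrolling
  // when the target's TTI hook asks for it.
  static void addUnrollingPasses(llvm::legacy::PassManagerBase &MPM) {
    // Early invocation: full unrolling only.
    MPM.add(llvm::createLoopUnrollPass(/*Threshold=*/-1, /*Count=*/-1,
                                       /*AllowPartial=*/0, /*Runtime=*/0));

    // ... the rest of the mid-level scalar pipeline ...

    MPM.add(llvm::createLoopVectorizePass());

    // Late invocation: full unrolling plus partial/runtime unrolling at the
    // target's discretion (the defaults let TTI decide).
    MPM.add(llvm::createLoopUnrollPass());
  }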

Chandler pointed out to me last week that recent x86 cores can also benefit from partial unrolling because of how their uop buffers and loop-stream detectors work (both Intel and AMD chips are similar in this regard). The second patch provides an implementation of the relevant x86 TTI callback to enable partial/runtime unrolling to better fill the loop uop buffers. Currently, this is controlled by the -x86-use-partial-unrolling flag (not enabled by default). Because the size of these buffers is small (tens of uops), this really only affects loops with very small bodies. Chandler volunteered to benchmark this for me, so I'll leave that part to him (and others) ;) I did, however, run the TSVC benchmarks on a Penryn box and saw no significant regressions, along with the following significant speedups (these were run with the third patch also applied):

ControlLoops-dbl - 13% speedup
ControlLoops-flt - 15% speedup
Reductions-dbl - 7.5% speedup
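
For reference, the kind of unrolling-preferences hook the second patch is concerned with looks roughly like the following. This is a sketch, not the patch itself; the free-function form, the cl::opt variable, and the uop-buffer constant are all illustrative assumptions:

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // Hypothetical stand-in for the -x86-use-partial-unrolling flag.
  static cl::opt<bool> UsePartialUnrolling(
      "x86-use-partial-unrolling", cl::init(false),
      cl::desc("Enable partial/runtime unrolling to fill the loop uop buffer"));

  // Only very small loop bodies can benefit, because the loop buffer holds
  // just a few tens of uops.
  static void getX86UnrollingPreferences(
      TargetTransformInfo::UnrollingPreferences &UP) {
    if (!UsePartialUnrolling)
      return;

    const unsigned MaxLoopMicroOps = 28; // assumed buffer size; core-dependent

    UP.Partial = UP.Runtime = true; // allow partial and runtime unrolling
    UP.PartialThreshold = MaxLoopMicroOps;
    UP.PartialOptSizeThreshold = MaxLoopMicroOps;
  }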

The third patch (which has a test case that covers this change and the x86 unrolling logic) improves the way that CodeMetrics estimates the number of instructions in a block, specifically handling an add/or/xor (logically an addition) used by a GEP in such a way that it can be folded into the user's addressing mode. Recognizing this folding seems important for accurately predicting the size of small loops on x86, and thus when partial unrolling might help. This should also help with inlining (and I'll note that there is likely more useful work to be done in this area). With this change, these two instructions in the test case:
  %.sum9 = or i64 %index, 2
  %2 = getelementptr double* %b, i64 %.sum9
are both counted as free because the relevant call to isLegalAddressingMode returns true.
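
For illustration, the kind of check involved looks roughly like this (a sketch, not the third patch; the helper name, the operand handling, and the details of the addressing-mode query are assumptions):

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Hypothetical helper: an add/or/xor of a constant is "free" if every user
  // is a GEP whose addressing mode can absorb that constant as an offset.
  static bool foldsIntoUserAddressing(const Instruction &I,
                                      const TargetTransformInfo &TTI) {
    unsigned Op = I.getOpcode();
    if (Op != Instruction::Add && Op != Instruction::Or &&
        Op != Instruction::Xor)
      return false;

    const auto *Offset = dyn_cast<ConstantInt>(I.getOperand(1));
    if (!Offset)
      return false;

    for (const User *U : I.users()) {
      // Simplification: the raw constant is passed unscaled; a real check
      // would account for the GEP element size.
      const auto *GEP = dyn_cast<GetElementPtrInst>(U);
      if (!GEP ||
          !TTI.isLegalAddressingMode(GEP->getResultElementType(),
                                     /*BaseGV=*/nullptr,
                                     /*BaseOffset=*/Offset->getSExtValue(),
                                     /*HasBaseReg=*/true, /*Scale=*/1))
        return false;
    }
    return true;
  }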

Please review.

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gep-add-cost.patch
Type: text/x-patch
Size: 7012 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140222/680d091d/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: x86-partial-unrolling.patch
Type: text/x-patch
Size: 4431 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140222/680d091d/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: complex-unrolling-late.patch
Type: text/x-patch
Size: 4221 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140222/680d091d/attachment-0002.bin>

