[PATCH] Unrolling improvements (target indep. and for x86)

Fri Feb 21 23:05:17 PST 2014

On Fri, Feb 21, 2014 at 10:45 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> Chandler pointed out to me last week that recent x86 cores can also
> benefit from partial unrolling because of how their uop buffers and
> loop-stream detectors work (both Intel and AMD chips are similar in this
> regard).

I just want to add a specific realization that occurred to me when we were
discussing this, and that influenced my feeling that we should look into
using the partial unroller *in addition* to the loop vectorizer's unrolling.

The latter is, rightfully, about widening the loop. It exposes ILP and
other benefits. It is *not*, however, suitable for one thing which it is
currently being used for: unrolling *purely* to hide the branch cost and/or
properly fill the LSD or uop cache. For these purposes, restricting the
unrolling to that which can be done in an *interleaved* fashion isn't
always reasonable. Instead, we should also support doing this through
concatenation.
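
To make the two shapes concrete, here is a minimal hand-written C sketch. It
is not taken from any patch; the function names and loop bodies are made up
for illustration, and the trip count is assumed to be even so no remainder
loop is needed. The first loop has independent iterations and can be
interleaved; the second has a loop-carried dependence, so the copies of the
body can only be placed one after another.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: the two copies of the body are interleaved,
   so both loads come first, then both computations, then both stores. */
void scale_interleaved(int *out, const int *in, size_t n) {
    assert(n % 2 == 0); /* illustration only: no remainder loop */
    for (size_t i = 0; i < n; i += 2) {
        int x0 = in[i];
        int x1 = in[i + 1];
        int y0 = x0 * 3 + 1;
        int y1 = x1 * 3 + 1;
        out[i] = y0;
        out[i + 1] = y1;
    }
}

/* Loop-carried dependence on h: each step needs the previous result,
   so the copies cannot be interleaved. Concatenating them still halves
   the number of back-edge branches and grows the loop body. */
unsigned hash_concatenated(const unsigned char *s, size_t n) {
    assert(n % 2 == 0); /* illustration only: no remainder loop */
    unsigned h = 0;
    for (size_t i = 0; i < n; i += 2) {
        h = h * 31u + s[i];     /* first copy of the body */
        h = h * 31u + s[i + 1]; /* second copy of the body */
    }
    return h;
}
```

A compiler is of course free to reschedule either body; the sketch only
shows the two source-level shapes.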

My general feeling is that we should essentially use the same
size-upper-bound metric in both the vectorizer's unroller and this one, and
unroll through interleaving as much as we can (subject to the independence
of the iterations), and then continue unrolling with concatenation until
we saturate whatever buffer size the target wants.
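
As a rough sketch of that policy (an assumption about how it could be
expressed, not existing LLVM code): given a size for one iteration, the
largest interleave factor the dependences allow, and the target's size upper
bound, interleave first and spend the remaining budget on concatenation. The
names `plan_unroll`, `body_size`, `max_independent`, and `buffer_size` are
hypothetical, and the units are whatever the target counts in.

```c
#include <assert.h>

struct unroll_plan {
    unsigned interleave; /* copies merged in interleaved fashion */
    unsigned concat;     /* times the interleaved body is repeated */
};

/* body_size:       size of one original iteration (hypothetical unit)
   max_independent: largest interleave factor dependences allow (1 = none)
   buffer_size:     size upper bound the target wants to fill */
struct unroll_plan plan_unroll(unsigned body_size,
                               unsigned max_independent,
                               unsigned buffer_size) {
    assert(body_size > 0 && max_independent > 0);
    struct unroll_plan p = {1, 1};
    unsigned total = buffer_size / body_size; /* copies that fit */
    if (total < 1)
        return p; /* body already exceeds the bound: do not unroll */
    /* Interleave as much as independence and the budget allow... */
    p.interleave = total < max_independent ? total : max_independent;
    /* ...then keep unrolling by concatenation within the same bound. */
    p.concat = total / p.interleave;
    return p;
}
```

For example, with a made-up body size of 4, an interleave limit of 2, and a
bound of 28, this picks an interleave factor of 2 and concatenates the result
3 times, for 6 copies of the body in total.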

Does that make sense to folks?