[PATCH] Unrolling improvements (target indep. and for x86)

Nadav Rotem nrotem at apple.com
Sun Feb 23 21:08:31 PST 2014


On Feb 21, 2014, at 11:05 PM, Chandler Carruth <chandlerc at google.com> wrote:

> 
> On Fri, Feb 21, 2014 at 10:45 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> Chandler pointed out to me last week that recent x86 cores can also benefit from partial unrolling because of how their uop buffers and loop-stream detectors work (both Intel and AMD chips are similar in this regard).
> 
> I just want to add a specific point of realization that occurred to me when we were discussing this, and influenced my feeling that we should look into using the partial unroller *in addition* to the loop vectorizer's unrolling.
> 
> The latter is, rightfully, about widening the loop. It exposes ILP and other benefits. It is *not*, however, suited to one thing for which it is currently being used: unrolling *purely* to hide the branch cost and/or properly fill the LSD or uop cache. For these purposes, restricting the unrolling to that which can be done in an *interleaved* fashion isn't always reasonable. Instead, we should also support doing this through concatenation.
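A tiny C sketch of that distinction (my own toy example, not from the patch): a Horner-style recurrence carries a cross-iteration dependence, so its iterations cannot be interleaved, yet concatenating two bodies per trip still halves the number of back-edge branches:

```c
#include <assert.h>

/* Horner evaluation: s = s * x + c[i]. Each iteration depends on the
 * previous one, so interleaved (widening) unrolling does not apply. */
static double horner(const double *c, int n, double x) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s = s * x + c[i];
    return s;
}

/* Concatenated unrolling by 2: the same body pasted back to back.
 * The dependence chain is untouched; only the branch count drops. */
static double horner_concat2(const double *c, int n, double x) {
    double s = 0.0;
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s = s * x + c[i];       /* iteration i     */
        s = s * x + c[i + 1];   /* iteration i + 1 */
    }
    for (; i < n; i++)          /* remainder iterations */
        s = s * x + c[i];
    return s;
}
```

Both versions compute the same value; only the control-flow overhead differs.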

The LSD was a mechanism that Core 2 had for caching decoded x86 instructions. It was a power-saving feature, since the decoder was already fast enough. Starting with Sandy Bridge there is a single large cache (about 1.5K uops, I think) for decoded uops. But I imagine that some other processors have LSD-like mechanisms, so I suggest that we still support unroll thresholds.
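To illustrate what such a threshold could look like, here is a hypothetical sizing heuristic (the names and numbers are mine, not LLVM's): estimate the loop body's size in uops and cap the unroll count so the unrolled body still fits whatever buffer the target reports:

```c
/* Hypothetical sketch: cap the unroll count so the unrolled body fits
 * in a target-reported buffer (an LSD- or uop-cache-sized budget).
 * Parameter names and values are illustrative, not LLVM's API. */
static unsigned pick_unroll_count(unsigned body_uops,
                                  unsigned buffer_uops,
                                  unsigned max_count) {
    if (body_uops == 0 || body_uops > buffer_uops)
        return 1;                 /* body alone overflows: don't unroll */
    unsigned count = buffer_uops / body_uops;
    return count < max_count ? count : max_count;
}
```

A target with no LSD-like structure would simply report a small (or zero) buffer and disable the transformation.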
 
> 
> My general feeling is that we should essentially use the same size-upper-bound metric in both the vectorizer's unroller and this one, and unroll through interleaving as much as we can (subject to the independence of the iterations), and then continue unrolling with concatenation until we saturate whatever buffer size the target wants.
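One way to read that policy as code (purely a sketch; the budget split and names are my assumptions): interleave up to the dependence-limited width, then spend the remaining size budget on concatenation:

```c
/* Sketch of the combined policy: first widen by interleaving, up to the
 * width that iteration independence allows, then concatenate copies of
 * the widened body until the target's size budget is saturated. */
static void plan_unroll(unsigned body_uops, unsigned buffer_uops,
                        unsigned max_interleave, /* dependence-limited */
                        unsigned *interleave, unsigned *concat) {
    unsigned budget = (body_uops && body_uops <= buffer_uops)
                          ? buffer_uops / body_uops
                          : 1;               /* total copies we can afford */
    *interleave = budget < max_interleave ? budget : max_interleave;
    if (*interleave == 0)
        *interleave = 1;
    *concat = budget / *interleave;          /* leftover budget -> concat */
    if (*concat == 0)
        *concat = 1;
}
```

The invariant is that `interleave * concat * body_uops` never exceeds the buffer size (except in the degenerate 1x1 case), matching the single size-upper-bound metric proposed above.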
> 
> That make sense to folks?

On out-of-order processors, conditional branch instructions are 'executed' just to verify that the frontend predicted correctly. This happens independently of the arithmetic calculations and is usually not the bottleneck. If I remember correctly, Haswell can execute two branches per cycle.

The loop vectorizer has (had?) a heuristic for unrolling loops to reduce branch overhead. This only works for data-parallel loops, not for loops with cross-iteration dependencies. Considering the capabilities of modern processors, I am not sure how important it is to unroll loops with multiple basic blocks (that can't be if-converted) or non-data-parallel loops.
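For concreteness, here is the kind of distinction I mean (my own toy examples): a body whose conditional reduces to a select can be if-converted into a single block, while a body with an early exit cannot:

```c
/* If-convertible body: the conditional reduces to a select (a cmov on
 * x86), leaving a single-block loop that an unroller can interleave. */
static int max_elem(const int *a, int n) {
    int m = a[0];
    for (int i = 1; i < n; i++)
        m = a[i] > m ? a[i] : m;   /* data flow only, no control flow */
    return m;
}

/* Not if-convertible: the early return keeps a second exit block, so
 * the loop stays multi-block and interleaved unrolling does not apply. */
static int find_first(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key)
            return i;              /* side exit */
    return -1;
}
```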

Do you have any loops in mind?