[PATCH] Unrolling improvements (target indep. and for x86)

Mon Feb 24 03:17:49 PST 2014

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Chandler Carruth" <chandlerc at google.com>, "llvm-commits" <llvm-commits at cs.uiuc.edu>, "Diego Novillo"
> <dnovillo at google.com>
> Sent: Sunday, February 23, 2014 6:57:41 PM
> Subject: Re: [PATCH] Unrolling improvements (target indep. and for x86)
> 
> 
> 
> 
> On Feb 21, 2014, at 11:19 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> 
> Currently, the vectorizer unrolls only for ILP (or latency hiding),
> subject to the register pressure estimate, and having it also unroll
> for size would be a change to the current behavior. Is that your
> desire?
> 
> 
> The loop vectorizer also has a basic size threshold to prevent cases
> where the loop body is too big to fit in the uOp cache.

Right. Currently, the vectorizer compares the loop cost to 'SmallLoopCost' to determine the unrolling factor. Having now looked at this in some detail, the issues are:

 1. SmallLoopCost is compared against the throughput cost, not the size cost (from TTI.getUserCost and friends), and the throughput cost tends to over-estimate the number of uops by ~50%. This comes primarily from a failure to account properly for addressing modes, and from the fact that floating-point throughput is half of integer throughput (so floating-point arithmetic is given twice the throughput cost). As a result, we tend not to unroll as much as we could.

 2. In order to unroll based on SmallLoopCost, two preconditions need to be satisfied: (a) the loop may not have reductions, and (b) we're not doing load/store runtime unrolling. For both of these, if there is still room, concatenation unrolling seems to help (I think the TSVC benchmarks show this for the reduction case, and I'm guessing the same holds for the second). Whether more interleaved unrolling would also help, I'm not sure.
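
To make issue 1 concrete, here is a minimal C++ sketch of a threshold-based unroll-factor choice. It is not the vectorizer's actual code: the function name pickUnrollFactor, the constant kSmallLoopCost, and its value of 20 are assumptions chosen only for illustration, and the sketch models only the small-loop path (not unrolling for ILP).

```cpp
#include <algorithm>

// Hypothetical threshold for what counts as a "small" loop. The value is an
// assumption for this sketch, not taken from the vectorizer.
constexpr unsigned kSmallLoopCost = 20;

// Pick an unroll factor for a loop whose per-iteration cost estimate is
// loopCost, capped by registerLimitedUF (the limit a register-pressure
// estimate would impose). Loops at or above the threshold are not unrolled
// by this path.
unsigned pickUnrollFactor(unsigned loopCost, unsigned registerLimitedUF) {
  if (loopCost == 0 || loopCost >= kSmallLoopCost)
    return 1; // also guards against division by zero
  // Unroll until the unrolled body roughly fills the threshold.
  unsigned byCost = kSmallLoopCost / loopCost;
  return std::max(1u, std::min(byCost, registerLimitedUF));
}
```

Under these assumptions, a loop whose cost estimate is 4 gets a factor of 5 (with a register limit of 8), while the same loop with its cost over-estimated by 50% (6 instead of 4) gets only 3, which is the kind of lost unrolling described in issue 1.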

To Chandler's point that we should use the same cost threshold for both: I agree that sounds like a good idea. It will take some work, however, because of (1): in order to use TTI.getUserCost, we must already have the vector instructions (and their associated addressing instructions), and at the time when the unrolling factor is selected in the vectorizer, we've not yet formed these instructions. I'm not sure how best to handle this.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory