[PATCH] [LoopInterchange] Add support to interchange loops with reductions.

Mon Mar 23 02:00:04 PDT 2015

On 03/21/2015 04:25 PM, Renato Golin wrote:
> In http://reviews.llvm.org/D8314#141802, @karthikthecool wrote:
>
>> Refactor some common code into functions. I have currently borrowed and modified some functions from the loop vectorizer. Do I need to refactor them into a common utility as well? These functions, such as AddReductionVar, seem to be rather tightly bound to the loop vectorizer code.
>
>
> Yes, they are, and I can see what the problem is. But there is a lot of duplication added by this patch and I'm still uncomfortable. I've added Nadav and Arnold, our loop vectorizer experts, to assist on what to do next.
>
> I strongly advise against duplication, and the only option I can think of is to spot the pattern while creating the reduction variable. You can create a function to iterate over all containing loops and inspect all the ranges to make sure they match your pattern. It should exit early if the loop nest is not deep enough, or if the outer loops don't iterate through any of the affected induction variables in your reduction.
>
>> The second change is in PassManagerBuilder. Running SimplifyCFGPass after LoopInterchange is sufficient to merge and remove the redundant basic blocks (blocks with just an unconditional branch) produced by loop interchange. Updated the code to reflect this.
>
>
> This is good news. Means that the pass is a lot less dramatic than you anticipated. :) This gives me hope that doing this inside the loop vectorizer can be managed.
>
>> I ran a few Phoronix benchmarks and LNT benchmarks but unfortunately didn't see any improvement or regression due to this patch.
>
>
> I'd say "fortunately", since you haven't introduced any regressions, and that's a great thing!
>
>> As mentioned in previous comments, after this change code such as
>>
>>    void matrixMult(int N, int M, int K) {
>>      for(int i=0;i<N;i++)
>>        for(int j=0;j<M;j++)
>>          for(int k=0;k<K;k++)
>>            A[i][j]+=B[i][k]*C[k][j];
>>    }
>>
>> gets vectorized, giving some execution time improvement for large matrix multiplications.
>
>
> It seems we don't have that kind of benchmark on our test suite, and it would be good to have one. I don't know one off the top of my head, but maybe Hal/Nadav/Arnold could help.

This is 
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm.c.

However, the largest speedups we see with Polly come from outer loop
vectorization in combination with cache tiling. (We would also need
register tiling to get anywhere close to optimal performance.)
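
To make the interchange and the cache tiling concrete, here is a rough
hand-written sketch in C. It is not output from LoopInterchange or Polly;
the function names (gemm_ijk, gemm_ikj, gemm_tiled), the flat row-major
int arrays, and the tile parameter are all assumptions made for
illustration. It shows neither register tiling nor explicit vector code.

```c
#include <assert.h>
#include <stddef.h>

/* Original i-j-k order from the quoted example. The innermost loop reads
 * C[k][j] with stride m and reduces into the single element A[i][j]. */
void gemm_ijk(size_t n, size_t m, size_t kk,
              int *A, const int *B, const int *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            for (size_t k = 0; k < kk; k++)
                A[i * m + j] += B[i * kk + k] * C[k * m + j];
}

/* j and k interchanged (i-k-j order). The innermost loop now walks
 * A[i][j] and C[k][j] with unit stride and B[i][k] is loop-invariant,
 * which is the shape a loop vectorizer can handle directly. */
void gemm_ikj(size_t n, size_t m, size_t kk,
              int *A, const int *B, const int *C) {
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < kk; k++) {
            int b = B[i * kk + k];
            for (size_t j = 0; j < m; j++)
                A[i * m + j] += b * C[k * m + j];
        }
}

/* Same i-k-j order, with k and j split into tiles of size `tile`
 * (hypothetical parameter) so that a tile of C is reused across all i
 * before moving on. */
void gemm_tiled(size_t n, size_t m, size_t kk, size_t tile,
                int *A, const int *B, const int *C) {
    assert(tile > 0);
    for (size_t k0 = 0; k0 < kk; k0 += tile)
        for (size_t j0 = 0; j0 < m; j0 += tile) {
            /* Clamp the tile bounds at the matrix edges. */
            size_t k1 = (k0 + tile < kk) ? k0 + tile : kk;
            size_t j1 = (j0 + tile < m) ? j0 + tile : m;
            for (size_t i = 0; i < n; i++)
                for (size_t k = k0; k < k1; k++) {
                    int b = B[i * kk + k];
                    for (size_t j = j0; j < j1; j++)
                        A[i * m + j] += b * C[k * m + j];
                }
        }
}
```

All three variants accumulate the same products into A, only in a
different order; with integer arithmetic the results are identical, which
is what makes the reduction safe to reorder here.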

Cheers,
Tobias