[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short

Fri Feb 3 20:21:49 PST 2012

On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> Hi Hal,
> 
> this is one of the first test cases for which I would love to have
> improved vectorizer support. I sent it out earlier, but I think now is
> a good time to look into it again, since the vectorizer has been
> committed.
> 
> The basic example is a set of scalar loads that load four consecutive
> elements and store them right back. To me this is an obvious case
> where vectorization is beneficial (scalar.ll):
> 
> define i32 @main() nounwind {
>   %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0),
>     align 16
>   %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1),
>     align 4
>   %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2),
>     align 8
>   %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3),
>     align 4
>   store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0),
>     align 16
>   store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1),
>     align 4
>   store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2),
>     align 8
>   store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3),
>     align 4
>   ret i32 0
> }
> 
> opt -O3 -vectorize cannot vectorize this by default, as the
> req-chain is too short.
> 
> Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:
> 
> define i32 @main() nounwind {
>   %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*),
>     align 16
>   store <4 x float> %V1,
>     <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
>   ret i32 0
> }
> 
> Is there any way we can make this case work by default? Maybe we can
> decrease the req-chain to 2 and increase the cost for non-stride-one
> loads or stores?

Try it now (after r149761). If this "solution" causes other problems,
then we may need to think of something more sophisticated.
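
To make the chain-depth issue concrete, here is a toy model in Python.
It is only a sketch and not the BBVectorizer code: it assumes each
candidate pair is a node, that a pair's "depth" is the length of the
longest chain of pairs using its result, and that a pair is accepted
only when that depth reaches a threshold. The names chain_depth,
is_vectorizable and the pair labels are hypothetical, and the value 6
is just a stand-in for "some threshold larger than 2".

```python
# Toy model only: this is NOT how BBVectorizer is implemented.
# `users` maps a candidate pair to the candidate pairs that use its result.

def chain_depth(pair, users):
    """Length of the longest use chain starting at `pair`, counting `pair`."""
    return 1 + max((chain_depth(u, users) for u in users.get(pair, [])),
                   default=0)

def is_vectorizable(pair, users, required_depth):
    """Accept a pair only if its chain is at least `required_depth` long."""
    return chain_depth(pair, users) >= required_depth

# Candidate pairs from scalar.ll: each paired load feeds one paired store.
users = {
    "load(V1,V2)": ["store(V1,V2)"],
    "load(V3,V4)": ["store(V3,V4)"],
}

# 6 is a hypothetical large threshold; 2 is the value from the report.
for required in (6, 2):
    ok = is_vectorizable("load(V1,V2)", users, required)
    print(f"required depth {required}: vectorize = {ok}")
```

In this model a load pair followed by a store pair forms a chain of
length 2, so it is rejected by any larger threshold and accepted once
the threshold is lowered to 2, which matches the behaviour in the
report.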

 -Hal

> 
> Another, probably unrelated, point: I also tried a run with
> -bb-vectorize-req-chain-depth=1. The generated code is full of
> shufflevector instructions and eight-element vectors. To me this is
> entirely unexpected. Do you have any idea what is going on here?
> 
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory