[LLVMdev] [BBVectorizer] Obvious vectorization benefit, but req-chain is too short

Fri Feb 3 05:50:48 PST 2012

On Fri, 2012-02-03 at 10:28 +0100, Tobias Grosser wrote:
> Hi Hal,
> 
> this is one of the first test cases for which I would love to have
> improved vectorizer support. I sent it out earlier, but I think now is a
> good time to look into it again, since the vectorizer has been committed.
> 
> The basic example is a set of scalar loads that load four consecutive 
> elements and immediately store them back. For me this is an obvious case 
> where vectorization is beneficial (scalar.ll):
> 
> define i32 @main() nounwind {
>   %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0),
>         align 16
>   %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1),
>         align 4
>   %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2),
>         align 8
>   %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3),
>         align 4
>   store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0),
>         align 16
>   store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1),
>         align 4
>   store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2),
>         align 8
>   store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3),
>         align 4
>   ret i32 0
> }
> 
> opt -O3 -vectorize cannot vectorize this out of the box, as the 
> req-chain is too short.
> 
> Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:
> 
> define i32 @main() nounwind {
>   %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*),
>         align 16
>   store <4 x float> %V1,
>         <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*),
>         align 16
>   ret i32 0
> }
> 
> Is there any way we can make this case work by default? Maybe we can 
> decrease the req-chain to 2 and increase the cost for non-stride-one 
> loads or stores?

Making the default chain length 2 will lead to a lot of unprofitable
vectorization. I think we'll probably want to do something like making
getDepthFactor return 3 for loads and stores (or making the default chain
length 4 and having getDepthFactor return 2 for loads and stores). We
should experiment with this; it was already on my post-commit TODO
list.
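
To make the arithmetic concrete, here is a toy Python model of the idea
(a sketch only, not the pass's actual code). It assumes that a chain's
depth is the sum of a per-instruction depth factor and that a chain is
considered only when that sum reaches the required depth. The default
of 6 used below is an assumption, chosen because it is consistent with
the "return 3 for loads and stores" suggestion; all function names are
hypothetical.

```python
# Toy model only: the names and the default of 6 are illustrative
# assumptions, not BBVectorize's real implementation.

ASSUMED_DEFAULT_REQUIRED_DEPTH = 6


def uniform_factor(op):
    """Every instruction contributes one step to the chain depth."""
    return 1


def make_memory_weighted_factor(memory_weight):
    """Build a depth factor that counts loads and stores more heavily."""
    def factor(op):
        return memory_weight if op in ("load", "store") else 1
    return factor


def chain_depth(chain, depth_factor):
    """Sum the per-instruction depth factors along a dependency chain."""
    return sum(depth_factor(op) for op in chain)


def is_candidate(chain, depth_factor, required_depth):
    """A chain qualifies when its weighted depth reaches the requirement."""
    return chain_depth(chain, depth_factor) >= required_depth


load_store = ["load", "store"]   # the shape of the scalar.ll chains
arith_pair = ["fadd", "fmul"]    # some arbitrary two-instruction chain

# Uniform weights, required depth 6: the load/store chain is too short.
print(is_candidate(load_store, uniform_factor,
                   ASSUMED_DEFAULT_REQUIRED_DEPTH))        # False

# Lowering the requirement to 2 admits it, but also every other pair.
print(is_candidate(load_store, uniform_factor, 2))         # True
print(is_candidate(arith_pair, uniform_factor, 2))         # True

# Weighting memory operations admits only the load/store chain.
weight3 = make_memory_weighted_factor(3)
print(is_candidate(load_store, weight3, 6))                # True
print(is_candidate(arith_pair, weight3, 6))                # False

# The alternative: required depth 4 with a factor of 2.
weight2 = make_memory_weighted_factor(2)
print(is_candidate(load_store, weight2, 4))                # True
print(is_candidate(arith_pair, weight2, 4))                # False
```

The point of the sketch is only that raising the weight of loads and
stores lets a short load/store chain through without also admitting
every other two-instruction chain, which is what a global chain length
of 2 would do.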

> 
> Another, probably unrelated, point: I also tried a run with 
> -bb-vectorize-req-chain-depth=1. The generated code is full of 
> shufflevector instructions and eight-element vectors. For me this is 
> entirely unexpected. Do you have any idea what is going on here?

A chain length of 1 means "vectorize any pairs that you possibly can",
and it will do this iteratively until it cannot do it any more. As the
iteration continues it will pair the previously-paired instructions,
until the requested bit limit is reached, and so you'll end up with long
vectors (of short types).
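
As a rough illustration of that iteration, the following Python sketch
repeatedly pairs values until doubling the vector width would exceed a
bit limit. It is a simplified model under stated assumptions (all values
have the same element type, the count is a power of two, and every value
finds a partner in each round); the function name and parameters are
hypothetical and do not correspond to anything in the pass.

```python
# Toy model only: simulates repeated pairing up to a vector bit limit.

def pair_iteratively(element_bits, count, vector_bits):
    """Return [(number_of_values, lanes_per_value), ...] per round.

    Each round fuses the current values pairwise, which halves their
    number and doubles the lanes in each one. Pairing stops when there
    is nothing left to pair or when a doubled vector would be wider
    than vector_bits.
    """
    lanes = 1
    history = [(count, lanes)]
    while count >= 2 and 2 * lanes * element_bits <= vector_bits:
        count //= 2   # assumes count is a power of two
        lanes *= 2
        history.append((count, lanes))
    return history


# Eight 32-bit scalars with a 128-bit limit: two rounds of pairing.
print(pair_iteratively(32, 8, 128))  # [(8, 1), (4, 2), (2, 4)]

# The same scalars with a 256-bit limit: one more round.
print(pair_iteratively(32, 8, 256))  # [(8, 1), (4, 2), (2, 4), (1, 8)]
```

With a required chain length of 1 nothing but the bit limit stops this
process, so the narrower the element type, the more lanes the final
vectors have.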

Thanks again,
Hal

> 
> Tobi

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory