[PATCH] D59710: [SLP] remove lower limit for forming reduction patterns

Fri Nov 1 11:18:12 PDT 2019

ABataev added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/hadd.ll:82-86
+; CHECK-NEXT:    [[RDX_SHUF1:%.*]] = shufflevector <2 x i64> [[A:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
+; CHECK-NEXT:    [[BIN_RDX2:%.*]] = add <2 x i64> [[RDX_SHUF1]], [[A]]
+; CHECK-NEXT:    [[RDX_SHUF:%.*]] = shufflevector <2 x i64> [[B:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 undef>
+; CHECK-NEXT:    [[BIN_RDX:%.*]] = add <2 x i64> [[RDX_SHUF]], [[B]]
+; CHECK-NEXT:    [[R01:%.*]] = shufflevector <2 x i64> [[BIN_RDX2]], <2 x i64> [[BIN_RDX]], <2 x i32> <i32 0, i32 2>
----------------
spatel wrote:
> ABataev wrote:
> > This one is worse than it was before for SSE
> Here are the SSE alternatives:
> 
> Without SLP (original IR):
> 	
> ```
> movq	%xmm0, %rax
> pshufd	$78, %xmm0, %xmm0       ## xmm0 = xmm0[2,3,0,1]
> movq	%xmm0, %rcx
> addq	%rax, %rcx
> movq	%xmm1, %rax
> pshufd	$78, %xmm1, %xmm0       ## xmm0 = xmm1[2,3,0,1]
> movq	%xmm0, %rdx
> addq	%rax, %rdx
> movq	%rdx, %xmm1
> movq	%rcx, %xmm0
> punpcklqdq	%xmm1, %xmm0    ## xmm0 = xmm0[0],xmm1[0]
> 
> ```
> With SLP currently:
> 	
> ```
> movdqa	%xmm0, %xmm2
> punpcklqdq	%xmm1, %xmm2    ## xmm2 = xmm2[0],xmm1[0]
> punpckhqdq	%xmm1, %xmm0    ## xmm0 = xmm0[1],xmm1[1]
> paddq	%xmm2, %xmm0
> 
> ```
> With this SLP patch:
> 	
> ```
> pshufd	$78, %xmm0, %xmm2       ## xmm2 = xmm0[2,3,0,1]
> paddq	%xmm2, %xmm0
> pshufd	$78, %xmm1, %xmm2       ## xmm2 = xmm1[2,3,0,1]
> paddq	%xmm1, %xmm2
> punpcklqdq	%xmm2, %xmm0    ## xmm0 = xmm0[0],xmm2[0]
> 
> ```
> Ideally, we could get SLP to choose the shorter sequence (that is, bypass treating this as a reduction).
> 
> I don't think we can ask instcombine to create that sequence because it requires creating semi-arbitrary shuffle instructions.
> 
> Or we can view this as a backend opportunity to reduce a shuffle-binop-shuffle sequence:
>         t6: v2i64 = vector_shuffle<1,u> t2, undef:v2i64
>       t7: v2i64 = add t6, t2
>         t8: v2i64 = vector_shuffle<1,u> t4, undef:v2i64
>       t9: v2i64 = add t8, t4
>     t10: v2i64 = vector_shuffle<0,2> t7, t9
> 
Maybe try to reduce 2 elements only as a fallback, after the regular reduction did not work?
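
For reference, here is a minimal scalar model of the two vector sequences quoted above. It is only meant to show that they compute the same two sums, so that picking the shorter one is a cost question and not a correctness one. The names `V2`, `add`, `haddShort` and `haddReduction` are made up for this sketch and are not part of SLPVectorizer; the undef shuffle lane is modelled as 0 on the assumption that it is never read.

```cpp
#include <array>
#include <cstdint>

// Stand-in for <2 x i64>.
using V2 = std::array<std::uint64_t, 2>;

// Lane-wise add, modelling paddq / `add <2 x i64>`.
V2 add(const V2 &x, const V2 &y) { return {x[0] + y[0], x[1] + y[1]}; }

// "With SLP currently": punpcklqdq + punpckhqdq + paddq.
V2 haddShort(const V2 &a, const V2 &b) {
  V2 lo = {a[0], b[0]}; // low lanes of a and b
  V2 hi = {a[1], b[1]}; // high lanes of a and b
  return add(lo, hi);
}

// "With this SLP patch": two 2-element reductions, then shuffle <0,2>.
V2 haddReduction(const V2 &a, const V2 &b) {
  V2 shufA = {a[1], 0};      // shuffle <1,u>; lane 1 is undef, modelled as 0
  V2 rdxA = add(shufA, a);   // lane 0 holds a[0] + a[1]
  V2 shufB = {b[1], 0};      // shuffle <1,u> on b
  V2 rdxB = add(shufB, b);   // lane 0 holds b[0] + b[1]
  return {rdxA[0], rdxB[0]}; // shuffle <0,2> picks lane 0 of each
}
```

Both functions return {a[0] + a[1], b[0] + b[1]}; only lane 0 of each intermediate reduction is ever used, which is why the reduction form does redundant work here.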

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59710/new/

https://reviews.llvm.org/D59710