[PATCH] D96405: [DAGCombiner] Improve reduceBuildVecToShuffle Performance

Thu Feb 11 03:22:47 PST 2021

RKSimon added inline comments.

================
Comment at: llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll:115
+; X86-NEXT:    vpsraw $8, %xmm1, %xmm1
+; X86-NEXT:    vpunpcklqdq {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]
+; X86-NEXT:    vmovdqu %ymm0, (%eax)
----------------
mmarjieh wrote:
> RKSimon wrote:
> > At a quick glance, this looks wrong. I'd expect this to still be the same vshufpd?
> I am not familiar with the X86 ISA.
> Can you explain why?
> Meanwhile, I will show you the difference in the DAG after my patch:
> 
> Before this patch:
> SelectionDAG has 28 nodes:
>   t0: ch = EntryToken
>   t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
>   t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
>       t21: ch = TokenFactor t5:1, t6:1
>               t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
>             t52: v2f64 = bitcast t36
>           t62: v4f64 = concat_vectors t52, undef:v2f64
>         t63: v4f64 = vector_shuffle<u,u,0,0> t62, undef:v4f64
>               t37: v8i16 = X86ISD::VTRUNC t5
>             t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
>           t44: v2f64 = bitcast t39
>               t40: v8i16 = X86ISD::VTRUNC t6
>             t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
>           t49: v2f64 = bitcast t41
>         t59: v4f64 = concat_vectors t44, t49
>       t68: v4f64 = vector_shuffle<4,6,2,3> t63, t59
>       t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
>     t57: ch = store<(store 32 into %ir.10, align 2)> t21, t68, t3, undef:i32
>   t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>
>   
>   
> After this patch:
> SelectionDAG has 26 nodes:
>   t0: ch = EntryToken
>   t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
>   t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
>       t21: ch = TokenFactor t5:1, t6:1
>               t37: v8i16 = X86ISD::VTRUNC t5
>             t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
>           t44: v2f64 = bitcast t39
>               t40: v8i16 = X86ISD::VTRUNC t6
>             t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
>           t49: v2f64 = bitcast t41
>         t59: v4f64 = concat_vectors t44, t49
>             t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
>           t52: v2f64 = bitcast t36
>         t62: v4f64 = concat_vectors t52, undef:v2f64
>       t64: v4f64 = vector_shuffle<0,2,4,4> t59, t62
>       t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
>     t57: ch = store<(store 32 into %ir.10, align 2)> t21, t64, t3, undef:i32
>   t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>
> 
My mistake - I missed that we were implicitly zeroing the upper elements (xmm -> ymm). Sorry about that.
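
For anyone following along, here is a rough sketch (not part of the patch) that models both DAGs and the new X86 sequence in Python. The helper `shuffle`, the `UNDEF` marker and the symbolic lane names `a0`/`a1`/`b0`/`b1` are hypothetical and exist only for this illustration. It assumes the usual vector_shuffle convention that mask indices below the vector width select from the first operand and the rest from the second, and that the 128-bit vpsraw write leaves the upper half of the ymm register zeroed.

```python
UNDEF = None  # stand-in for an undef lane or an undef ('u') mask entry


def shuffle(mask, lhs, rhs):
    """Model vector_shuffle: index < len(lhs) picks from lhs, otherwise from rhs."""
    n = len(lhs)
    out = []
    for m in mask:
        if m is UNDEF:
            out.append(UNDEF)
        elif m < n:
            out.append(lhs[m])
        else:
            out.append(rhs[m - n])
    return out


# Symbolic 64-bit lanes: t44 = [a0, a1], t49 = [b0, b1]
t59 = ["a0", "a1", "b0", "b1"]      # concat_vectors t44, t49
t62 = [0, UNDEF, UNDEF, UNDEF]      # concat_vectors t52, undef (lane 0 is all zero bits)

# Before the patch
t63 = shuffle([UNDEF, UNDEF, 0, 0], t62, [UNDEF] * 4)
t68 = shuffle([4, 6, 2, 3], t63, t59)

# After the patch
t64 = shuffle([0, 2, 4, 4], t59, t62)

# New X86 codegen: each ymm holds an xmm result with implicitly zeroed upper lanes
ymm0 = ["a0", "a1", 0, 0]
ymm1 = ["b0", "b1", 0, 0]
vpunpcklqdq = [ymm0[0], ymm1[0], ymm0[2], ymm1[2]]

print(t68, t64, vpunpcklqdq)
```

Under those assumptions all three vectors come out as [a0, b0, 0, 0], which is why the vpunpcklqdq form is equivalent here.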

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96405/new/

https://reviews.llvm.org/D96405