[PATCH] D70607: [x86] make SLM extract vector element more expensive than default

Sun Nov 24 06:03:22 PST 2019

spatel marked an inline comment as done.
spatel added inline comments.

================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/hadd.ll:302
 ; SLM-LABEL: @test_v4i64(
-; SLM-NEXT:    [[TMP1:%.*]] = shufflevector <4 x i64> [[A:%.*]], <4 x i64> [[B:%.*]], <2 x i32> <i32 0, i32 4>
-; SLM-NEXT:    [[TMP2:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 1, i32 5>
-; SLM-NEXT:    [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]
-; SLM-NEXT:    [[TMP4:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 2, i32 6>
-; SLM-NEXT:    [[TMP5:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 3, i32 7>
-; SLM-NEXT:    [[TMP6:%.*]] = add <2 x i64> [[TMP4]], [[TMP5]]
-; SLM-NEXT:    [[R03:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; SLM-NEXT:    ret <4 x i64> [[R03]]
+; SLM-NEXT:    [[TMP1:%.*]] = shufflevector <4 x i64> [[A:%.*]], <4 x i64> [[B:%.*]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
+; SLM-NEXT:    [[TMP2:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
----------------
spatel wrote:
> RKSimon wrote:
> > craig.topper wrote:
> > > I'm not sure I understand what's happening here. SLM doesn't have 256-bit vectors. Is this going to codegen well?
> > Probably the cost model type legalization has kicked in. It may be that it's not handling EXTRACT_SUBVECTOR shuffle costs or something, so it ends up scalarizing?
> I didn't step through SLP, but I agree this is suspicious. But then we end up with virtually identical asm before and after this change:
>   movdqa	%xmm0, %xmm4
>   movdqa	%xmm1, %xmm5
>   punpckhqdq	%xmm2, %xmm0    # xmm0 = xmm0[1],xmm2[1]
>   punpckhqdq	%xmm3, %xmm1    # xmm1 = xmm1[1],xmm3[1]
>   punpcklqdq	%xmm2, %xmm4    # xmm4 = xmm4[0],xmm2[0]
>   punpcklqdq	%xmm3, %xmm5    # xmm5 = xmm5[0],xmm3[0]
>   paddq	%xmm4, %xmm0
>   paddq	%xmm5, %xmm1
> 
I'm still not clear on exactly how SLP does its accounting, but debug output shows that when it used to evaluate the 4-wide vector ops, it saw this:
SLP: Spill Cost = 0.
SLP: Extract Cost = 4.
SLP: Total Cost = 6.

...and decided that would not be profitable. But when it then evaluates doing the ops as 2-wide (128-bit), it sees this:

SLP: Spill Cost = 0.
SLP: Extract Cost = 2.
SLP: Total Cost = -1.
SLP: Vectorizing list at cost:-5.

So that's worth doing. With this patch, it now sees this at 4-wide:

SLP: Spill Cost = 0.
SLP: Extract Cost = 56.
SLP: Total Cost = -40.
SLP: Vectorizing list at cost:-44.

This seems more truthful - the cost of extract on SLM is very large relative to the cost of vector ops.
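
As a rough sanity check on the debug output above, here is a minimal Python sketch. It assumes that the printed total is the sum of the per-node tree cost, the extract cost, and the spill cost; that is an assumption about SLP's accounting, not something confirmed in this thread. The helper name `implied_tree_cost` is hypothetical.

```python
# Hypothetical helper: back out the remaining (per-node) cost from the
# SLP debug output, ASSUMING total = tree + extract + spill.
def implied_tree_cost(total, extract, spill=0):
    return total - extract - spill

# Numbers copied from the debug output quoted in this thread.
runs = {
    "4-wide, before patch": {"total": 6, "extract": 4, "spill": 0},
    "2-wide, before patch": {"total": -1, "extract": 2, "spill": 0},
    "4-wide, with patch": {"total": -40, "extract": 56, "spill": 0},
}

for name, r in runs.items():
    print(f"{name}: implied tree cost = {implied_tree_cost(**r)}")
```

Under that assumption, the remaining cost at 4-wide moves from 2 to -96 with the patch, which would be consistent with the scalar extracts that the vector ops replace being costed much higher.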

The cost model itself deals with illegal types (as here - 256-bit on a subtarget where that is not legal) by doing a simple scaling: see lines 2393, 2412 in the source code diff in this patch.
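
To illustrate what a simple scaling for an illegal type could look like, here is a minimal Python sketch. It assumes the scaling is a multiplication of the legal-width cost by the number of legal-width pieces the type splits into; the function names and the per-op cost value are hypothetical placeholders, not values from the patch.

```python
# Hypothetical sketch of scaling a cost for an illegal vector type.
# ASSUMPTION: cost = (number of legal-width pieces) * (cost at legal width).

def split_factor(vector_bits, legal_bits=128):
    # Integer ceiling division: how many legal-width pieces are needed.
    return (vector_bits + legal_bits - 1) // legal_bits

def scaled_cost(per_legal_op_cost, vector_bits, legal_bits=128):
    return split_factor(vector_bits, legal_bits) * per_legal_op_cost

# <4 x i64> is 256 bits; with 128-bit legal vectors it splits into 2 pieces.
placeholder_cost = 1  # placeholder, not a real cost-table entry
print(split_factor(256))                   # number of 128-bit pieces
print(scaled_cost(placeholder_cost, 256))  # scaled cost for the 256-bit op
```

With these placeholder inputs, a 256-bit op is costed as two 128-bit ops; the actual factor and table entries are whatever the diff in this patch uses.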

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70607/new/

https://reviews.llvm.org/D70607