[PATCH] D70607: [x86] make SLM extract vector element more expensive than default
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sun Nov 24 06:03:22 PST 2019
spatel marked an inline comment as done.
spatel added inline comments.
================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/hadd.ll:302
; SLM-LABEL: @test_v4i64(
-; SLM-NEXT: [[TMP1:%.*]] = shufflevector <4 x i64> [[A:%.*]], <4 x i64> [[B:%.*]], <2 x i32> <i32 0, i32 4>
-; SLM-NEXT: [[TMP2:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 1, i32 5>
-; SLM-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]
-; SLM-NEXT: [[TMP4:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 2, i32 6>
-; SLM-NEXT: [[TMP5:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <2 x i32> <i32 3, i32 7>
-; SLM-NEXT: [[TMP6:%.*]] = add <2 x i64> [[TMP4]], [[TMP5]]
-; SLM-NEXT: [[R03:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP6]], <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; SLM-NEXT: ret <4 x i64> [[R03]]
+; SLM-NEXT: [[TMP1:%.*]] = shufflevector <4 x i64> [[A:%.*]], <4 x i64> [[B:%.*]], <4 x i32> <i32 0, i32 4, i32 2, i32 6>
+; SLM-NEXT: [[TMP2:%.*]] = shufflevector <4 x i64> [[A]], <4 x i64> [[B]], <4 x i32> <i32 1, i32 5, i32 3, i32 7>
----------------
spatel wrote:
> RKSimon wrote:
> > craig.topper wrote:
> > > I'm not sure I understand what's happening here. SLM doesn't have 256-bit vectors. Is this going to codegen well?
> > Probably the cost model type legalization has kicked in. It may be that it's not handling EXTRACT_SUBVECTOR shuffle costs or something, so it ends up scalarizing?
> I didn't step through SLP, but I agree this is suspicious. But then we end up with virtually identical asm before and after this change:
> movdqa %xmm0, %xmm4
> movdqa %xmm1, %xmm5
> punpckhqdq %xmm2, %xmm0 # xmm0 = xmm0[1],xmm2[1]
> punpckhqdq %xmm3, %xmm1 # xmm1 = xmm1[1],xmm3[1]
> punpcklqdq %xmm2, %xmm4 # xmm4 = xmm4[0],xmm2[0]
> punpcklqdq %xmm3, %xmm5 # xmm5 = xmm5[0],xmm3[0]
> paddq %xmm4, %xmm0
> paddq %xmm5, %xmm1
>
I'm still not clear on exactly how SLP does its accounting, but debug output shows that when it used to evaluate the 4-wide vector ops, it saw this:
SLP: Spill Cost = 0.
SLP: Extract Cost = 4.
SLP: Total Cost = 6.
...and decided that would not be profitable. But when it then evaluates doing the ops as 2-wide (128-bit), it sees this:
SLP: Spill Cost = 0.
SLP: Extract Cost = 2.
SLP: Total Cost = -1.
SLP: Vectorizing list at cost:-5.
So that's worth doing. With this patch, it now sees this at 4-wide:
SLP: Spill Cost = 0.
SLP: Extract Cost = 56.
SLP: Total Cost = -40.
SLP: Vectorizing list at cost:-44.
This seems more truthful - the cost of extract on SLM is very large relative to the cost of vector ops.
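The accept/reject decisions in the debug output above are consistent with a simple threshold test on the total cost. The following is only a minimal sketch of that reading, not SLP's actual code: the function name is hypothetical, and the threshold of zero is an assumption, since the log does not show which threshold was in effect.

```cpp
// Hypothetical predicate, assuming a candidate is vectorized only when
// its total cost is strictly below the negated threshold.
// The default threshold of 0 is an assumption, not taken from the log.
bool isProfitableToVectorize(int totalCost, int threshold = 0) {
  return totalCost < -threshold;
}
```

Under that assumption, the old 4-wide total of 6 is rejected, while the 2-wide total of -1 and the new 4-wide total of -40 are both accepted.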
The cost model itself deals with illegal types (as here - 256-bit on a subtarget where that is not legal) by doing a simple scaling: see lines 2393, 2412 in the source code diff in this patch.
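To make the scaling idea concrete, here is a rough sketch assuming the cost of an operation on an illegal type is the cost on the widest legal type multiplied by the number of legal-width pieces the type splits into. The helper names and the per-op cost values are hypothetical and are not taken from the patch.

```cpp
#include <cassert>

// Hypothetical helper: how many legal-width registers are needed to hold
// a vector of vectorBits bits when the widest legal vector is legalBits.
unsigned numLegalParts(unsigned vectorBits, unsigned legalBits) {
  assert(legalBits > 0 && "legal vector width must be nonzero");
  return (vectorBits + legalBits - 1) / legalBits; // round up
}

// Hypothetical scaled cost: per-op cost on the legal type times the
// number of legal-width parts the illegal type is split into.
unsigned scaledCost(unsigned vectorBits, unsigned legalBits,
                    unsigned costPerLegalOp) {
  return numLegalParts(vectorBits, legalBits) * costPerLegalOp;
}
```

For example, with 128-bit legal vectors, a 256-bit operation whose legal-type cost is some value c would be charged 2 * c under this scheme.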
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D70607/new/
https://reviews.llvm.org/D70607