[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.
Vasileios Porpodas via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 17 14:24:34 PDT 2022
vporpo added inline comments.
================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
ABataev wrote:
> vporpo wrote:
> > ABataev wrote:
> > > ABataev wrote:
> > > > vporpo wrote:
> > > > > ABataev wrote:
> > > > > > This block of code has a throughput of 2.5 instead of the 2.0 it had before. I assume there are other cases like this; it needs some extra investigation.
> > > > > How did you come up with these throughput values?
> > > > >
> > > > >
> > > > > The assembly code that comes out of llc for the original code is:
> > > > > ```
> > > > > movl 8(%esp), %eax
> > > > > movl 4(%esp), %ecx
> > > > > vmovupd (%ecx), %xmm0
> > > > > vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
> > > > > vmovq %xmm0, %xmm0 ## xmm0 = xmm0[0],zero
> > > > > vaddpd %xmm1, %xmm0, %xmm0
> > > > > vmovupd %xmm0, (%eax)
> > > > > ```
> > > > >
> > > > > The new code is:
> > > > > ```
> > > > > movl 8(%esp), %eax
> > > > > movl 4(%esp), %ecx
> > > > > vmovsd 8(%ecx), %xmm0 ## xmm0 = mem[0],zero
> > > > > vmovddup (%ecx), %xmm1 ## xmm1 = mem[0,0]
> > > > > vaddpd %xmm1, %xmm0, %xmm0
> > > > > vmovupd %xmm0, (%eax)
> > > > > ```
> > > > > I ran the function in a loop on a Skylake machine and the new code is 25% faster.
> > > > https://godbolt.org/z/3rhGajsaT
> > > >
> > > > The first page is the result without the patch; the second is the result with the patch.
> > > What about this?
> > I am not sure what to do about this: it may have lower throughput, but it has lower latency, so it runs faster. Are we always considering throughput? It looks like in TTI we are mostly counting instructions, at least from what I can see in getShuffleCost():
> > ```
> > {TTI::SK_PermuteSingleSrc, MVT::v2i64, 1}, // pshufd
> > {TTI::SK_PermuteSingleSrc, MVT::v4i32, 1}, // pshufd
> > {TTI::SK_PermuteSingleSrc, MVT::v8i16, 5}, // 2*pshuflw + 2*pshufhw
> > // + pshufd/unpck
> > { TTI::SK_PermuteSingleSrc, MVT::v16i8, 10 }, // 2*pshuflw + 2*pshufhw
> > // + 2*pshufd + 2*unpck + 2*packus
> > ```
> It is still in terms of throughput.
> Yeah, it may be faster on skylake but not on corei7-avx. And there might be similar cases for other CPUs. The estimation criteria need to be tweaked.
What kind of tweaking are you proposing?
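
To make the table excerpt quoted above concrete, here is a minimal sketch of how a cost table of that shape could be consulted. It is only an illustration, not the actual X86 TTI code: `ShuffleKind`, `VecTy`, `CostEntry`, `ShuffleTbl` and `lookupShuffleCost` are hypothetical stand-ins, and the linear scan and the -1 "no entry" result are assumptions made for brevity. The four cost values are copied from the excerpt.

```cpp
// Hypothetical stand-ins for TTI::ShuffleKind and MVT; not the LLVM types.
enum class ShuffleKind { PermuteSingleSrc, Broadcast };
enum class VecTy { v2i64, v4i32, v8i16, v16i8 };

// One row of the table: a (shuffle kind, vector type) pair and its cost.
struct CostEntry {
  ShuffleKind Kind;
  VecTy Ty;
  int Cost;
};

// Costs copied from the getShuffleCost() excerpt quoted above.
static const CostEntry ShuffleTbl[] = {
    {ShuffleKind::PermuteSingleSrc, VecTy::v2i64, 1},  // pshufd
    {ShuffleKind::PermuteSingleSrc, VecTy::v4i32, 1},  // pshufd
    {ShuffleKind::PermuteSingleSrc, VecTy::v8i16, 5},  // 2*pshuflw + 2*pshufhw
                                                       // + pshufd/unpck
    {ShuffleKind::PermuteSingleSrc, VecTy::v16i8, 10}, // 2*pshuflw + 2*pshufhw
                                                       // + 2*pshufd + 2*unpck
                                                       // + 2*packus
};

// Returns the table cost for (K, T), or -1 when the table has no such entry
// (assumption: a real implementation would fall back to a generic estimate).
int lookupShuffleCost(ShuffleKind K, VecTy T) {
  for (const CostEntry &E : ShuffleTbl)
    if (E.Kind == K && E.Ty == T)
      return E.Cost;
  return -1;
}
```

In this sketch each entry holds a single number, so whether that number means an instruction count or a reciprocal throughput cannot be read off the table itself; the annotations in the excerpt list instructions, which is what prompted the question above.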
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121354/new/
https://reviews.llvm.org/D121354