[PATCH] D121354: [SLP] Fix lookahead operand reordering for splat loads.
Alexey Bataev via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Mar 14 16:29:17 PDT 2022
ABataev added inline comments.
================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:1090-1091
+ /// The same load multiple times. This should have a better score than
+ /// `ScoreSplat` because it takes only 1 instruction in x86 for a 2-lane
+ /// vector using `movddup (%reg), xmm0`.
+ static const int ScoreSplatLoads = 3;
----------------
vporpo wrote:
> ABataev wrote:
> > We don't care about the instruction count for SLP, but the throughput.
> Agreed, but usually code with fewer instructions is better for various reasons. How would you want me to rephrase this?
Use the throughput, not number of instructions.
================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/operandorder.ll:181-190
+; CHECK-NEXT: [[FROM_1:%.*]] = getelementptr double, double* [[FROM:%.*]], i32 1
+; CHECK-NEXT: [[V0_1:%.*]] = load double, double* [[FROM]], align 4
+; CHECK-NEXT: [[V0_2:%.*]] = load double, double* [[FROM_1]], align 4
+; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[V0_2]], i64 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[P]], i64 1
+; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[V0_1]], i64 0
+; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x double> [[TMP2]], <2 x double> poison, <2 x i32> zeroinitializer
----------------
vporpo wrote:
> ABataev wrote:
> > This block of code has throughput 2.5 instead of 2.0 before. I assume there are some other cases; this needs some extra investigation.
> How did you come up with these throughput values?
>
>
> The assembly code that comes out of llc for the original code is:
> ```
> movl 8(%esp), %eax
> movl 4(%esp), %ecx
> vmovupd (%ecx), %xmm0
> vpermilpd $1, %xmm0, %xmm1 ## xmm1 = xmm0[1,0]
> vmovq %xmm0, %xmm0 ## xmm0 = xmm0[0],zero
> vaddpd %xmm1, %xmm0, %xmm0
> vmovupd %xmm0, (%eax)
> ```
>
> The new code is:
> ```
> movl 8(%esp), %eax
> movl 4(%esp), %ecx
> vmovsd 8(%ecx), %xmm0 ## xmm0 = mem[0],zero
> vmovddup (%ecx), %xmm1 ## xmm1 = mem[0,0]
> vaddpd %xmm1, %xmm0, %xmm0
> vmovupd %xmm0, (%eax)
> ```
> I ran the function in a loop on a Skylake and the new code is 25% faster.
https://godbolt.org/z/3rhGajsaT
The first pane is the result without the patch, the second with the patch.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121354/new/
https://reviews.llvm.org/D121354