[PATCH] D135282: [SLP]Improve costs of vectorized loads/stores by analyzing GEPs.

Valeriy Dmitriev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Nov 18 10:22:37 PST 2022


vdmitrie added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6656
+            continue;
+          ScalarLdCost += TTI->getArithmeticInstrCost(Instruction::Add,
+                                                      Ptr->getType(), CostKind);
----------------
ABataev wrote:
> vdmitrie wrote:
> > ABataev wrote:
> > > vdmitrie wrote:
> > > > ABataev wrote:
> > > > > vdmitrie wrote:
> > > > > > We see quite a significant performance regression related to this patch.
> > > > > > It does not look the right adjustment. For x86 specifically these GEPS cost nothing as they end up merely as different displacement values in memory operands. So the bias towards vectorization isn't justified for plain loads and stores.
> > > > > > It can be seen even for test case test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll
> > > > > > Vectorization makes code less profitable here. It already existed before the patch but this patch but cost modeling although tipped over to vectorization it was close enough to say "not profitable".
> > > > > >  But now we have even more bias.
> > > > > > 
> > > > > > Vecorized code:
> > > > > > Instruction Info:
> > > > > > [1]: #uOps
> > > > > > [2]: Latency
> > > > > > [3]: RThroughput
> > > > > > [4]: MayLoad
> > > > > > [5]: MayStore
> > > > > > [6]: HasSideEffects (U)
> > > > > > 
> > > > > > [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
> > > > > >  1      1     0.25                        subq  $136, %rsp
> > > > > >  1      0     0.17                        xorl  %ecx, %ecx
> > > > > >  1      0     0.17                        xorl  %eax, %eax
> > > > > >  1      5     0.50    *                   movq  (%rdi,%rcx), %xmm0
> > > > > >  1      5     0.50    *                   movq  16(%rdi,%rcx), %xmm1
> > > > > >  1      1     0.33                        paddd %xmm0, %xmm1
> > > > > >  1      2     1.00                        movd  %xmm1, %edx
> > > > > >  1      1     0.25                        addl  %eax, %edx
> > > > > >  2      1     1.00           *            movq  %xmm1, -128(%rsp,%rcx)
> > > > > >  1      1     1.00                        pshufd        $85, %xmm1, %xmm0
> > > > > >  1      2     1.00                        movd  %xmm0, %eax
> > > > > >  1      1     0.25                        addl  %edx, %eax
> > > > > >  1      1     0.25                        addq  $32, %rcx
> > > > > >  1      1     0.25                        cmpq  $256, %rcx
> > > > > >  1      1     0.50                        jne   .LBB0_1
> > > > > >  1      1     0.25                        addq  $136, %rsp
> > > > > >  3      7     1.00                  U     retq
> > > > > > 
> > > > > > Original:
> > > > > > 
> > > > > > [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
> > > > > >  1      1     0.25                        subq  $136, %rsp
> > > > > >  1      0     0.17                        xorl  %eax, %eax
> > > > > >  1      1     0.25                        movq  $-256, %rcx
> > > > > >  1      5     0.50    *                   movl  272(%rdi,%rcx), %edx
> > > > > >  2      6     0.50    *                   addl  256(%rdi,%rcx), %edx
> > > > > >  1      1     1.00           *            movl  %edx, 128(%rsp,%rcx)
> > > > > >  1      5     0.50    *                   movl  276(%rdi,%rcx), %esi
> > > > > >  2      6     0.50    *                   addl  260(%rdi,%rcx), %esi
> > > > > >  1      1     1.00           *            movl  %esi, 132(%rsp,%rcx)
> > > > > >  1      1     0.25                        addl  %esi, %eax
> > > > > >  1      1     0.25                        addl  %edx, %eax
> > > > > >  1      1     0.25                        addq  $32, %rcx
> > > > > >  1      1     0.50                        jne   .LBB0_1
> > > > > >  1      1     0.25                        addq  $136, %rsp
> > > > > >  3      7     1.00                  U     retq
> > > > > > 
> > > > > > 
> > > > > I can say, it was expected. That's why there was a discussion about using getGEPCost instead of this. This changes just syncs cost estimation for masked gathers and vector loads. As you noted, we already had the issue with the geps costs. We need to fix this. It would help if Intel will try to implement their part of getGEPCost and we can start using it here for better cost estimation.
> > > > You got me wrong. I did not say we already had issues with geps. What I did I did say is: we already had issues with vectorizing sequences when we should not. Most issues with these wrongful vectorizations come from shuffles and permutations generated. And the example, which is the test case in this patch (remark_not_all_parts.ll) merely does show that vectorized code is worse that the original.
> > > >  And I don't think that you anticipated 60% performance regression from this patch. But that is what we have now. I suggest you to revert this change for these reasons:
> > > > - huge regression it introduced. It makes bad things even worse. Cost of inserts and permutations on integer vectors seems underestimated. That is where most regressions come from. But adding unjustified bias to the cost towards vectorization makes the problem even worse.
> > > > - the CM heuristics added does not reflect real thigs - it goes into displacement part of a memory operand which costs nothing.
> > > > - test cases which changed within this patch do not show where the patch would help. Moreover they create impression that nothing has changed. But that isn't the case. Can you show any real test case which shows how this patch improves vectorization?
> > > > 
> > > > Aligning gather loads does not justify enough IMO. May be gathers have this same issue?
> > > > Can you point at a test case for gather loads?
> > > > 
> > > > 
> > > It is not to improve the vectorization but to fix the cost difference between vector loads and masked gather. If we're going to revert it, we need to remove the geps cost estimation for masked gathers. Otherwise there are cases, where consecutive loads are less profitable than the masked gather, and it leads to the perf regressions.
> > Yes, we better to focus on these cases and find out why gather loads look more profitable (instead of making unit stride load look less profitable). It can be because of the same issue with geps or it can be that gather load cost itself is too optimistic.
> > For gather loads basically indices are populated at run time - so this per-element ADD cost should go into gather load rather than scalar loads. Although it most likely will be a single load + ADD of displacements stored in constants pool.
> The reason is know - need to fix the cost estimation for GEPs. And we need to fix getGEPCost function. Without this we are overoptimistic about GEPs. And we need to fix the cost in getGEPCost function and use it in the cost estimation. This will fix the regression introduced in this patch and fix general problem.
> Reverting the patch won't fix the issue, it will just hide it.
Could you point out the test case (where GEPS are incorrectly estimated) please?
So far general problem that I see is that scalar load cost too overestimated and that adds too much favor towards vectorization.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D135282/new/

https://reviews.llvm.org/D135282



More information about the llvm-commits mailing list