[PATCH] D124616: [TTI][X86] Fix splat-load cost when load+broadcast cannot be combined.

Valeriy Dmitriev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 2 19:18:35 PDT 2022


vdmitrie added inline comments.


================
Comment at: llvm/lib/Target/X86/X86TargetTransformInfo.cpp:1616
+      bool LoadCanBeCombined =
+          L->hasOneUse() && isa<Instruction>(L->user_back());
+      if (ST->hasSSE3() && LoadCanBeCombined)
----------------
vporpo wrote:
> vdmitrie wrote:
> > vporpo wrote:
> > > dmgreen wrote:
> > > > The user of an instruction should always be an instruction.
> > > > 
> > > > Just to make sure - it isn't a problem for the SLP vectorizer is it? That the load might have multiple uses, which we are turning into a splat?
> > > Oops, I will fix it.
> > > 
> > > Good point. I did not see any test failures, but I am pretty sure we won't get the right shuffle cost because we are querying the cost model before generating the vector code (which btw is a good reason for testing the actual cost values with something like https://reviews.llvm.org/D124802). 
> > > Hmm not sure what the best approach would be. Perhaps we could go through the users and count that we have at most one vector instruction. In this way SLP will still work as the users will be scalar and TTI still works with regular vector instructions. Any thoughts?
> > This new part of TTI interface actually lacks description. For broadcast kind it sounds like we better to pass just one load instruction which we want to broadcast rather than the whole set that form the splat (otherwise we indeed applying assumption wrt intended use of the interface). Since we are broadcasting a scalar load  its users are either scalar instructions or an insertelement.  What did you mean by vector instruction user?
> > 
> > During SLP cost modeling we have:
> > 
> >     %i1 = load double, double* %gep, align 8
> >     %i1 = load double, double* %gep, align 8
> >      fmul double %i1, ..
> >      fmul double %i1, ..
> > ...
> >     . fadd double %i1, ...
> > 
> > which we may transform into 
> > 
> >   %i1 = load double, double* %gep, align 8
> >   %0 = insertelement <2 x double> poison, double %i1, i32 0
> >   %1 = insertelement <2 x double> %0, double %i1, i32 1
> >     fmul <2 x double> %1
> >    ...
> >      fadd double %i1, ...
> > 
> > i.e. we have two uses to form the splat and one extra user. You probably intended to check one user per lane (i.e. each use will form a lane in a vector). But that is not that straightforward when we for example trying to vectorize with 2 or more users of the splat.
> > 
> > If we end up vectorizing the tree
> > CG can generate either code like this:
> > 
> >    vmovsd   xmm1, mem64         ; load          (throughput 0.5)
> >    vmovddup xmm2, xmm1          ; register form  (throughput 1)
> >    vmulpd xmm2..
> > 
> >    vaddsd  xmm1
> > 
> > or like this
> >    vmovddup xmm1, mem64        ; load form  (throughput 0.5)
> >    vmulpd xmm1..
> > 
> >    vaddsd  xmm1                ; scalar user
> > 
> > 
> > Based on costs I tend to think CG should always prefer the second option (i.e. memory form)
> > CG may consider other factors than cost when life range between load and its scalar
> > users crosses block boundaries but since extract from vector register on x86 is essentially
> > free it is again makes memory form of movddup look more profitable.
> > We need somebody familiar with codegen to comment here.
> > 
> > 
> > 
> I agree, the description is not very clear. My understanding is that we need `getShuffleCost()` to work on both scalar code or vector code:
> 
> - Case 1: If we query TTI from the SLP pass, the code is still scalar as we have not generated any vector instructions yet, so there are no `insertelement` instructions, the code would look like this:
> ```
> %ld = load double,...
> %mul1 = fmul double %ld, ...
> %mul2 = fmul double %ld, ...
> ...
> %add1 = fadd double %ld, ...
> %add2 = fadd double %ld, ...
> ```
> So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ?
> 
> - Case 2: If we have vector code, it could look like this:
> ```
> %ld = load double, ...
> %vec  = insertelement undef, %ld, 0
> %bcast1 = shufflevector %vec, ...
> ```
> In this case the operand of the TTI getShuffleCost function should still be the scalar load. In this case the load should have a single user and that user should only have a single user too.
> 
> 
> One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2.
> So the operand of the broadcast is `%ld` but the load has multiple users. In this case I am not sure how we would guarantee that `%ld` won't have some other user that is not getting vectorized. Perhaps just check that `num_users % VL == 0` ?
The main issue here is that even if condition `num_users % VL == 0` is true there is no guarantee that all these uses will be vector eventually. Only caller supposed to know that. 

> 
> - Case 2: If we have vector code, it could look like this:
> ```
> %ld = load double, ...
> %vec  = insertelement undef, %ld, 0
> %bcast1 = shufflevector %vec, ...
> ```
> In this case the operand of the TTI getShuffleCost function should still be the scalar load.

Ah, that's an important note. Thanks.

> One way to check which case we are in is to check the type of the users. If the users are scalar, then we are in Case 1. If the users are vector then we are in Case 2.

For me it is still not obvious why having extra (scalar) uses in case 2 would prevent combining load with broadcast shuffle. In my experiments I have not found any difference in generated code between

  %0 = insertelement <2 x double> poison, double %ld, i32 0
  %1 = shufflevector <2 x double> %0, <2 x double> poison, <2 x i32> zeroinitializer

versus

  %0 = insertelement <2 x double> poison, double %ld, i32 0
  %1 = insertelement <2 x double> %0, double %ld, i32 1

Even having extra scalar use of the load did not prevent combining broadcast with load.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D124616/new/

https://reviews.llvm.org/D124616



More information about the llvm-commits mailing list