[PATCH] D131118: [LV] Add generic scalarization support for unpredicated scalable vectors

David Sherwood via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 16 03:17:45 PDT 2022


david-arm added a comment.

Hi @reames, just to be clear up front: I completely understand the intent of the patch, and being able to fall back on scalarisation like this for scalable vectors seems like a useful addition.

However, to give some context on why I'm worried about the patch as it stands, I tried it out on a really simple (and obviously contrived!) loop like this:

  void inv_store_i16(short* __restrict__ dst, short* __restrict__ src, long n) {
    for (long i = 0; i < n; i++) {
      *dst = src[i];
    }
  }

corresponding to the test `inv_store_i16` changed in this patch. On a neoverse-v1 (SVE) machine I created a micro-benchmark that iterated over this a number of times. Here are some rough results:

scalar loop: 4.9s
vector loop with SVE scatters: 4.5s
vector loop with scalarised op (using this patch): 13.7s
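
For reference, a harness along these lines reproduces the setup (the N/REPS values are placeholders, not the exact ones behind the timings above):

  // Illustrative harness only; N and REPS are arbitrary and not the exact
  // values used for the timings above.
  #include <chrono>
  #include <cstdio>
  #include <vector>

  __attribute__((noinline))
  void inv_store_i16(short* __restrict__ dst, short* __restrict__ src, long n) {
    for (long i = 0; i < n; i++)
      *dst = src[i];
  }

  int main() {
    const long N = 4096;
    const int REPS = 100000;
    std::vector<short> src(N, 1);
    short dst = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < REPS; ++r)
      inv_store_i16(&dst, src.data(), N);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%.2fs\n", std::chrono::duration<double>(t1 - t0).count());
    return dst != 1;
  }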

Those numbers strongly suggest the cost model is now broken, because we end up choosing the worst of the available widening decisions, and as it stands the patch regresses performance for SVE. In another comment I've suggested two possible ways forward and I'd be happy with either. Of course, I'm happy to listen to other ideas too!

I'll try to review the technical changes in the patch properly soon ...



================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:6870
+          // the generic replicate path which isn't yet implemented.
+          if (!foldTailByMasking())
+            return true;
----------------
reames wrote:
> david-arm wrote:
> > reames wrote:
> > > reames wrote:
> > > > david-arm wrote:
> > > > > I think you can avoid the scalarisation for SVE here simply by asking for the scalarisation cost of the instruction, similarly to how it's done elsewhere. For SVE this should return Invalid. Alternatively you could add a TTI hook to ask whether the target should scalarise or not, i.e.
> > > > > 
> > > > >   if (!foldTailByMasking())
> > > > >     return isLegalToScalarize();
> > > > > 
> > > > > We have always considered it 'illegal' to scalarise for SVE.
> > > > You seem to be missing the point of the code.
> > > > 
> > > > The costing here is done at two levels: TTI and LV.  TTI returns invalid costs for a bunch of cases, which LV then turns around and decides to scalarize by reasoning about the scalarization cost and scalar op cost directly.
> > > > 
> > > > Doing as you suggest here completely prevents the use of the added lowering.  
> > > > 
> > > > Adding a TTI hook is feasible, but I don't have a good name/semantic for it.  Any suggestions?  
> > > Another possibility here: would you be okay with the scalarization being behind a flag for now, and the cost model discussion (i.e. profitability) being done in a separate review?  I want to avoid too much coupling here, and when I tried various different approaches to costing, a ton of unrelated issues started falling out.
> > That might be an option too. I'll think about this more in the morning!
> A further point for your consideration.
> 
> The scatter based lowering here has the following cost:
> Found an estimated cost of 81 for VF vscale x 4 For instruction:   store i16 %ld, i16* %dst, align 2
> 
> Assuming this is a correct cost for the scatter, this is blatantly unprofitable.  Essentially, we've switched from one unprofitable vectorization to another.  As such, I don't think this test is particularly interesting.
> 
> 
> 
Given the comment

  // If not predicated, we can now scalarize generically with a loop

this check looks wrong to me. I think we should be doing:

  if (!blockNeedsPredicationForAnyReason())

instead of

  if (!foldTailByMasking())

so that we also don't scalarise loops with conditional uniform memory operations. Or if the scalarisation code does support conditional ops, then perhaps the comment needs updating?
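
For example, a conditional invariant store like the one below (an illustrative loop of mine, not one of the patch's tests) has a block that needs predication even though the tail isn't being folded by masking, so `!foldTailByMasking()` alone wouldn't catch it:

  // Illustrative only: the store sits in a conditionally-executed block, so
  // blockNeedsPredicationForAnyReason() is true for it, while tail folding
  // by masking may still be disabled for the loop as a whole.
  void cond_inv_store_i16(short* __restrict__ dst, short* __restrict__ src,
                          long n) {
    for (long i = 0; i < n; i++)
      if (src[i] > 0)
        *dst = src[i];
  }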


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:6893
         const InstructionCost ScalarizationCost = isLegalToScalarize() ?
           getUniformMemOpCost(&I, VF) : InstructionCost::getInvalid();
 
----------------
david-arm wrote:
> It sounds like getUniformMemOpCost is not returning Invalid here for SVE. I'm not sure why, but by returning true from isLegalToScalarize it relies on the cost being sensible. When I'm back at work tomorrow morning I'll take a better look.
I'm now convinced that in its current form the patch is wrong, because `getUniformMemOpCost` assumes that for stores we will perform a normal vector store, i.e.:

  StoreInst *SI = cast<StoreInst>(I);

  bool isLoopInvariantStoreValue = Legal->isUniform(SI->getValueOperand());
  return TTI.getAddressComputationCost(ValTy) +
         TTI.getMemoryOpCost(Instruction::Store, ValTy, Alignment, AS,
                             CostKind) +
         (isLoopInvariantStoreValue
              ? 0
              : TTI.getVectorInstrCost(Instruction::ExtractElement, VectorTy,
                                       VF.getKnownMinValue() - 1));

whereas we're actually scalarising in some cases through generation of an inner vector loop. I can think of two ways forward here:

1. Introduce this feature under the control of a flag that is off by default, then in a later patch fix up the costs to properly calculate the scalarisation cost, including the loop overhead (pre-header, phi nodes, IV cmp+branch, etc.).
2. Fix the cost as part of this patch.

As you say, provided the scalarisation cost is fairly sensible the vectoriser should choose the best widening decision, but right now we're making decisions based on what look like very incorrect costs.
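
To make option 2 a little more concrete, I'd expect the scalarised cost to look very roughly like the sketch below. This is purely illustrative (the loop-overhead constant is a placeholder, not anything in the patch); it just costs VF scalar memory ops plus something for the generated inner loop:

  // Rough sketch only, reusing the names already in scope in
  // getUniformMemOpCost (ValTy, Alignment, AS, CostKind, VF, I).
  Type *ScalarTy = ValTy->getScalarType();
  InstructionCost ScalarOpCost =
      TTI.getAddressComputationCost(ScalarTy) +
      TTI.getMemoryOpCost(I->getOpcode(), ScalarTy, Alignment, AS, CostKind);
  // Placeholder for the pre-header, phi nodes and IV cmp+branch of the
  // generated inner loop.
  InstructionCost LoopOverhead = 4;
  // For scalable VFs the element count is only a known minimum, so in
  // practice vscale would need estimating rather than using the min value.
  return ScalarOpCost * VF.getKnownMinValue() + LoopOverhead;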


================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.cpp:800
                                 BasicBlock *LoopExitBB) {
+  //LoopHeaderBB->getParent()->dump();
   // The vector body may be more than a single basic-block by this point.
----------------
nit: Stray comment.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131118/new/

https://reviews.llvm.org/D131118


