[PATCH] D75525: [TTI][ARM][MVE] Refine gather/scatter cost model

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 4 02:20:31 PST 2020


dmgreen added a comment.

The fact that the cost of zexts/sexts changes so much with context is pretty annoying. This "per single instruction" cost model is having trouble keeping up. We might eventually need to do something about that and come up with something better at a higher level, but I'm not sure what that would look like.
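For example (just a sketch, with made-up %ptrs/%mask values), in something like:

  ; the zext next to a gather can fold into a single extending gather load,
  ; but the exact same zext elsewhere would have a real cost of its own
  %g = call <4 x i16> @llvm.masked.gather.v4i16.v4p0i16(<4 x i16*> %ptrs, i32 2, <4 x i1> %mask, <4 x i16> undef)
  %z = zext <4 x i16> %g to <4 x i32>

the zext can be treated as basically free, which is hard to express when each instruction is costed on its own.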

This looks like a nice improvement over what was already there.



================
Comment at: llvm/include/llvm/Analysis/TargetTransformInfo.h:1012
                             ArrayRef<Value *> Args, FastMathFlags FMF,
-                            unsigned VF = 1) const;
+                            const Instruction *I, unsigned VF = 1) const;
 
----------------
Is it possible to keep I last and give it a default argument of nullptr? Or does that cause other problems?
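i.e. keeping the declaration shaped something like this (rough sketch, assuming the existing callers/overloads are happy with it):

                              ArrayRef<Value *> Args, FastMathFlags FMF,
                              unsigned VF = 1,
                              const Instruction *I = nullptr) const;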


================
Comment at: llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:897
+  unsigned ExtSize = EltSize;
+  // check whether there's a single user that asks for an extended type
+  if (I != nullptr) {
----------------
check -> Check


================
Comment at: llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:907
+      if (isa<ZExtInst>(Us) || isa<SExtInst>(Us)) {
+        // only allow valid type combinations
+        unsigned TypeSize =
----------------
only -> Only


================
Comment at: llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:921
+        (match(I, m_Intrinsic<Intrinsic::masked_scatter>()) &&
+         I->getNumUses() == 0 && (T = dyn_cast<TruncInst>(I->getOperand(1))))) {
+      // only allow valid type combinations
----------------
I don't think this I->getNumUses() == 0 check is needed?
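i.e. that clause could probably just be (sketch, keeping the rest of the condition as-is):

          (match(I, m_Intrinsic<Intrinsic::masked_scatter>()) &&
           (T = dyn_cast<TruncInst>(I->getOperand(1))))) {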


================
Comment at: llvm/test/Analysis/CostModel/ARM/mve-gather-scatter-cost.ll:187-197
+  ; result zext
+  %gep5 = getelementptr i16, i16* %base16, <4 x i16> %ind16
+  %gep5.2 = getelementptr i32, i32* %base, <4 x i32> %ind32
+  %res5 = call <4 x i16> @llvm.masked.gather.v4i16.v4p0i16(<4 x i16*> %gep5, i32 2, <4 x i1> %mask, <4 x i16> undef)
+  %res5zext = zext <4 x i16> %res5 to <4 x i32>
+  call void @llvm.masked.scatter.v4i32.v4p0i32(<4 x i32> %res5zext, <4 x i32*> %gep5.2, i32 4, <4 x i1> %mask)
+  
----------------
Can you move these into gep_v4i16, and add a trunc to the scatters? Maybe something like:
  %res6 = call <4 x i16> @llvm.masked.gather.v4i16.v4p0i16(<4 x i16*> %gep5, i32 2, <4 x i1> %mask, <4 x i16> undef)
  %res6sext = sext <4 x i16> %res6 to <4 x i32>
  %res6trunc = trunc <4 x i32> %res6sext to <4 x i16>
  call void @llvm.masked.scatter.v4i16.v4p0i16(<4 x i16> %res6trunc, <4 x i16*> %gep5, i32 2, <4 x i1> %mask)

If I haven't got that wrong, those scatters should be cheap(ish). It's probably worth adding similar v4i8 and v8i8 tests in too.
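A v8i8 version might look roughly like this (sketch; it assumes i8* %base8, <8 x i8> %ind8 and <8 x i1> %mask8 style arguments, so rename things to fit whatever the test function actually has):

  ; v8i8 gather sext'd to v8i16, then truncated back to v8i8 and scattered
  %gep8 = getelementptr i8, i8* %base8, <8 x i8> %ind8
  %res8 = call <8 x i8> @llvm.masked.gather.v8i8.v8p0i8(<8 x i8*> %gep8, i32 1, <8 x i1> %mask8, <8 x i8> undef)
  %res8sext = sext <8 x i8> %res8 to <8 x i16>
  %res8trunc = trunc <8 x i16> %res8sext to <8 x i8>
  call void @llvm.masked.scatter.v8i8.v8p0i8(<8 x i8> %res8trunc, <8 x i8*> %gep8, i32 1, <8 x i1> %mask8)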


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75525/new/

https://reviews.llvm.org/D75525




