[llvm] [AMDGPU] Enable vectorization of i8 values. (PR #134934)

Mon Apr 21 17:12:42 PDT 2025

================
@@ -1423,3 +1425,30 @@ void GCNTTIImpl::collectKernelLaunchBounds(
   LB.push_back({"amdgpu-waves-per-eu[0]", WavesPerEU.first});
   LB.push_back({"amdgpu-waves-per-eu[1]", WavesPerEU.second});
 }
+
+InstructionCost GCNTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
+                                            Align Alignment,
+                                            unsigned AddressSpace,
+                                            TTI::TargetCostKind CostKind,
+                                            TTI::OperandValueInfo OpInfo,
+                                            const Instruction *I) {
+  if (VectorType *VecTy = dyn_cast<VectorType>(Src))
+    if (Opcode == Instruction::Load &&
+        VecTy->getElementType() ==
+            IntegerType::getInt8Ty(VecTy->getContext())) {
+      unsigned ElementCount = VecTy->getElementCount().getFixedValue();
+      return ((ElementCount - 1) / 4) + 1;
----------------
jrbyrnes wrote:

The cost shouldn't be the ratio of i32 to i8, but rather the number of loads needed. I think it should always be 1 as long as the width of the load is supported by the hardware. We may want to assert or check the vector width against the supported load/store width for the address space.
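
To make the difference concrete, here is a minimal standalone sketch, not LLVM code. `numLoadsForVector` and `patchCost` are hypothetical names, and `MaxLoadBits` is an assumed parameter standing in for whatever load width the address space and subtarget actually support. It compares the formula in the patch with a count of the loads needed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of LLVM): number of memory operations needed
// to load a fixed-width vector, given the widest single load the target
// supports. MaxLoadBits is an assumption; the real limit depends on the
// address space and subtarget.
inline uint64_t numLoadsForVector(uint64_t ElementCount, uint64_t ElementBits,
                                  uint64_t MaxLoadBits) {
  assert(MaxLoadBits > 0 && "maximum load width must be non-zero");
  uint64_t TotalBits = ElementCount * ElementBits;
  if (TotalBits == 0)
    return 0;
  // Round up: a partially filled load still costs one load.
  return (TotalBits + MaxLoadBits - 1) / MaxLoadBits;
}

// The formula from the patch, reproduced for comparison: one cost unit per
// group of four i8 elements (i.e. per i32-sized chunk).
inline uint64_t patchCost(uint64_t ElementCount) {
  return ((ElementCount - 1) / 4) + 1;
}
```

Assuming, for illustration only, a 128-bit maximum load width, a `<16 x i8>` load fits in one load, so `numLoadsForVector(16, 8, 128)` is 1, while `patchCost(16)` is 4.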

https://github.com/llvm/llvm-project/pull/134934