[llvm] AMDGPU: Improve getShuffleCost accuracy for 8- and 16-bit shuffles (PR #168818)

Thu Nov 20 16:59:03 PST 2025

================
@@ -1241,46 +1241,108 @@ InstructionCost GCNTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
       (ScalarSize == 16 || ScalarSize == 8)) {
     // Larger vector widths may require additional instructions, but are
     // typically cheaper than scalarized versions.
-    unsigned NumVectorElts = cast<FixedVectorType>(SrcTy)->getNumElements();
-    unsigned RequestedElts =
-        count_if(Mask, [](int MaskElt) { return MaskElt != -1; });
-    unsigned EltsPerReg = 32 / ScalarSize;
-    if (RequestedElts == 0)
+    //
+    // We assume that shuffling at a register granularity can be done for free.
+    // This is not true for vectors fed into memory instructions, but it is
+    // effectively true for all other shuffling. The emphasis of the logic here
+    // is to assist generic transform in cleaning up / canonicalizing those
+    // shuffles.
+    unsigned NumDstElts = cast<FixedVectorType>(DstTy)->getNumElements();
+    unsigned NumSrcElts = cast<FixedVectorType>(SrcTy)->getNumElements();
----------------
nhaehnle wrote:

It's not really clear to me what most of these are meant to mean for scalable vectors, so instead of doing it in terms of ElementCount I'm adding type checks and returning `InstructionCost::getInvalid()` if it's not fixed vectors, which matches what AArch64 does at least.

https://github.com/llvm/llvm-project/pull/168818