[llvm] [AMDGPU] LiveRegOptimizer: fix PHI same-BB filter; consider i8/i16 binops on SDWA (PR #155800)

Mon Sep 15 04:10:17 PDT 2025

================
@@ -126,7 +126,37 @@ class LiveRegOptimizer {
     return LK.first != TargetLoweringBase::TypeLegal;
   }
 
-  bool isOpLegal(Instruction *I) { return isa<StoreInst, IntrinsicInst>(I); }
+  bool isOpLegal(Instruction *I) {
+    if (auto *Intr = dyn_cast<IntrinsicInst>(I))
+      return true; // FIXME: narrow to known native intrinsics (DOT/MFMA/tbuffer) or use TTI cost.
----------------
michaelselehov wrote:

Sorry, posted this reply in the wrong place. Reposting here.

@arsenm, I instrumented the TTI queries on gfx90a: `add <4 x i8>` comes out at cost 4 with TCK_SizeAndLatency (and likewise from getArithmeticInstrCost), which is below the previous profitability threshold of 8. Switching to a TTI-only check would therefore reintroduce the regression. I propose keeping the very narrow SDWA safety net for v4i8/v2i16 (≤32 bits) here and looking at improving the AMDGPU TTI costs separately if needed.
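
To make the numbers concrete, here is a minimal standalone sketch of the gating logic (a model, not the patch itself). All function and constant names are hypothetical, and the direction of the TTI comparison (`Cost >= Threshold` means the op is accepted) is an assumption made only for illustration; the sketch does not call any LLVM API.

```cpp
// Standalone model of the gating discussed above; this is NOT LLVM code.
// Every name below is hypothetical and exists only for illustration.

// Numbers quoted in the comment for gfx90a.
constexpr unsigned kMeasuredCostAddV4I8 = 4; // reported TTI cost of `add <4 x i8>`
constexpr unsigned kProfitThreshold = 8;     // previous profitability threshold

// ASSUMPTION: a TTI-only gate accepts an op once its cost reaches the threshold.
constexpr bool passesTTIGate(unsigned Cost, unsigned Threshold) {
  return Cost >= Threshold;
}

// Narrow safety net: i8/i16 element vectors that fit in 32 bits (e.g. v4i8, v2i16).
constexpr bool fitsSDWASafetyNet(unsigned ElemBits, unsigned NumElems) {
  return (ElemBits == 8 || ElemBits == 16) && ElemBits * NumElems <= 32;
}

// Combined predicate: the TTI gate, plus the safety net for the narrow cases.
constexpr bool isCandidate(unsigned Cost, unsigned ElemBits, unsigned NumElems) {
  return passesTTIGate(Cost, kProfitThreshold) ||
         fitsSDWASafetyNet(ElemBits, NumElems);
}

// With the measured cost, the TTI gate alone does not fire for `add <4 x i8>` ...
static_assert(!passesTTIGate(kMeasuredCostAddV4I8, kProfitThreshold),
              "cost 4 is below threshold 8");
// ... but the safety net still covers it.
static_assert(isCandidate(kMeasuredCostAddV4I8, 8, 4),
              "v4i8 is covered by the safety net");
```

Under these assumptions, a 64-bit `<8 x i8>` op at the same cost would fall through both checks, which is the sense in which the safety net stays narrow.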

https://github.com/llvm/llvm-project/pull/155800