[llvm] [AMDGPU] LiveRegOptimizer: fix PHI same-BB filter; consider i8/i16 binops on SDWA (PR #155800)
via llvm-commits
llvm-commits at lists.llvm.org
Fri Sep 12 05:19:57 PDT 2025
================
@@ -150,7 +184,12 @@ class LiveRegOptimizer {
if (!CVisited.insert(CII).second)
continue;
- if (CII->getParent() == II->getParent() && !IsLookThru(II))
+ // Allow same-BB non-lookthrough users when the def is a PHI:
+ // loop headers frequently consume the carried value in the header block
+ // (e.g. byte-wise vector binops). We *do* want to coerce across the
+ // backedge in that common case to enable packed i32 + SDWA lowering.
+ if (CII->getParent() == II->getParent() && !IsLookThru(CII) &&
+ !isa<PHINode>(II))
----------------
michaelselehov wrote:
@arsenm, I instrumented the TTI queries on gfx90a: `add <4 x i8>` comes out at cost 4 with TCK_SizeAndLatency (and likewise for getArithmeticInstrCost), which is below the previous profitability threshold (8). So switching to TTI-only would reintroduce the regression. I propose to keep the very narrow SDWA safety-net for v4i8/v2i16 (≤32b) here and look at improving AMDGPU TTI separately if needed.
https://github.com/llvm/llvm-project/pull/155800
More information about the llvm-commits
mailing list