[llvm] [AMDGPU][LRO] LRO fix PHI same-BB filter; treat i8/i16 binops as profitable (PR #155800)
Matt Arsenault via llvm-commits
llvm-commits at lists.llvm.org
Thu Aug 28 03:02:52 PDT 2025
================
@@ -0,0 +1,67 @@
+; REQUIRES: amdgpu-registered-target
+; RUN: opt -S -passes=amdgpu-late-codegenprepare \
+; RUN: -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s | FileCheck %s
+
+; Purpose:
+; - Input has a loop-carried PHI of type <4 x i8> and byte-wise adds in the
+; loop header (same basic block as the PHI).
+; - After amdgpu-late-codegenprepare, the PHI must be coerced to i32 across
+; the backedge, and a single dominating "bitcast i32 -> <4 x i8>" must be
+; placed in the header (enabling SDWA-friendly lowering later).
+;
+; What we check:
+; - PHI is i32 (no loop-carried <4 x i8> PHI remains).
+; - A header-local bitcast i32 -> <4 x i8> exists and feeds the vector add.
+; - The loopexit produces a bitcast <4 x i8> -> i32 for the backedge.
+
+target triple = "amdgcn-amd-amdhsa"
+
+define amdgpu_kernel void @lro_coerce_v4i8_phi(i8* nocapture %p, i32 %n) #0 {
+entry:
+ br label %loop
+
+loop:
+ ; Loop index
+ %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
+
+ ; Loop-carried accumulator in vector-of-bytes form (problematic on input).
+ %acc = phi <4 x i8> [ zeroinitializer, %entry ], [ %acc.next, %loop ]
+
+ ; Make up four i8 values derived from %i to avoid memory noise.
+ %i0 = trunc i32 %i to i8
+ %i1i = add i32 %i, 1
+ %i1 = trunc i32 %i1i to i8
+ %i2i = add i32 %i, 2
+ %i2 = trunc i32 %i2i to i8
+ %i3i = add i32 %i, 3
+ %i3 = trunc i32 %i3i to i8
+
+ ; Pack them into <4 x i8>.
+ %v01 = insertelement <4 x i8> undef, i8 %i0, i32 0
----------------
arsenm wrote:
```suggestion
%v01 = insertelement <4 x i8> poison, i8 %i0, i32 0
```
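
For context, a minimal sketch of how a full `<4 x i8>` pack chain reads when it starts from `poison` rather than `undef`. The function name `@pack_v4i8` and its parameters are hypothetical and are not taken from the PR's test. Only the first `insertelement` corresponds to the line quoted above. The placeholder is safe here because all four lanes are overwritten before the vector is used.

```llvm
; Hypothetical helper for illustration only; not part of the PR's test file.
define <4 x i8> @pack_v4i8(i8 %a, i8 %b, i8 %c, i8 %d) {
  ; Start from poison: every lane is overwritten below, so no poison
  ; lane survives into the returned vector.
  %v0 = insertelement <4 x i8> poison, i8 %a, i32 0
  %v1 = insertelement <4 x i8> %v0, i8 %b, i32 1
  %v2 = insertelement <4 x i8> %v1, i8 %c, i32 2
  %v3 = insertelement <4 x i8> %v2, i8 %d, i32 3
  ret <4 x i8> %v3
}
```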
https://github.com/llvm/llvm-project/pull/155800