[llvm] [AMDGPU] Fix GFX11 WMMA intrinsic lowering regression for compute kernels (PR #164036)
Luis Chamberlain via llvm-commits
llvm-commits at lists.llvm.org
Sun Oct 19 16:43:56 PDT 2025
mcgrof wrote:
# Pull Request Update - v2
Thank you for the feedback! I've addressed both points:
## 1. ✅ Auto-Generated CHECK Lines
I've regenerated both test files using `utils/update_llc_test_checks.py`. The tests now have complete assembly verification instead of the minimal manual checks:
**Before (manual):**
```llvm
; GFX11-W32-LABEL: test_wmma_f32_16x16x16_f16_kernel:
; GFX11-W32: v_wmma_f32_16x16x16_f16
```
**After (auto-generated):**
```llvm
; GFX11-W32-LABEL: test_wmma_f32_16x16x16_f16_kernel:
; GFX11-W32: ; %bb.0: ; %entry
; GFX11-W32-NEXT: s_load_b256 s[0:7], s[2:3], 0x0
; GFX11-W32-NEXT: v_mov_b32_e32 v24, 0
... (30+ lines of detailed assembly verification)
; GFX11-W32-NEXT: v_wmma_f32_16x16x16_f16 v[16:23], v[0:7], v[8:15], v[16:23]
```
Both test files now have comprehensive assembly checks (172 and 186 lines respectively).
---
## 2. 🔍 ROCm's Alternative Approach
You're absolutely correct - I investigated ROCm's llvm-project and found their solution. They **did** solve this differently, and it's quite elegant:
### ROCm's Approach
They created specialized `ComplexPattern` selectors in C++ that accept both bare and wrapped operands:
**In `AMDGPUISelDAGToDAG.cpp`:**
```cpp
bool AMDGPUDAGToDAGISel::SelectWMMAModsF16Neg(SDValue In, SDValue &Src,
SDValue &SrcMods) const {
Src = In; // ← Accepts bare operands
unsigned Mods = SISrcMods::OP_SEL_1; // ← Default modifiers
// Extract FNEG/FABS if present, but always succeeds
if (auto *BV = dyn_cast<BuildVectorSDNode>(stripBitcast(In))) {
// ... check for modifiers and extract if found ...
}
SrcMods = CurDAG->getTargetConstant(Mods, SDLoc(In), MVT::i32);
return true; // ← Always succeeds, even with bare operands
}
```
**In `VOP3PInstructions.td` (VOP3PWMMA_Profile):**
```tablegen
dag Src0InPat = !cond(
IsAB_F16 : (ins (Src0VT (WMMAModsF16Neg Src0VT:$src0, i32:$src0_modifiers))),
IsAB_BF16 : (ins Src0VT:$src0), // ← Bare for BF16
IsIU : (ins (VOP3PModsNeg i32:$src0_modifiers), Src0VT:$src0),
IsFP8BF8 : (ins Src0VT:$src0)); // ← Bare for FP8
```
These `ComplexPattern` selectors (`WMMAModsF16Neg`, `WMMAModsF32NegAbs`) automatically handle both cases and provide infrastructure that works across all WMMA types.
### Why I Chose the Explicit Pattern Approach
While ROCm's approach is more comprehensive, I believe the explicit pattern approach is better **for upstreaming** because:
1. **Minimal Surface Area**: ~60 lines of TableGen patterns vs touching C++ pattern selection infrastructure
2. **Surgical Fix**: Targets exactly the broken case (GFX11 compute kernels) without affecting:
- GFX12+ code generation
- Graphics shader paths
- Other VOP3P instruction selection
3. **Explicit Intent**: The high-priority patterns with comments make it immediately obvious what's being fixed and why
4. **Easier Review**: Reviewers can verify pattern matching logic directly in TableGen rather than tracing through C++ selection code
5. **Lower Risk**: Modifying `AMDGPUISelDAGToDAG.cpp` affects all VOP3P instruction selection; explicit patterns only affect GFX11 WMMA intrinsics
The ROCm approach would be ideal if we were adding new WMMA variants or needed this flexibility across many instructions. For fixing a specific regression in a single architecture, the targeted approach seems more appropriate.
**Would you prefer I adopt ROCm's infrastructure-level approach?** I can implement the `ComplexPattern` selectors if that's the preferred direction for upstream.
---
## 3. ✅ Hardware Verification
I've verified this fix empirically on AMD Radeon PRO W7900 (gfx1100):
**Test kernel:**
```cpp
v8f32 result = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_vec, b_vec, c_vec);
```
**Compiled assembly output:**
```asm
v_wmma_f32_16x16x16_f16 v[7:14], v[15:22], v[23:30], v[7:14]
```
✅ Confirmed WMMA instruction is generated correctly for compute kernels.
The commit message has been updated to reflect:
- Accurate description of ROCm's ComplexPattern approach
- Rationale for choosing the explicit pattern approach
- Hardware verification on W7900
---
Let me know if you'd like me to implement the ROCm-style `ComplexPattern` selectors instead, or if the current approach is acceptable. Happy to adjust based on upstream preferences!
---
## Quick Summary for Busy Reviewers
**Changes in v2:**
- ✅ Regenerated test CHECK lines with `utils/update_llc_test_checks.py`
- ✅ Investigated ROCm's approach (ComplexPattern selectors in C++)
- ✅ Explained why explicit patterns are better for upstreaming
- ✅ Verified on AMD W7900 hardware
- ✅ Updated commit message with accurate ROCm details
**Key Question:**
Do you prefer the current explicit pattern approach (~60 lines, targeted fix) or should I implement ROCm's ComplexPattern infrastructure (~200+ lines, more comprehensive)?
https://github.com/llvm/llvm-project/pull/164036
More information about the llvm-commits
mailing list