[Mlir-commits] [mlir] [AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs (PR #133204)

Krzysztof Drewniak llvmlistbot at llvm.org
Wed Apr 2 16:32:10 PDT 2025


krzysz00 wrote:

> Ok, so after actually running a few sizes of MatVecs, I see that it runs into the same issue of our "ExpandGPUOps" pass decomposing the subgroup_reduce before it can make it to these patterns.
> 
> So, in conclusion, why does that pattern work? It doesn't...

I think this means IREE's `ExpandGPUOps` needs to be fixed to run this pattern before the expansion to shuffles, or to register it with a higher benefit, then.
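For reference, a minimal sketch of the ordering I mean, assuming the DPP lowering in this PR exposes a populate function along these lines (the DPP function name and signature are my assumption, not a settled API). The only point is that the DPP patterns get a higher `PatternBenefit` than the existing shuffle expansion, so they're tried first when both apply:

```cpp
#include "mlir/Dialect/AMDGPU/Utils/Chipset.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

static void collectSubgroupReducePatterns(RewritePatternSet &patterns,
                                          unsigned subgroupSize,
                                          amdgpu::Chipset chipset) {
  // Assumed entry point for the DPP-based lowering from this PR; tried first
  // because of the higher benefit.
  populateGpuLowerSubgroupReduceToDPPPatterns(patterns, subgroupSize, chipset,
                                              /*benefit=*/2);
  // Existing upstream fallback: expand whatever remains to gpu.shuffle ops.
  populateGpuLowerSubgroupReduceToShufflePatterns(patterns, subgroupSize,
                                                  /*shuffleBitwidth=*/32,
                                                  /*benefit=*/1);
}
```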

> I can't seem to find an equivalent op to permlanex16 defined in the ROCDL or AMDGPU dialects in MLIR; should I be using the intrinsics from LLVM here instead?

You'll at the very least want to add `rocdl.permlanex16` if it doesn't already exist, and you may also want an `amdgpu.permlanex16` wrapper if bitcasts or splitting up vectors are required.
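To make the "bitcasts / splitting up vectors" point concrete, here is a minimal sketch of the kind of packing an `amdgpu.permlanex16` wrapper could hide from callers: reinterpreting sub-32-bit or short-vector values as the single `i32` a 32-bit crosslane instruction operates on. The helper name and structure are illustrative, not code from this PR, and splitting wider vectors into multiple `i32` pieces is omitted.

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/Builders.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Pack `value` (at most 32 bits wide, e.g. f16, bf16, i16, or vector<2xf16>)
// into an i32 suitable for a 32-bit crosslane op such as permlanex16. The
// inverse (trunc and/or bitcast) recovers the value afterwards.
static Value packToI32(OpBuilder &b, Location loc, Value value) {
  Type i32Ty = b.getI32Type();
  Type ty = value.getType();
  if (ty == i32Ty)
    return value;

  if (auto vecTy = dyn_cast<VectorType>(ty)) {
    // Assumes the vector is exactly 32 bits wide, e.g. vector<2xf16>:
    // vector<2xf16> -> vector<1xi32> -> i32.
    auto packedVecTy = VectorType::get(1, i32Ty);
    Value packed = b.create<vector::BitCastOp>(loc, packedVecTy, value);
    return b.create<vector::ExtractOp>(loc, packed, /*position=*/0);
  }

  // Scalars: reinterpret floats as same-width integers, then zero-extend any
  // narrow integer to i32.
  unsigned width = ty.getIntOrFloatBitWidth();
  Value asInt = value;
  if (isa<FloatType>(ty))
    asInt = b.create<arith::BitcastOp>(loc, b.getIntegerType(width), asInt);
  if (width < 32)
    asInt = b.create<arith::ExtUIOp>(loc, i32Ty, asInt);
  return asInt;
}
```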

https://github.com/llvm/llvm-project/pull/133204
