[llvm] [AMDGPU] Add new llvm.amdgcn.subgroup.shuffle intrinsic (PR #167372)
Jay Foad via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 13 05:27:56 PST 2025
jayfoad wrote:
> > I think your lowering logic is _too_ complicated. Let's work on simplifying it and then see if it still needs to be implemented in C++.
>
> So I took a deeper look here, and I'm not finding any immediately obvious ways to simplify. Just so you can follow my line of thought:
>
> In GFX11 I've found no way to permute across the two wave halves in wave64 mode (GFX12 handles this trivially with ds_bpermute_b32). As far as I can tell, to do the wave-wide permute we have to use permlane64 to swap the halves, and then select, for each lane, whether it should take the bpermute result from the same or the opposite half. I think the selection logic is as simple as I can get it: just check whether each lane's ThreadID is on the same side of 32 as the index it wants to pull from. Please let me know if I've missed anything or if you have any ideas for simplifying the logic.
>
> In addition, when I tried removing the set_inactive calls, the final compiled code was actually longer: more VGPRs were used in the wave-wide path, and thus more spilling was required. Let me know if you think that trade-off is worth it to have the simpler lowering logic.
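For reference, the scheme described above can be sketched as a lane-level model in plain Python (the helper names are mine; the half-swap and `bpermute_half` stand in for permlane64 and ds_bpermute_b32, and this is an illustration of the idea, not the actual lowering):

```python
WAVE = 64
HALF = 32

def bpermute_half(values, indices, base):
    # Stand-in for ds_bpermute_b32 restricted to one 32-lane half:
    # each lane reads the value at (its index mod 32) within its own half.
    return [values[base + (indices[base + i] % HALF)] for i in range(HALF)]

def subgroup_shuffle_wave64(values, indices):
    # Stand-in for permlane64: swap the two 32-lane halves wholesale.
    swapped = values[HALF:] + values[:HALF]

    # bpermute within each lane's own half, on both the original and
    # the swapped data.
    same = bpermute_half(values, indices, 0) + bpermute_half(values, indices, HALF)
    other = bpermute_half(swapped, indices, 0) + bpermute_half(swapped, indices, HALF)

    # Per-lane select: take the same-half result iff the lane and the
    # index it wants to pull from are on the same side of 32.
    return [same[lane] if (lane < HALF) == (indices[lane] < HALF) else other[lane]
            for lane in range(WAVE)]
```

With this model, any wave-wide index pattern reduces to two half-local bpermutes plus one select per lane.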
OK, I am starting to understand that the set.inactive calls are essentially dummies that limit the size of the WWM region. They are not expected to generate any code.
A couple of minor optimizations:
- I think you can apply amdgcn.wwm to Swapped instead of BPermOtherHalf. This should make the WWM region even smaller, but probably won't improve codegen.
- You don't need the call to amdgcn.mbcnt.hi. With an all-ones mask, amdgcn.mbcnt.lo already returns 32 exactly for lanes >= 32 (and the lane index otherwise), which is all you need here.
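To illustrate the mbcnt.lo point, here is a small Python model of my reading of the mbcnt semantics (an illustration under that assumption, not checked against hardware):

```python
def mbcnt_lo(lane, mask=0xFFFFFFFF):
    # Model of amdgcn.mbcnt.lo: counts the set bits of `mask` that
    # correspond to lanes below `lane`, considering only the low 32 lanes.
    # With an all-ones mask this is min(lane, 32).
    return bin(mask & ((1 << min(lane, 32)) - 1)).count("1")

def in_high_half(lane):
    # Bit 5 of the mbcnt.lo result already distinguishes the two halves,
    # so the mbcnt.hi step can be dropped for this check.
    return (mbcnt_lo(lane) & 32) != 0
```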
https://github.com/llvm/llvm-project/pull/167372