[llvm] [AMDGPU] Extend wave reduce intrinsics for i32 type (PR #126469)

Wed Apr 23 21:35:19 PDT 2025

pravinjagtap wrote:

> > I'd like to flag that these - or separate - intrinsics should have a clustered reduce mode, such that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...".
> > Otherwise, higher-level code like LLPC or (in development) MLIR will need to implement that logic itself
> 
> That's basically the `width` argument to the CUDA `__shfl` intrinsic, right? We could reasonably add that as an argument to this.

Right, the `unsigned __reduce_add_sync(unsigned mask, unsigned value)` builtins are already implemented. Here `mask` is a divergent value that identifies the subgroups in a wave, and the reduction is performed within each of these subgroups.
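
To illustrate the mask-based semantics described above, here is a minimal host-side C++ sketch. It is only a model of the behavior, not the AMDGPU lowering or the builtin's implementation. The names `simulateReduceAdd` and `clusterMasks` are hypothetical, and a wave size of 32 is assumed.

```cpp
#include <array>
#include <cstdint>

// Assumption for this sketch: a 32-lane wave.
constexpr int kWaveSize = 32;

// Hypothetical host-side model of a mask-based add reduction:
// lane i supplies masks[i] and values[i], and receives the sum of
// values[j] over every lane j whose bit is set in masks[i].
std::array<uint32_t, kWaveSize> simulateReduceAdd(
    const std::array<uint32_t, kWaveSize>& masks,
    const std::array<uint32_t, kWaveSize>& values) {
  std::array<uint32_t, kWaveSize> result{};
  for (int lane = 0; lane < kWaveSize; ++lane) {
    uint32_t sum = 0;
    for (int other = 0; other < kWaveSize; ++other) {
      if (masks[lane] & (1u << other)) sum += values[other];
    }
    result[lane] = sum;
  }
  return result;
}

// Hypothetical helper: builds the per-lane (divergent) masks for a
// clustered reduce, where each run of clusterSize consecutive lanes
// forms one subgroup. clusterSize is assumed to divide kWaveSize.
std::array<uint32_t, kWaveSize> clusterMasks(int clusterSize) {
  std::array<uint32_t, kWaveSize> masks{};
  for (int lane = 0; lane < kWaveSize; ++lane) {
    int base = (lane / clusterSize) * clusterSize;
    // Avoid shifting a 32-bit value by 32, which is undefined.
    uint32_t bits = (clusterSize == kWaveSize)
                        ? 0xFFFFFFFFu
                        : ((1u << clusterSize) - 1u);
    masks[lane] = bits << base;
  }
  return masks;
}
```

With `clusterMasks(16)`, lanes 0-15 share the mask `0x0000FFFF` and lanes 16-31 share `0xFFFF0000`, which corresponds to the "first 16 lanes, second 16 lanes" clustering requested earlier in the thread.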

https://github.com/llvm/llvm-project/pull/126469