[llvm] [AMDGPU] Extend wave reduce intrinsics for i32 type (PR #126469)

Pravin Jagtap via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 23 21:35:19 PDT 2025


pravinjagtap wrote:

> > I'd like to flag that these - or separate - intrinsics should have a clustered reduce mode, such that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...".
> > Otherwise, higher-level code like LLPC or (in development) MLIR will need to implement that logic itself
> 
> That's basically the `width` argument to the CUDA `__shfl` intrinsic, right? We could reasonably add that as an argument to this.

Right, the `unsigned __reduce_add_sync(unsigned mask, unsigned value)` builtins are already implemented. Here `mask` is a divergent value that identifies the subgroups within a wave, and the reduction is performed within each subgroup.
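
To make the subgroup semantics concrete, here is a minimal host-side C++ model of a masked add-reduction. It is only a sketch of the behaviour described above, not the AMDGPU lowering or the builtin itself: the helper names `clusterMasks` and `maskedReduceAdd` are hypothetical, and the 32-lane wave size is an assumption made for the illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption for this sketch: a 32-lane wave (wave64 would need 64-bit masks).
constexpr unsigned kWaveSize = 32;
using Wave = std::array<std::uint32_t, kWaveSize>;

// Hypothetical helper: builds the per-lane mask for clusters of `clusterSize`
// adjacent lanes. Each lane's mask has a bit set for every lane in its cluster.
Wave clusterMasks(unsigned clusterSize) {
  assert(clusterSize > 0 && kWaveSize % clusterSize == 0);
  Wave masks{};
  for (unsigned lane = 0; lane < kWaveSize; ++lane) {
    unsigned first = lane / clusterSize * clusterSize;  // first lane of the cluster
    // clusterSize consecutive set bits, shifted to start at `first`.
    std::uint64_t bits = ((std::uint64_t{1} << clusterSize) - 1) << first;
    masks[lane] = static_cast<std::uint32_t>(bits);
  }
  return masks;
}

// Hypothetical model of a masked add-reduction: every lane receives the sum of
// the values held by the lanes named in its own (possibly divergent) mask.
Wave maskedReduceAdd(const Wave& masks, const Wave& values) {
  Wave result{};
  for (unsigned lane = 0; lane < kWaveSize; ++lane) {
    std::uint32_t sum = 0;
    for (unsigned src = 0; src < kWaveSize; ++src) {
      if (masks[lane] & (std::uint32_t{1} << src))
        sum += values[src];
    }
    result[lane] = sum;
  }
  return result;
}
```

With `values[i] = i` and `clusterMasks(16)`, lanes 0-15 each receive 120 and lanes 16-31 each receive 376, which is the "first 16 lanes, second 16 lanes" clustered mode asked about in the quoted comment.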

https://github.com/llvm/llvm-project/pull/126469

