[llvm] [AMDGPU] Extend wave reduce intrinsics for i32 type (PR #126469)
Pravin Jagtap via llvm-commits
llvm-commits at lists.llvm.org
Wed Apr 23 21:35:19 PDT 2025
pravinjagtap wrote:
> > I'd like to flag that these - or separate - intrinsics should have a clustered reduce mode, such that you can, say, do "the first 16 lanes get the reduction of their values, the second 16 lanes get the reduction of their values, ...".
> > Otherwise, higher-level code like LLPC or (in development) MLIR will need to implement that logic itself
>
> That's basically the `width` argument to the CUDA `__shfl` intrinsic, right? We could reasonably add that as an argument to this.
Right, the `unsigned __reduce_add_sync(unsigned mask, unsigned value)` builtins are already implemented. Here `mask` is a divergent value that identifies the subgroups in a wave, and the reduction is performed within each subgroup.
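
To make the subgroup semantics concrete, here is a rough host-side reference model in C++. It is only a sketch, not the builtin's actual lowering. It assumes each lane passes a mask whose set bits name the lanes in its own subgroup, and the names `clusteredReduceAdd` and `kWaveSize` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical wave size for illustration only (assumption: a 32-lane wave).
constexpr std::size_t kWaveSize = 32;

// Reference model, not the real implementation: every lane supplies its own
// mask (a divergent value). The set bits of that mask name the lanes in the
// lane's subgroup, and the lane receives the sum of those lanes' values.
std::array<std::uint32_t, kWaveSize>
clusteredReduceAdd(const std::array<std::uint32_t, kWaveSize> &mask,
                   const std::array<std::uint32_t, kWaveSize> &value) {
  std::array<std::uint32_t, kWaveSize> result{};
  for (std::size_t lane = 0; lane < kWaveSize; ++lane) {
    std::uint32_t sum = 0;
    for (std::size_t other = 0; other < kWaveSize; ++other) {
      // Accumulate only the lanes that belong to this lane's subgroup.
      if ((mask[lane] >> other) & 1u)
        sum += value[other];
    }
    result[lane] = sum;
  }
  return result;
}
```

For the "first 16 lanes, second 16 lanes" case from the quoted comment, lanes 0-15 would pass `0x0000FFFF` and lanes 16-31 would pass `0xFFFF0000`, so each half of the wave gets the sum over its own 16 values.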
https://github.com/llvm/llvm-project/pull/126469