[libc-commits] [clang] [libc] [Clang] Add width handling for <gpuintrin.h> shuffle helper (PR #125896)
Artem Belevich via libc-commits
libc-commits at lists.llvm.org
Wed Feb 5 11:06:23 PST 2025
================
@@ -149,22 +149,23 @@ _DEFAULT_FN_ATTRS static __inline__ void __gpu_sync_lane(uint64_t __lane_mask) {
// Shuffles the the lanes inside the warp according to the given index.
_DEFAULT_FN_ATTRS static __inline__ uint32_t
-__gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x) {
+__gpu_shuffle_idx_u32(uint64_t __lane_mask, uint32_t __idx, uint32_t __x,
+ uint32_t __width) {
uint32_t __mask = (uint32_t)__lane_mask;
- return __nvvm_shfl_sync_idx_i32(__mask, __x, __idx, __gpu_num_lanes() - 1u);
+ return __nvvm_shfl_sync_idx_i32(__mask, __x, __idx,
+ ((__gpu_num_lanes() - __width) << 8u) | 0x1f);
----------------
Artem-B wrote:
IIUC, the `0x1f` replaces the original `__gpu_num_lanes() - 1u` and assumes that the warp width will never change.
The bits 8-12 contain "mask for logically splitting warps".
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-shfl-sync
How exactly does `(__gpu_num_lanes() - __width)` create a mask? AFAICT it only works when `__width == __gpu_num_lanes()`, where the value is 0, or when `__width == 1`, which does produce a mask covering all `__gpu_num_lanes()` lanes, but I don't think that's the intent.
Was it supposed to be `((__gpu_num_lanes() - __width) - 1) << 8u`? That would not be right either: with the default `__width == __gpu_num_lanes()` we'd end up with `(0-1) << 8` and have all upper bits set.
Either I'm confused, or the code as written has a bug.
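To make the arithmetic in question concrete, here is a minimal host-side sketch that evaluates the two expressions discussed above. The helper names `shuffle_control`, `segment_mask`, and `alt_shifted` are hypothetical and not part of `<gpuintrin.h>`; the lane count is passed in as a plain parameter in place of `__gpu_num_lanes()`, and the power-of-two restriction on the width is an assumption of the sketch. It only shows what values the expressions produce, not whether they are what the instruction expects.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: evaluates the control operand from the patch,
   ((num_lanes - width) << 8) | 0x1f, for a given lane count and width. */
static uint32_t shuffle_control(uint32_t num_lanes, uint32_t width) {
  /* Assumption of this sketch: width is a power of two <= num_lanes. */
  assert(width != 0 && (width & (width - 1u)) == 0 && width <= num_lanes);
  return ((num_lanes - width) << 8u) | 0x1fu;
}

/* Hypothetical helper: extracts bits 8-12 of the control operand, the
   field quoted above as the "mask for logically splitting warps". */
static uint32_t segment_mask(uint32_t control) {
  return (control >> 8u) & 0x1fu;
}

/* Hypothetical helper: the alternative raised in the review,
   ((num_lanes - width) - 1) << 8, evaluated in unsigned 32-bit arithmetic. */
static uint32_t alt_shifted(uint32_t num_lanes, uint32_t width) {
  return ((num_lanes - width) - 1u) << 8u;
}
```

With 32 lanes, `segment_mask(shuffle_control(32, w))` evaluates to `0b00000`, `0b10000`, `0b11000`, and `0b11111` for `w` of 32, 16, 8, and 1 respectively, while `alt_shifted(32, 32)` wraps around to `0xFFFFFF00`.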
https://github.com/llvm/llvm-project/pull/125896