[clang] [AMDGPU] Add clang builtin for generic AMDGPU shuffle (PR #185302)

Sun Mar 8 10:58:43 PDT 2026

jhuber6 wrote:

While working on the documentation I've become quite confused. I'd expect this to have the same semantics as `ds_bpermute`, but it doesn't seem to. Consider this test:
```c
#include <gpuintrin.h>
#include <stdio.h>

int main() {
  int x = __gpu_num_lanes() - __gpu_lane_id();
  int y = __builtin_amdgcn_wave_shuffle((__gpu_lane_id() + 1), x);
  int z = __builtin_amdgcn_ds_bpermute((__gpu_lane_id() + 1) << 2, x);
  fprintf(stderr, "lane %d: %d %d %d\n", __gpu_lane_id(), x, y, z);
}
```
```console
$ ./bin/clang shuffle.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc ./lib/amdgcn-amd-amdhsa/crt1.o -O2 && llvm-gpu-loader --threads 32 a.out
lane 0: 32 1 31
lane 1: 31 32 30
lane 2: 30 31 29
lane 3: 29 30 28
lane 4: 28 29 27
lane 5: 27 28 26
lane 6: 26 27 25
lane 7: 25 26 24
lane 8: 24 25 23
lane 9: 23 24 22
lane 10: 22 23 21
lane 11: 21 22 20
lane 12: 20 21 19
lane 13: 19 20 18
lane 14: 18 19 17
lane 15: 17 18 16
lane 16: 16 17 15
lane 17: 15 16 14
lane 18: 14 15 13
lane 19: 13 14 12
lane 20: 12 13 11
lane 21: 11 12 10
lane 22: 10 11 9
lane 23: 9 10 8
lane 24: 8 9 7
lane 25: 7 8 6
lane 26: 6 7 5
lane 27: 5 6 4
lane 28: 4 5 3
lane 29: 3 4 2
lane 30: 2 3 1
lane 31: 1 2 32
```
This shows the two results differ, and the wrap-around behavior does not match what the comments state. @saxlungs, is this tested anywhere? What's the expected behavior? I've also opened https://github.com/llvm/llvm-project/issues/185303 for an issue I encountered with this.
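
For reference, the columns above can be checked against a small host-side model. This is only a sketch under stated assumptions, not the builtin's documented behavior. `model_bpermute` assumes `ds_bpermute` reads the data of lane `(addr >> 2)`, wrapped to the wave size. `model_shuffle_value_first` is a hypothetical reading in which the first operand is the value and the second is the lane index (also wrapped); it is inferred purely from the printed output and not confirmed anywhere in this thread. All helper names are made up for illustration.

```c
#include <stdio.h>

#define WAVE_SIZE 32 /* matches --threads 32 in the run above */

/* Assumed ds_bpermute model: each lane reads the data held by the lane
   selected by its byte address (addr / 4), wrapped to the wave size. */
static int model_bpermute(const int addr[WAVE_SIZE],
                          const int data[WAVE_SIZE], int lane) {
  return data[(addr[lane] >> 2) % WAVE_SIZE];
}

/* Hypothetical wave_shuffle model (an assumption, not confirmed):
   first operand is the value, second operand is the source lane index. */
static int model_shuffle_value_first(const int value[WAVE_SIZE],
                                     const int index[WAVE_SIZE], int lane) {
  return value[index[lane] % WAVE_SIZE];
}

/* Build the per-lane operands used by the test:
   x = num_lanes - lane_id, next = lane_id + 1, addr = next << 2. */
static void fill_inputs(int x[WAVE_SIZE], int next[WAVE_SIZE],
                        int addr[WAVE_SIZE]) {
  for (int lane = 0; lane < WAVE_SIZE; ++lane) {
    x[lane] = WAVE_SIZE - lane;
    next[lane] = lane + 1;
    addr[lane] = next[lane] << 2;
  }
}

/* Print one line per lane in the same "lane: x y z" layout as the test. */
static void print_model(void) {
  int x[WAVE_SIZE], next[WAVE_SIZE], addr[WAVE_SIZE];
  fill_inputs(x, next, addr);
  for (int lane = 0; lane < WAVE_SIZE; ++lane)
    printf("lane %d: %d %d %d\n", lane, x[lane],
           model_shuffle_value_first(next, x, lane),
           model_bpermute(addr, x, lane));
}
```

Calling `print_model()` from a `main` gives lines in the same layout as the console output. Under these assumptions the `z` column matches a wrapping `ds_bpermute`, and the `y` column is what results if the two operands of the shuffle are in the opposite order from `ds_bpermute`. Whether that is the intended behavior is exactly the open question.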

https://github.com/llvm/llvm-project/pull/185302