[clang] [CUDA] Change 'activemask' to use 'nvvm_activemask()' (PR #79892)

Mon Jan 29 14:58:45 PST 2024

jhuber6 wrote:

I've actually encountered some really strange behavior when trying to update `libc` to use the new intrinsic. The function below returns a common 64-bit value so that it stays compatible with AMDGPU's 64-lane-wide mode. When I run this against the test suite, it fails on the tests that specifically check against divergence.

This works
```c++
[[clang::convergent, gnu::noinline]] uint64_t get_lane_mask() {
  uint32_t mask;
  mask = __nvvm_activemask();
  return mask;
}
```

But this does not
```c++
[[clang::convergent, gnu::noinline]] uint64_t get_lane_mask() {
  return __nvvm_activemask();
}
```

If I check the PTX, the main difference seems to be the `cvt` instruction. Here's the output for the working and the broken version, respectively.

```asm
.weak .func  (.param .b64 func_retval0) _ZN22__llvm_libc_19_0_0_git3gpu13get_lane_maskEv()
{
  .reg .b32   %r<2>;
  .reg .b64   %rd<2>;

// %bb.0:                               // %entry
  activemask.b32  %r1;
  cvt.u64.u32   %rd1, %r1;
  st.param.b64  [func_retval0+0], %rd1;
  ret;
}
```

```asm
.weak .func  (.param .b64 func_retval0) _ZN22__llvm_libc_19_0_0_git3gpu13get_lane_maskEv()
{
  .reg .b32   %r<2>;
  .reg .b64   %rd<2>;

// %bb.0:                               // %entry
  activemask.b32  %r1;
  cvt.s64.s32   %rd1, %r1;
  st.param.b64  [func_retval0+0], %rd1;
  ret;
}
```

So, the version that works uses `cvt.u64.u32`, while the version that's broken uses `cvt.s64.s32`. This likely means the builtin is returning a "signed" value, so the conversion sign-extends it and treats it like a negative number when all threads are active: the 32-bit mask `0xFFFFFFFF` becomes `0xFFFFFFFFFFFFFFFF` instead of `0x00000000FFFFFFFF`. @Artem-B is there a correct way to assert that this is unsigned so it does the correct thing?
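
To make the suspected sign extension concrete, here is a minimal host-side sketch. It assumes the builtin's result type is a signed 32-bit integer, which is only a guess based on the `cvt.s64.s32` in the PTX. `fake_activemask`, `lane_mask_implicit`, and `lane_mask_via_u32` are hypothetical names made up for this illustration; none of them is part of clang or `libc`.

```c++
#include <cstdint>

// Hypothetical stand-in for the builtin, ASSUMED to return a signed 32-bit
// value. -1 has the bit pattern 0xFFFFFFFF, i.e. all 32 lanes active.
constexpr int32_t fake_activemask() { return -1; }

// Mirrors the broken version: the implicit int32_t -> uint64_t conversion
// sign-extends, which corresponds to cvt.s64.s32.
constexpr uint64_t lane_mask_implicit() { return fake_activemask(); }

// Mirrors the working version: converting to uint32_t first makes the
// widening step a zero extension, which corresponds to cvt.u64.u32.
constexpr uint64_t lane_mask_via_u32() {
  return static_cast<uint32_t>(fake_activemask());
}

// The upper 32 bits are set in the implicit version...
static_assert(lane_mask_implicit() == 0xFFFFFFFFFFFFFFFFull,
              "implicit conversion sign-extends");
// ...and clear when going through uint32_t.
static_assert(lane_mask_via_u32() == 0x00000000FFFFFFFFull,
              "conversion through uint32_t zero-extends");
```

If that assumption holds, an explicit `static_cast<uint32_t>(...)` around the builtin call should behave like the temporary `uint32_t mask` in the working version, although I have only reasoned about this on the host and not checked the generated PTX.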

https://github.com/llvm/llvm-project/pull/79892
