[clang] [compiler-rt] [llvm] [PGO][AMDGPU] Add offload profiling with uniformity-aware optimization (PR #177665)

Mon Mar 9 17:06:30 PDT 2026

yxsamliu wrote:

> This is a structural question, but what's stopping us from building an actual library for the GPU portion? This PR seems to code the warp-aggregate increment in-line while it could probably be ~10 lines of portable C code using `gpuintrin.h`. Now that HIP is on the new offloading driver it should be trivial to just link in when we pass this to the linker-wrapper.
> 
> Something like this:
> 
> ```c
> #include <gpuintrin.h>
> #include "InstrProfiling.h"
> 
> void __llvm_profile_instrument_gpu(uint64_t *counter, uint64_t step) {
>   uint64_t mask = __gpu_lane_mask();
>   if (__gpu_is_first_in_lane(mask))
>     __scoped_atomic_fetch_add(counter, step * __builtin_popcountg(mask),
>                               __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE);
> }
> ```
> 
> The global loading / accessing could probably be abstracted further, and I'm also wondering if we shouldn't make the OpenMP handling do this as well.
> 
> I could try experimenting with building whichever `InstrProfiling*` files already work with the existing build, if that would help.

I think this is a great idea. Can you open a PR for it? I can rebase my patch on that PR. Thanks.
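
For reference, here is a rough host-side sketch of why the aggregated form should give the same count as a per-lane increment. It is only a simulation on the CPU, not the `gpuintrin.h` code quoted above: the helper names (`per_lane_increment`, `aggregated_increment`, `check_equivalence`) are hypothetical, the 64-bit `mask` stands in for the active-lane mask, and no real atomics are involved.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: every active lane in the mask adds `step` on its own,
 * which is what a naive per-thread counter update would do. */
static uint64_t per_lane_increment(uint64_t counter, uint64_t mask,
                                   uint64_t step) {
  for (int lane = 0; lane < 64; ++lane)
    if (mask & (UINT64_C(1) << lane))
      counter += step;
  return counter;
}

/* Hypothetical model: a single lane adds step * (number of active lanes),
 * mirroring the single update issued by the first active lane. */
static uint64_t aggregated_increment(uint64_t counter, uint64_t mask,
                                     uint64_t step) {
  if (mask != 0) /* no active lanes means nobody issues the update */
    counter += step * (uint64_t)__builtin_popcountll(mask);
  return counter;
}

/* Both models must agree for any mask and step. */
static void check_equivalence(uint64_t mask, uint64_t step) {
  assert(per_lane_increment(0, mask, step) ==
         aggregated_increment(0, mask, step));
}
```

The point of the aggregated form is that it issues one memory update per wave instead of one per active lane, while leaving the final counter value unchanged.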

https://github.com/llvm/llvm-project/pull/177665