[libc-commits] [clang] [compiler-rt] [libc] [llvm] [PGO][AMDGPU] Add offload profiling with uniformity-aware optimization (PR #177665)

Thu Apr 2 20:00:02 PDT 2026

================
@@ -42,14 +43,39 @@ COMPILER_RT_VISIBILITY void __llvm_profile_instrument_gpu(uint64_t *counter,
   }
 }
 
+// Block-level sampling for offload PGO. For GPU kernels with stationary
+// behavior (where all thread blocks execute the same code paths regardless of
+// block ID), partial sampling significantly reduces instrumentation overhead
+// without losing PGO performance gains.
+//
+// Returns 1 if this block should be instrumented, 0 to skip. Samples by
+// matching lower bits of the linearized 3D block ID to zero.
+//   sampling_bits=0: all blocks (100%)
+//   sampling_bits=3: every 8th block (12.5%, default)
+COMPILER_RT_VISIBILITY int __llvm_profile_sampling_gpu(uint32_t sampling_bits) {
+  if (sampling_bits == 0)
+    return 1;
+
+  uint32_t gdx = __gpu_num_blocks_x();
+  uint32_t gdy = __gpu_num_blocks_y();
+  uint32_t block_id = __gpu_block_id_x() + __gpu_block_id_y() * gdx +
+                      __gpu_block_id_z() * gdx * gdy;
+
+  uint32_t mask = (1u << sampling_bits) - 1;
+  return (block_id & mask) == 0;
+}
+
 #if defined(__AMDGPU__)
+__attribute__((weak)) const int __oclc_ABI_version = 600;
----------------
jhuber6 wrote:

Why do we need this?

https://github.com/llvm/llvm-project/pull/177665
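
For readers following the hunk above, the block-sampling predicate in `__llvm_profile_sampling_gpu` can be checked on the host. This is only a sketch under stated assumptions, not code from the PR: `linear_block_id`, `should_sample`, and `count_sampled` are hypothetical names, and the block indices and grid dimensions are passed in as plain arguments instead of being read from the `__gpu_*` builtins.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side stand-in: linearize a 3D block index the same
   way the hunk does (x fastest, then y, then z). */
static uint32_t linear_block_id(uint32_t bx, uint32_t by, uint32_t bz,
                                uint32_t gdx, uint32_t gdy) {
  return bx + by * gdx + bz * gdx * gdy;
}

/* Mirrors the predicate in the hunk: a block is sampled when the low
   `sampling_bits` bits of its linearized ID are all zero. */
static int should_sample(uint32_t block_id, uint32_t sampling_bits) {
  /* Assumption of this sketch: shifting a 32-bit value by 32 or more is
     undefined in C, so reject such inputs here. */
  assert(sampling_bits < 32);
  if (sampling_bits == 0)
    return 1;
  uint32_t mask = (1u << sampling_bits) - 1;
  return (block_id & mask) == 0;
}

/* Count how many blocks of a gdx * gdy * gdz grid would be instrumented. */
static uint32_t count_sampled(uint32_t gdx, uint32_t gdy, uint32_t gdz,
                              uint32_t sampling_bits) {
  uint32_t sampled = 0;
  for (uint32_t z = 0; z < gdz; ++z)
    for (uint32_t y = 0; y < gdy; ++y)
      for (uint32_t x = 0; x < gdx; ++x)
        sampled += (uint32_t)should_sample(
            linear_block_id(x, y, z, gdx, gdy), sampling_bits);
  return sampled;
}
```

With a 4 x 2 x 2 grid (16 blocks) and `sampling_bits=3`, only linearized IDs 0 and 8 pass the mask test, which matches the "every 8th block" description in the comment. When the grid has fewer than `1 << sampling_bits` blocks, block 0 is the only one sampled.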