[clang] [compiler-rt] [llvm] [PGO][HIP] HSA-introspection device profile drain + GPU PGO tests (PR #203056)

Wed Jun 17 08:44:18 PDT 2026

https://github.com/lfmeadow updated https://github.com/llvm/llvm-project/pull/203056

>From d083235c13129f8681719496716c368e88ae0587 Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Sun, 14 Jun 2026 14:18:38 -0500
Subject: [PATCH 1/7] [PGO][HIP] Add supplemental HSA-introspection device
 drain and GPU PGO tests

The host-shadow device-profile drain (InstrProfilingPlatformROCm.cpp) can only
collect device counters for kernels that registered a host-side shadow via
__hipRegisterVar. Device-linked programs (e.g. RCCL) link the instrumented code
object directly into the device image with no host shadow, so their counters are
never drained.

Add a supplemental, Linux-only drain that introspects the loaded code objects via
the HSA runtime: it walks each GPU agent, enumerates only the code objects
actually resident there, reads each one's __llvm_profile_sections, and routes
them through the existing processDeviceOffloadPrf() path so the emitted profraw
layout is identical. A small content-dedup set keyed on the
(data, counters, names) device-pointer triple ensures a section already drained
by the host-shadow pass is not drained again, so the two passes compose without
double-counting. HSA is brought up lazily from the drain (never from a library
constructor) to avoid poisoning fork-based callers.

Because the HSA walk only ever touches resident code objects, it also makes the
host-shadow pass's collect-all fallback unnecessary on Linux: when no kernel
launch was tracked (a program that never launches, collects before its first
launch, or launches only via an untracked API), the host-shadow pass is skipped
and the HSA drain covers it safely instead of faulting/hanging on a non-resident
device on a multi-GPU host.

Link the device profile runtime on both the new-offload-driver
(LinkerWrapper::ConstructJob) and traditional (HIPAMD constructLldCommand) link
paths so instrumented device images resolve the runtime symbols.

On the host side, link clang_rt.profile_rocm -- the self-contained runtime
variant that carries the device-counter drain and the hipModuleLoad interceptor
-- for any instrumented host link on a ROCm-equipped toolchain, not only for
links with an active HIP offload action (OFK_HIP). HIP host code is frequently
linked into a shared library or executable from pre-compiled objects (e.g.
RCCL's librccl.so is linked from .o inputs by a plain clang++ -shared); such a
link carries no OFK_HIP yet still needs the drain, and gating on it left those
.profraw files with host counters only. profile_rocm is emitted ahead of the
base clang_rt.profile (which stays inert), guarded by an existence check that
leaves lean toolchains unchanged, and both the interceptor and the drain
self-skip when the process has no resident device code.

Also add a GPU-executed test suite (compiler-rt/test/profile/{GPU,AMDGPU}/*.hip)
and a dependency-free "lit-lite" runner (run_gpu_tests.py) so the device drain
can be exercised on a real AMD GPU runner: basic/coverage/pgo-use, multi-kernel,
device-branching, multi-GPU and non-default-device drain + dedup, early-collect /
no-kernel edges, RDC vs non-RDC __llvm_profile_sections, fork-safety (the RCCL
parent-no-HIP / kernel-in-forked-child pattern), quantitative device-counter
correctness, multi-process offline accumulation, and explicit-collect
idempotency. A standalone device-pgo/ build helper reproduces the toolchain
locally. The object-only host-link path is exercised by
clang/test/Driver/hip-profile-rocm-runtime.hip.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 clang/lib/Driver/ToolChains/HIPAMD.cpp        |  20 +
 .../profile/InstrProfilingPlatformROCm.cpp    | 682 ++++++++++++++++--
 .../test/profile/AMDGPU/device-basic.hip      |  67 ++
 .../profile/AMDGPU/device-early-collect.hip   |  68 ++
 .../test/profile/AMDGPU/device-no-kernel.hip  |  44 ++
 .../test/profile/AMDGPU/device-symbols.hip    |  42 ++
 .../test/profile/AMDGPU/lit.local.cfg.py      |   4 +
 .../test/profile/GPU/instrprof-hip-basic.hip  |  51 ++
 .../GPU/instrprof-hip-collect-after.hip       |  63 ++
 .../GPU/instrprof-hip-counter-correctness.hip |  56 ++
 .../profile/GPU/instrprof-hip-coverage.hip    |  51 ++
 .../GPU/instrprof-hip-device-branching.hip    |  67 ++
 .../profile/GPU/instrprof-hip-fork-safety.hip |  61 ++
 .../profile/GPU/instrprof-hip-multi-gpu.hip   |  57 ++
 .../GPU/instrprof-hip-multi-process-merge.hip |  63 ++
 .../GPU/instrprof-hip-multiple-kernels.hip    |  58 ++
 .../GPU/instrprof-hip-nondefault-device.hip   |  60 ++
 .../profile/GPU/instrprof-hip-pgo-use.hip     |  63 ++
 compiler-rt/test/profile/device-pgo/README.md | 125 ++++
 compiler-rt/test/profile/device-pgo/build.sh  |  56 ++
 .../profile/device-pgo/toolchain-cache.cmake  |  55 ++
 compiler-rt/test/profile/run_gpu_tests.py     | 408 +++++++++++
 22 files changed, 2169 insertions(+), 52 deletions(-)
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-basic.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-early-collect.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-symbols.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
 create mode 100644 compiler-rt/test/profile/device-pgo/README.md
 create mode 100755 compiler-rt/test/profile/device-pgo/build.sh
 create mode 100644 compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
 create mode 100644 compiler-rt/test/profile/run_gpu_tests.py

diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 84664bcddbb94..45e71ac802a89 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
 #include "clang/Options/Options.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"
 
 using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
 
   LldArgs.push_back("--no-whole-archive");
 
+  // With PGO/coverage instrumentation, instrumented device code references the
+  // device profile runtime (__llvm_profile_instrument_gpu and the
+  // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+  // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+  // but HIP using the traditional offload path (e.g. on Windows, which does not
+  // route device linking through clang-linker-wrapper) reaches the device link
+  // here instead. Forward the static device profile runtime to this lld device
+  // link so the runtime is pulled in regardless of offload-driver/host OS. The
+  // archive is arch-suffixed, so pass its full path rather than a -l name.
+  if (ToolChain::needsProfileRT(Args)) {
+    std::string ProfileRT =
+        TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+    // Use the ToolChain VFS (matches the new-offload-driver path in
+    // Clang.cpp) so overlay/virtual filesystems used by the driver are
+    // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+    if (TC.getVFS().exists(ProfileRT))
+      LldArgs.push_back(Args.MakeArgString(ProfileRT));
+  }
+
   const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
   C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
                                          Lld, LldArgs, Inputs, Output));
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
 static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
                                    const OffloadSectionShadowGroup *Sections);
 
+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+                                    const void *Names);
+#endif
+
 static int isVerboseMode() {
   static int IsVerbose = -1;
   if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
 
   if (ret != 0) {
     PROF_ERR("%s\n", "failed to write device profile using shared API");
-  } else if (isVerboseMode()) {
-    PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+  } else {
+#if defined(__linux__) && !defined(_WIN32)
+    // Dedup against the supplemental HSA pass: this section is now drained, so
+    // the HSA walk must not drain the same device code object again.
+    profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+    if (isVerboseMode())
+      PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
   }
 
   return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
   return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
 }
 
-/* -------------------------------------------------------------------------- */
-/*  Collect device-side profile data                                          */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/*  Supplemental HSA-introspection drain (Linux only)                         */
+/*                                                                            */
+/*  The host-shadow drain above only sees device code objects registered      */
+/*  host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/*  hipModuleLoad* call. Device code linked by the offload device linker with */
+/*  no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/*  into a single kernel with no source module -- is invisible to it. This */
+/*  pass walks every GPU agent's loaded executables via HSA, finds each */
+/*  __llvm_profile_sections table directly on the device, and drains the ones */
+/*  the host-shadow pass did not already handle (deduped by the device */
+/*  section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/*  copy/relocate/write so the on-disk profraw layout is identical.           */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)
 
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
-  if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+  return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+  if (State)
+    return State > 0 ? 0 : -1;
+
+  if (!__interception::DynamicLoaderAvailable()) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "Dynamic library loading not available - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+  if (!Hsa)
+    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+  if (!Hsa) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  hsa_init_ty pHsaInit =
+      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+  hsa_system_get_major_extension_table_ty pGetExtTable =
+      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+          Hsa, "hsa_system_get_major_extension_table");
+  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+      Hsa, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+      Hsa, "hsa_agent_get_info");
+  pHsaExecIterAgentSyms =
+      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_iterate_agent_symbols");
+  pHsaSymGetInfo =
+      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_symbol_get_info");
+
+  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+    PROF_WARN("%s",
+              "required HSA symbols missing - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+   * rather than from the library constructor, so merely loading the
+   * instrumented library does not initialize HSA in the process -- which would
+   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+   * the parent (see the constructor note at the end of the HSA block). In the
+   * common case the drain runs from the profile write path while HSA is still
+   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+   * this simply re-initializes HSA (the process is exiting anyway). */
+  prof_hsa_status_t St = pHsaInit();
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    if (isVerboseMode())
+      PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St);
+    return setHsaRuntimeState(-1);
+  }
+
+  prof_hsa_loader_pfn_t LoaderApi;
+  __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi));
+  St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi),
+                    &LoaderApi);
+  if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) {
+    PROF_WARN("AMD loader extension unavailable (0x%x) - "
+              "HSA device profiling disabled\n",
+              St);
+    return setHsaRuntimeState(-1);
+  }
+  pQuerySegDescs = LoaderApi.query_segment_descriptors;
+
+  /* The device-to-host copies go through the shared HIP loader. */
+  ensureHipLoaded();
+  if (!pHipMemcpy) {
+    PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n");
+  return setHsaRuntimeState(1);
+}
+
+/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */
+static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
+
+/* Dedup of drained section-bounds tuples, shared with the host-shadow path
+ * (processDeviceOffloadPrf records here on every successful drain). A single
+ * linked device code object exposes one __llvm_profile_sections, but the same
+ * bounds may be seen via multiple agents, so each unique counter set is
+ * drained exactly once across both paths. */
+namespace {
+struct ProfBoundsTuple {
+  const void *data;
+  const void *cnts;
+  const void *names;
+};
+} // namespace
+
+#define PROF_MAX_SEEN_BOUNDS 256
+static ProfBoundsTuple SeenBounds[PROF_MAX_SEEN_BOUNDS];
+static int NumSeenBounds = 0;
+
+/* Pure check: has this bounds tuple already been drained? Does not mutate
+ * state, so a transient failure does not permanently suppress retries. */
+static int profBoundsAlreadyDrained(const void *D, const void *C,
+                                    const void *N) {
+  for (int i = 0; i < NumSeenBounds; ++i)
+    if (SeenBounds[i].data == D && SeenBounds[i].cnts == C &&
+        SeenBounds[i].names == N)
+      return 1;
+  return 0;
+}
+
+/* Record a drained bounds tuple. Idempotent. Called after a successful drain
+ * (either path) so a failed attempt stays retryable. */
+static void profRecordDrainedBounds(const void *D, const void *C,
+                                    const void *N) {
+  if (profBoundsAlreadyDrained(D, C, N))
+    return;
+  if (NumSeenBounds < PROF_MAX_SEEN_BOUNDS) {
+    SeenBounds[NumSeenBounds].data = D;
+    SeenBounds[NumSeenBounds].cnts = C;
+    SeenBounds[NumSeenBounds].names = N;
+    NumSeenBounds++;
+  }
+}
+
+#define PROF_MAX_GPU_AGENTS 64
+
+namespace {
+struct GpuAgent {
+  prof_hsa_agent_t agent;
+  char arch[64];
+};
+
+struct WalkState {
+  GpuAgent agents[PROF_MAX_GPU_AGENTS];
+  int num_agents;
+  int total_found;
+  int total_drained;
+};
+
+/* Per (agent, executable) symbol-iteration state. */
+struct SymbolState {
+  const char *arch;
+  int found;
+  int drained;
+};
+} // namespace
+
+/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable,
+ * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an
+ * earlier agent) already handled the same bounds. */
+static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
+                                  prof_hsa_executable_symbol_t Sym,
+                                  void *Data) {
+  SymbolState *S = (SymbolState *)Data;
+
+  prof_hsa_symbol_kind_t Kind;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      Kind != PROF_HSA_SYMBOL_KIND_VARIABLE)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint32_t NameLen = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
+                     &NameLen) != PROF_HSA_STATUS_SUCCESS ||
+      NameLen != sizeof(ProfileSectionsSymbol) - 1)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  char NameBuf[64];
+  if (NameLen + 1 > sizeof(NameBuf))
+    return PROF_HSA_STATUS_SUCCESS;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) !=
+      PROF_HSA_STATUS_SUCCESS)
+    return PROF_HSA_STATUS_SUCCESS;
+  NameBuf[NameLen] = '\0';
+
+  if (strcmp(NameBuf, ProfileSectionsSymbol) != 0)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint64_t Addr = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+                     &Addr) != PROF_HSA_STATUS_SUCCESS ||
+      Addr == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  S->found++;
+
+  // Read the bounds table first to dedup (and detect empty sections) before
+  // the full copy/relocate done by processDeviceOffloadPrf.
+  __llvm_profile_gpu_sections Sec;
+  if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) {
+    PROF_WARN("%s", "failed to copy device bounds table\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+  if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart,
+                               Sec.NamesStart)) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "device bounds already drained, skipping\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart;
+  size_t CntsBytes =
+      (const char *)Sec.CountersStop - (const char *)Sec.CountersStart;
+  if (DataBytes == 0 || CntsBytes == 0) {
+    // Empty code object: nothing to write. Mark seen so we don't revisit it.
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  // Generate a collision-free target. Multiple distinct device code objects on
+  // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
+  static int DrainIndex = 0;
+  char Target[96];
+  if (DrainIndex == 0)
+    snprintf(Target, sizeof(Target), "%s", S->arch);
+  else
+    snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
+
+  // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
+  // Record the bounds (and advance the target index) only on success so a
+  // transient error stays retryable on a later agent or collect call.
+  int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
+  if (Rc == 0) {
+    S->drained++;
+    DrainIndex++;
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+  }
+
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
+  prof_hsa_device_type_t DevType;
+  if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      DevType != PROF_HSA_DEVICE_TYPE_GPU)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  WalkState *W = (WalkState *)Data;
+  if (W->num_agents >= PROF_MAX_GPU_AGENTS)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  GpuAgent &GA = W->agents[W->num_agents++];
+  GA.agent = Agent;
+  char Name[64];
+  __builtin_memset(Name, 0, sizeof(Name));
+  pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name);
+  size_t N = strnlen(Name, sizeof(GA.arch) - 1);
+  __builtin_memcpy(GA.arch, Name, N);
+  GA.arch[N] = '\0';
+  if (!GA.arch[0])
+    strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1);
+
+  if (isVerboseMode())
+    PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch);
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+/* Reentrancy guard and "drained data at least once" latch. The collect hook
+ * may run more than once (an explicit early __llvm_profile_write_file plus the
+ * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
+ * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
+ * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
+ * stay retryable so a later call can still pick up code objects loaded later.
+ * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
+ * destructor) from corrupting the global SeenBounds table. Both flags use
+ * acquire/release atomics. */
+static int HsaDrainInProgress = 0;
+static int HsaDrainCompleted = 0;
+
+static int drainDevicesViaHsa(void) {
+  if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE))
     return 0;
 
-  if (!isHipAvailable())
+  int Expected = 0;
+  if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1,
+                                   /*weak=*/0, __ATOMIC_ACQ_REL,
+                                   __ATOMIC_ACQUIRE))
     return 0;
 
-  int Ret = 0;
+  struct InProgressGuard {
+    ~InProgressGuard() {
+      __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE);
+    }
+  } _Guard;
 
-  /* Shadow variables (static-linked kernels): drain from every device. */
-  if (NumShadowVariables > 0) {
-    int OrigDevice = -1;
-    hipGetDevice(&OrigDevice);
+  if (loadHsaRuntimePointers() != 0)
+    return 0; /* Runtime unavailable: stay retryable. */
 
-    for (int Dev = 0; Dev < NumDevices; ++Dev) {
-      if (!shouldCollectDevice(Dev)) {
-        if (isVerboseMode())
-          PROF_NOTE("Skipping unused device %d\n", Dev);
-        continue;
+  WalkState W;
+  __builtin_memset(&W, 0, sizeof(W));
+  prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W);
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St);
+    return -1;
+  }
+  if (W.num_agents == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  /* query_segment_descriptors ships in every loader-extension version and is
+   * more permissive than iterate_executables on ROCm. It yields the loaded
+   * (agent, executable) pairs directly. */
+  size_t NumSegs = 0;
+  St = pQuerySegDescs(nullptr, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St);
+    return -1;
+  }
+  if (NumSegs == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  prof_hsa_loader_segment_descriptor_t *Segs =
+      (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs));
+  if (!Segs) {
+    PROF_ERR("%s\n", "failed to allocate segment descriptor array");
+    return -1;
+  }
+  UniqueFree SegsOwner(Segs);
+
+  St = pQuerySegDescs(Segs, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St);
+    return -1;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs);
+
+  /* Walk unique (agent, executable) pairs. */
+  enum { kMaxPairs = 512 };
+  uint64_t SeenAgents[kMaxPairs];
+  uint64_t SeenExecs[kMaxPairs];
+  int NumPairs = 0;
+  int IterFailures = 0;
+
+  for (size_t i = 0; i < NumSegs; ++i) {
+    if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0)
+      continue;
+
+    int Seen = 0;
+    for (int j = 0; j < NumPairs; ++j)
+      if (SeenAgents[j] == Segs[i].agent.handle &&
+          SeenExecs[j] == Segs[i].executable.handle) {
+        Seen = 1;
+        break;
       }
-      if (hipSetDevice(Dev) != 0) {
-        if (isVerboseMode())
-          PROF_NOTE("Failed to set device %d, skipping\n", Dev);
-        continue;
+    if (Seen)
+      continue;
+    if (NumPairs < kMaxPairs) {
+      SeenAgents[NumPairs] = Segs[i].agent.handle;
+      SeenExecs[NumPairs] = Segs[i].executable.handle;
+      NumPairs++;
+    }
+
+    const char *Arch = nullptr;
+    for (int k = 0; k < W.num_agents; ++k)
+      if (W.agents[k].agent.handle == Segs[i].agent.handle) {
+        Arch = W.agents[k].arch;
+        break;
       }
-      const char *ArchName = getDeviceArchName(Dev);
-      if (isVerboseMode())
-        PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
-                  ArchName);
-      for (int i = 0; i < NumShadowVariables; ++i) {
-        /* RDC-mode multi-shadow drains need a distinct profraw per TU;
-         * single-TU programs keep the bare arch target. */
-        const char *Target = ArchName;
-        char TargetWithIdx[64];
-        if (NumShadowVariables > 1) {
-          snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, i);
-          Target = TargetWithIdx;
+    if (!Arch)
+      continue; /* not a GPU agent we collected */
+
+    SymbolState S;
+    __builtin_memset(&S, 0, sizeof(S));
+    S.arch = Arch;
+    if (isVerboseMode())
+      PROF_NOTE("walking executable 0x%llx on %s\n",
+                (unsigned long long)Segs[i].executable.handle, Arch);
+    prof_hsa_status_t IterSt =
+        pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S);
+    if (IterSt != PROF_HSA_STATUS_SUCCESS &&
+        IterSt != PROF_HSA_STATUS_INFO_BREAK) {
+      PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx "
+                "failed (0x%x)\n",
+                (unsigned long long)Segs[i].executable.handle, IterSt);
+      IterFailures++;
+    }
+    W.total_found += S.found;
+    W.total_drained += S.drained;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d "
+              "iter-failures=%d\n",
+              W.num_agents, NumPairs, W.total_found, W.total_drained,
+              IterFailures);
+
+  /* Latch only when we actually drained data. Deliberately do NOT latch the
+   * "walked everything but found nothing new" case: an early collect call can
+   * run before any kernel launch, and latching it would suppress the real
+   * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
+  if (W.total_drained > 0)
+    __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
+  return (IterFailures > 0) ? -1 : 0;
+}
+
+/* NOTE: deliberately no library constructor that calls hsa_init() here.
+ * Bringing HSA up merely because the instrumented library was loaded poisons
+ * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
+ * HIP/HSA uninitialized in the parent and only touch HIP inside forked
+ * children. A parent that has already hsa_init()'d makes those children crash
+ * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
+ * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
+ * rationale there. */
+
+#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */
+
+/* -------------------------------------------------------------------------- */
+/*  Collect device-side profile data                                          */
+/* -------------------------------------------------------------------------- */
+
+extern "C" int __llvm_profile_hip_collect_device_data(void) {
+  int Ret = 0;
+
+  /* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows)
+   * and intercepted dynamic modules. Only meaningful when something registered
+   * host-side; skipped entirely for pure device-linked programs (RCCL), which
+   * the supplemental HSA pass below handles. */
+  if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable()) {
+    /* Shadow variables (static-linked kernels): drain from every device. */
+    if (NumShadowVariables > 0) {
+      int OrigDevice = -1;
+      hipGetDevice(&OrigDevice);
+
+      for (int Dev = 0; Dev < NumDevices; ++Dev) {
+        if (!shouldCollectDevice(Dev)) {
+          if (isVerboseMode())
+            PROF_NOTE("Skipping unused device %d\n", Dev);
+          continue;
+        }
+#if defined(__linux__) && !defined(_WIN32)
+        /* When no kernel launch was tracked at all, shouldCollectDevice()
+         * falls back to collect-all, which can fault/hang reading a
+         * non-resident device's sections on a multi-GPU host (e.g. a program
+         * that never launches, collects before its first launch, or launches
+         * only via an untracked API). On Linux the supplemental HSA drain
+         * below covers those cases safely -- it walks only code objects
+         * actually resident on each agent -- so skip the host-shadow pass
+         * entirely rather than take the unsafe fallback. */
+        if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) {
+          if (isVerboseMode())
+            PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n",
+                      Dev);
+          continue;
+        }
+#endif
+        if (hipSetDevice(Dev) != 0) {
+          if (isVerboseMode())
+            PROF_NOTE("Failed to set device %d, skipping\n", Dev);
+          continue;
+        }
+        const char *ArchName = getDeviceArchName(Dev);
+        if (isVerboseMode())
+          PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
+                    ArchName);
+        for (int i = 0; i < NumShadowVariables; ++i) {
+          /* RDC-mode multi-shadow drains need a distinct profraw per TU;
+           * single-TU programs keep the bare arch target. */
+          const char *Target = ArchName;
+          char TargetWithIdx[64];
+          if (NumShadowVariables > 1) {
+            snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName,
+                     i);
+            Target = TargetWithIdx;
+          }
+          if (processShadowVariable(i, Target) != 0)
+            Ret = -1;
         }
-        if (processShadowVariable(i, Target) != 0)
-          Ret = -1;
       }
-    }
 
-    if (OrigDevice >= 0)
-      hipSetDevice(OrigDevice);
-  }
+      if (OrigDevice >= 0)
+        hipSetDevice(OrigDevice);
+    }
 
-  /* Warn about unprocessed TUs; skip cleared slots (already drained). */
-  lockDynamicModules();
-  for (int i = 0; i < NumDynamicModules; ++i) {
-    OffloadDynamicModuleInfo *MI = &DynamicModules[i];
-    if (!MI->ModulePtr)
-      continue;
-    for (int t = 0; t < MI->NumTUs; ++t) {
-      if (!MI->TUs[t].Processed) {
-        PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
-                  MI->ModulePtr, t);
-        Ret = -1;
+    /* Warn about unprocessed TUs; skip cleared slots (already drained). */
+    lockDynamicModules();
+    for (int i = 0; i < NumDynamicModules; ++i) {
+      OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+      if (!MI->ModulePtr)
+        continue;
+      for (int t = 0; t < MI->NumTUs; ++t) {
+        if (!MI->TUs[t].Processed) {
+          PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
+                    MI->ModulePtr, t);
+          Ret = -1;
+        }
       }
     }
+    unlockDynamicModules();
   }
-  unlockDynamicModules();
+
+#if defined(__linux__) && !defined(_WIN32)
+  /* Supplemental HSA-introspection drain: catches device code objects with no
+   * host-side shadow (e.g. RCCL device-linked kernels). Runs after the
+   * host-shadow drain so already-drained sections are deduped out, and runs
+   * even when there are no host shadows at all (the common RCCL case). */
+  if (drainDevicesViaHsa() != 0)
+    Ret = -1;
+#endif
 
   if (Ret != 0)
     PROF_WARN("%s\n", "failed to collect device profile data");
diff --git a/compiler-rt/test/profile/AMDGPU/device-basic.hip b/compiler-rt/test/profile/AMDGPU/device-basic.hip
new file mode 100644
index 0000000000000..4fcf044802240
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-basic.hip
@@ -0,0 +1,67 @@
+// Basic HIP device PGO drain end-to-end: a host + device .profraw are written
+// at exit (the device one arch-prefixed), they merge, the merged profile
+// contains the device kernel's counters, and llvm-cov reports device-side
+// coverage. Covers both non-RDC and RDC device compiles.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/a.out
+// A device profraw (arch-prefixed) must have been drained alongside the host one.
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN:   | FileCheck --check-prefix=FUNCS %s
+// Confirm the embedded device image is extractable (failure here is the real
+// cause of any downstream llvm-cov failure, so let it propagate).
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-cov report %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   -instr-profile=%t.dir/a.profdata 2>&1 | FileCheck --check-prefix=COV %s
+
+// --- RDC ---
+// RUN: rm -f %t.dir/*.profraw
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/b.out
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/b.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/b.profdata \
+// RUN:   | FileCheck --check-prefix=FUNCS %s
+
+#include <hip/hip_runtime.h>
+
+__global__ void addk(int *p) {
+  if (*p > 0)
+    *p += 1;
+  else
+    *p -= 1;
+}
+
+int main() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  int h = 5;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  addk<<<1, 1>>>(d);
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+  return h > 0 ? 0 : 1;
+}
+
+// The merged profile contains both the host main and the device kernel,
+// proving the device counters were drained and merged.
+// FUNCS-DAG: addk
+// FUNCS-DAG: main
+
+// COV: TOTAL
diff --git a/compiler-rt/test/profile/AMDGPU/device-early-collect.hip b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
new file mode 100644
index 0000000000000..3e2c6e84e26c2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
@@ -0,0 +1,68 @@
+// M1 regression: calling __llvm_profile_hip_collect_device_data() before any
+// kernel has been launched must not poison the later atexit drain.  The early
+// call sees "no instrumented code object loaded yet" (a transient no-op) and
+// must not latch the drain as completed; otherwise the post-launch atexit
+// pass produces no device .profraw and we silently lose device counters.
+//
+// REQUIRES: hip, amdgpu
+// Guards the Linux introspection drain's DrainCompleted latch; the Windows
+// host-shadow drain has no such latch (it tracks per-TU Processed flags).
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/a.out
+// Both the host profraw and at least one device profraw (gfx-prefixed) must
+// have been produced, despite the early collection attempt.
+// RUN: ls %t.dir/host.*.profraw
+// RUN: ls %t.dir/gfx*.profraw
+// And the merged profile must contain the device kernel that was launched
+// *after* the early collect.
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+
+#include <hip/hip_runtime.h>
+
+// Declared by libclang_rt.profile-<host arch>.a; we call it directly to
+// simulate any caller that drains device counters at an arbitrary point in
+// the program lifetime (e.g. a per-iteration profile dump).
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void post_collect_kernel(int *p) {
+  if (*p > 0)
+    *p += 1;
+  else
+    *p -= 1;
+}
+
+int main() {
+  // (1) Early collection -- runs before any kernel launch.  The drainer
+  //     finds either no GPU agents, no loaded segments, or no instrumented
+  //     bounds table, and returns 0 without latching DrainCompleted.
+  (void)__llvm_profile_hip_collect_device_data();
+
+  // (2) Now launch a kernel.  HIP loads the device code object that carries
+  //     the __llvm_profile_sections bounds table, executes our kernel, and
+  //     populates the device-side counters.
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  int h = 5;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  post_collect_kernel<<<1, 1>>>(d);
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  // (3) Exit normally.  The atexit drain runs and -- because step (1) did
+  //     not latch DrainCompleted -- it walks the (now loaded) code object,
+  //     finds __llvm_profile_sections, and emits the device .profraw.
+  return h > 0 ? 0 : 1;
+}
+
+// CHECK-DAG: post_collect_kernel
+// CHECK-DAG: main
diff --git a/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
new file mode 100644
index 0000000000000..a154308d725d8
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
@@ -0,0 +1,44 @@
+// Independence / robustness: an instrumented HIP program that never launches a
+// kernel still writes its host .profraw, and the device drain is a clean no-op
+// (no crash, no spurious device .profraw). We assert the no-op condition
+// directly via the runtime's verbose log rather than rely on HIP lazy-loading
+// to leave the device code object unloaded -- the loader may load it for
+// other reasons (e.g. eager registration), and in that case the drain
+// legitimately walks it and reports zero instrumented sections / zero
+// drained. Either outcome is correct.
+//
+// REQUIRES: hip, amdgpu
+// The terminal conditions checked below ("no GPU agents", "no loaded
+// segments", "drained=0") are Linux HSA-drain strings with no Windows analog.
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 \
+// RUN:   %t.dir/a.out 2> %t.dir/verbose.log
+// RUN: ls %t.dir/host.*.profraw
+// No arch-prefixed device profraw should have been produced.
+// RUN: not ls %t.dir/gfx*.profraw
+// The drain must have run; one of these three terminal conditions must hold:
+//   - no GPU agents enumerated (test host has /dev/kfd but no usable agent)
+//   - no loaded code object segments at exit
+//   - the walk completed and drained=0 (no instrumented kernel was launched
+//     so the device code object either wasn't loaded or its bounds were
+//     empty/already drained)
+// RUN: FileCheck --input-file=%t.dir/verbose.log %s
+// CHECK: {{no GPU agents present|no loaded segments|drained=0}}
+
+#include <hip/hip_runtime.h>
+
+// Defined but never launched.
+__global__ void unused(int *p) { *p += 1; }
+
+int main() {
+  int n = 0;
+  (void)hipGetDeviceCount(&n);
+  return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/device-symbols.hip b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
new file mode 100644
index 0000000000000..f12283b7da636
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
@@ -0,0 +1,42 @@
+// The decoupled drain reads only the canonical __llvm_profile_sections bounds
+// table provided by the device profile runtime (InstrProfilingPlatformGPU.c),
+// since clang no longer emits a per-TU struct. Assert that symbol is present
+// in the device ELF's dynamic symbol table (protected visibility) for both
+// non-RDC and RDC device compiles. This is the contract the drainer depends on.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// Extraction failure here would make the readelf invocation succeed against
+// an empty/missing file; surface it instead of hiding it behind `|| true`.
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   | FileCheck %s
+
+// --- RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: llvm-objdump --offloading %t.dir/b.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/b.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   | FileCheck %s
+
+// CHECK: PROTECTED {{.*}} __llvm_profile_sections
+
+#include <hip/hip_runtime.h>
+
+__global__ void k(int *p) { *p += 1; }
+
+int main() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  k<<<1, 1>>>(d);
+  (void)hipFree(d);
+  return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
new file mode 100644
index 0000000000000..5148dd6b9e2f2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
@@ -0,0 +1,4 @@
+# Device-profile drain tests: require an AMD GPU (and, implicitly, the amdgcn
+# device profile runtime in the resource directory and a ROCm/HIP install).
+if "amdgpu" not in config.available_features:
+    config.unsupported = True
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
new file mode 100644
index 0000000000000..8cbe7c970052c
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
@@ -0,0 +1,51 @@
+// Test basic HIP PGO instrumentation and profile collection.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: ls %t.dir/prof.profraw
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// PROF: _Z6squarePiPKii
+// PROF: main
+// PROF: Functions shown: 2
+// PROF: Total functions: 2
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void square(int *out, const int *in, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        out[idx] = in[idx] * in[idx];
+}
+
+int main() {
+    constexpr int N = 64;
+    int h_in[N], h_out[N];
+    for (int i = 0; i < N; ++i) h_in[i] = i;
+
+    int *d_in, *d_out;
+    (void)hipMalloc(&d_in, N * sizeof(int));
+    (void)hipMalloc(&d_out, N * sizeof(int));
+    (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+
+    square<<<1, N>>>(d_out, d_in, N);
+
+    (void)hipMemcpy(h_out, d_out, N * sizeof(int), hipMemcpyDeviceToHost);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h_out[i] != i * i) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    (void)hipFree(d_in);
+    (void)hipFree(d_out);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
new file mode 100644
index 0000000000000..5a2393f8dcc47
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
@@ -0,0 +1,63 @@
+// Explicit-collect idempotency: a program that calls
+// __llvm_profile_hip_collect_device_data() itself *after* a launch (e.g. a
+// periodic profile dump) and then also exits normally must not double-count the
+// device counters. The explicit drain and the atexit drain write the same
+// arch-named device profraw, so the merged profile must reflect a single launch
+// (function count 64, even-branch 32), not two. Complements device-early-collect
+// (which covers a collect *before* the first launch).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// A single launch of 64 threads, drained twice (explicit + atexit), must still
+// merge to exactly one launch's worth of counts.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+
+  // Explicit mid-program collect after the launch. The atexit drain runs again
+  // at exit; together they must not double the device counters.
+  (void)__llvm_profile_hip_collect_device_data();
+
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
new file mode 100644
index 0000000000000..c2bfc9ac9dc66
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
@@ -0,0 +1,56 @@
+// Quantitative device-counter correctness: the drained device profile must carry
+// the *exact* per-region execution counts produced by the kernel, not merely
+// "some counts are present". A single launch of 64 threads over classify() must
+// record a function entry count of 64 and an even-branch block count of 32
+// (idx % 2 == 0 holds for exactly half of idx in [0, 64)). This pins the drain +
+// dedup path against silent under/over-counting (e.g. a dedup bug that dropped
+// or doubled a section would change these numbers).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// The device kernel ran with exactly 64 threads, all of which entered the
+// function; the even branch was taken 32 times and the early-return path 0.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
new file mode 100644
index 0000000000000..a867c30f0edfb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
@@ -0,0 +1,51 @@
+// Test HIP coverage mapping produces source-level coverage for host code.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-cov report %t -instr-profile=%t.profdata 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=REPORT
+//
+// REPORT: instrprof-hip-coverage.hip
+// No coverage column should be fully uncovered. Anchor on a non-digit before
+// the "0.00%" so this does not spuriously match e.g. "80.00%".
+// REPORT-NOT: {{[^.0-9]0[.]00%}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int gpu_abs(int x) {
+    return x < 0 ? -x : x;
+}
+
+__global__ void abs_kernel(int *data, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] = gpu_abs(data[idx]);
+}
+
+int main() {
+    constexpr int N = 16;
+    int h[N];
+    for (int i = 0; i < N; ++i)
+        h[i] = (i % 2 == 0) ? i : -i;
+
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+    (void)hipMemcpy(d, h, N * sizeof(int), hipMemcpyHostToDevice);
+    abs_kernel<<<1, N>>>(d, N);
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != i) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
new file mode 100644
index 0000000000000..a24b28ec9af0a
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
@@ -0,0 +1,67 @@
+// Test that device-side branching is captured in profile counters.
+// Exercises the classify-style pattern where different branches are taken
+// by different threads, verifying that counter values reflect actual execution.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// Device functions should appear with non-zero counters. The __device__
+// classify() helper is inlined into the histogram kernel, so it does not get a
+// separate profile record; its branching is captured within the kernel's
+// counters instead.
+// PROF-DAG: _Z9histogramPKiPii
+// PROF-DAG: main
+// PROF: Total functions: 2
+// PROF: Maximum function count: {{[1-9][0-9]*}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int classify(int x) {
+    if (x > 100)    return 2;
+    else if (x > 0) return 1;
+    else            return 0;
+}
+
+__global__ void histogram(const int *input, int *bins, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n) {
+        int cls = classify(input[idx]);
+        atomicAdd(&bins[cls], 1);
+    }
+}
+
+int main() {
+    constexpr int N = 256;
+    constexpr int NBINS = 3;
+
+    int h_in[N], h_bins[NBINS] = {};
+    for (int i = 0; i < N; ++i)
+        h_in[i] = (i % 3 == 0) ? -1 : (i % 3 == 1) ? 50 : 200;
+
+    int *d_in, *d_bins;
+    (void)hipMalloc(&d_in, N * sizeof(int));
+    (void)hipMalloc(&d_bins, NBINS * sizeof(int));
+    (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+    (void)hipMemset(d_bins, 0, NBINS * sizeof(int));
+
+    histogram<<<1, N>>>(d_in, d_bins, N);
+
+    (void)hipMemcpy(h_bins, d_bins, NBINS * sizeof(int), hipMemcpyDeviceToHost);
+    printf("bins: [%d, %d, %d]\n", h_bins[0], h_bins[1], h_bins[2]);
+
+    int ok = (h_bins[0] > 0 && h_bins[1] > 0 && h_bins[2] > 0);
+    printf("%s\n", ok ? "PASS" : "FAIL");
+
+    (void)hipFree(d_in);
+    (void)hipFree(d_bins);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
new file mode 100644
index 0000000000000..c79cf568f88bc
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
@@ -0,0 +1,61 @@
+// Fork safety: loading the profile-instrumented library must NOT initialize
+// HSA in a process that itself never touches HIP and only runs device work in
+// forked children. RCCL's unit tests follow exactly this pattern -- the parent
+// deliberately keeps HIP/HSA uninitialized and launches kernels only inside
+// forked children. If the profile runtime's library constructor eagerly called
+// hsa_init(), the child would inherit invalid HSA state across fork() and crash
+// inside HSA (e.g. hsa_amd_signal_create -> SharedSignalPool::alloc). The HSA
+// drain therefore brings HSA up lazily, never from a constructor.
+//
+// REQUIRES: hip, amdgpu
+// The eager-hsa_init fork hazard and the lazy HSA drain are Linux-only.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %run %t 2>&1 | FileCheck %s
+//
+// The forked child must complete its kernel without crashing in HSA.
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+#include <sys/wait.h>
+#include <unistd.h>
+
+__global__ void increment(int *p) { *p += 1; }
+
+static int doChildWork() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 1;
+  int h = 41;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  increment<<<1, 1>>>(d);
+  if (hipDeviceSynchronize() != hipSuccess)
+    return 1;
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+  return h == 42 ? 0 : 1;
+}
+
+int main() {
+  // The parent intentionally performs no HIP/HSA work before forking.
+  pid_t pid = fork();
+  if (pid < 0) {
+    printf("FAIL (fork failed)\n");
+    return 1;
+  }
+  if (pid == 0) {
+    // Child runs the device work; _exit avoids flushing the parent's profile
+    // handlers from the child (the RCCL test pattern).
+    _exit(doChildWork());
+  }
+
+  int status = 0;
+  (void)waitpid(pid, &status, 0);
+  int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
+  printf("%s\n", ok ? "PASS" : "FAIL");
+  return ok ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
new file mode 100644
index 0000000000000..6a99546d34bdb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
@@ -0,0 +1,57 @@
+// Test that HIP PGO works on multi-GPU systems. The kernel runs on the default
+// device, so the host-shadow drain (guarded by upstream's launch tracking)
+// collects only that device and the supplemental HSA agent-walk then finds the
+// same code object and dedups it out. The point of the test is that neither
+// pass crashes or hangs reading a non-resident device on a host with several
+// GPUs (the failure mode that the launch tracking + HSA residency walk fix).
+//
+// REQUIRES: hip, amdgpu
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain collects only the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The host-shadow pass drains the launched device, the HSA walk finds that same
+// code object and dedups it (drained=0), and the program does not crash.
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void add_one(int *data, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] += 1;
+}
+
+int main() {
+    int ndev = 0;
+    (void)hipGetDeviceCount(&ndev);
+
+    constexpr int N = 32;
+    int h_data[N];
+    for (int i = 0; i < N; ++i) h_data[i] = i;
+
+    int *d_data;
+    (void)hipMalloc(&d_data, N * sizeof(int));
+    (void)hipMemcpy(d_data, h_data, N * sizeof(int), hipMemcpyHostToDevice);
+
+    add_one<<<1, N>>>(d_data, N);
+
+    (void)hipMemcpy(h_data, d_data, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d_data);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h_data[i] != i + 1) ok = 0;
+
+    printf("%s (devices=%d)\n", ok ? "PASS" : "FAIL", ndev);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
new file mode 100644
index 0000000000000..8cf1258a63535
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
@@ -0,0 +1,63 @@
+// Multi-process accumulation: device counters from several independent runs of an
+// instrumented HIP program must accumulate when their profraws are offline-merged
+// (llvm-profdata merge), the common real-world feedback-collection path. Three
+// runs of a 64-thread launch must sum to a function count of 192 and an
+// even-branch block count of 96 (3 x 64 / 3 x 32).
+//
+// Note: on-the-fly merge-pooling via LLVM_PROFILE_FILE=...%m does NOT currently
+// accumulate *device* counters -- the device profraw is rewritten per process
+// rather than merged in place -- so each process must write a distinct file
+// (here via %p) and the accumulation is done by llvm-profdata merge.
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run1.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run2.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run3.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 192
+// CHECK: Block counts: [0, 96]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
new file mode 100644
index 0000000000000..0fd6185b82441
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
@@ -0,0 +1,58 @@
+// Test PGO with multiple kernel launches from a single TU.
+// Verifies that counters from all device functions are collected correctly.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// All three kernels plus main should be profiled.
+// PROF-DAG: _Z4fillPii
+// PROF-DAG: _Z5scalePii
+// PROF-DAG: _Z6negatePii
+// PROF-DAG: main
+// PROF: Total functions: 4
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val) {
+    data[threadIdx.x] = val;
+}
+
+__global__ void scale(int *data, int factor) {
+    data[threadIdx.x] *= factor;
+}
+
+__global__ void negate(int *data, int n) {
+    int idx = threadIdx.x;
+    if (idx < n)
+        data[idx] = -data[idx];
+}
+
+int main() {
+    constexpr int N = 16;
+    int h[N];
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+
+    fill<<<1, N>>>(d, 5);
+    scale<<<1, N>>>(d, 3);
+    negate<<<1, N>>>(d, N);
+
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != -15) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
new file mode 100644
index 0000000000000..5d3dea671047b
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
@@ -0,0 +1,60 @@
+// Test PGO when the kernel runs on a non-default device (here the program
+// selects device 1). Upstream's launch tracking records that device 1 was used,
+// so the host-shadow drain skips the other devices and collects device 1, and
+// the supplemental HSA agent-walk then finds that same code object and dedups it
+// out. This exercises both that the correct device is drained and that an
+// unused device is never read (which would fault/hang on a multi-GPU host).
+//
+// REQUIRES: hip, amdgpu, multi-device
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain only collects the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The launched device (1) is drained, the unused default device is skipped, and
+// the HSA walk finds the same code object and dedups it (drained=0).
+// CHECK: Skipping unused device 0
+// CHECK: Collecting static profile data from device 1
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] = val;
+}
+
+int main() {
+    int ndev = 0;
+    (void)hipGetDeviceCount(&ndev);
+    if (ndev < 2) {
+        printf("PASS (skipped: only %d device)\n", ndev);
+        return 0;
+    }
+
+    (void)hipSetDevice(1);
+
+    constexpr int N = 32;
+    int h[N] = {};
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+    fill<<<1, N>>>(d, 99, N);
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != 99) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
new file mode 100644
index 0000000000000..9a8a8187f8e77
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
@@ -0,0 +1,63 @@
+// Test the full PGO cycle: instrument, collect, merge, optimize.
+// Verifies that the optimized binary produces correct output and that
+// profile data is consumed without errors.
+//
+// REQUIRES: hip, amdgpu
+//
+// Step 1: Build instrumented binary.
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t.instr \
+// RUN:   -L%hip_lib_path -lamdhip64
+//
+// Step 2: Run to collect profile data.
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t.instr 2>&1 | FileCheck %s
+//
+// Step 3: Merge profile data.
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+//
+// Step 4: Build optimized binary with profile data.
+// RUN: %clang -x hip -fprofile-instr-use=%t.profdata \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t.opt \
+// RUN:   -L%hip_lib_path -lamdhip64 -O2
+//
+// Step 5: Run optimized binary.
+// RUN: env LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t.opt 2>&1 | FileCheck %s
+//
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void scale(float *data, float factor, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] *= factor;
+}
+
+int main() {
+    constexpr int N = 128;
+    float h[N];
+    for (int i = 0; i < N; ++i) h[i] = (float)i;
+
+    float *d;
+    (void)hipMalloc(&d, N * sizeof(float));
+    (void)hipMemcpy(d, h, N * sizeof(float), hipMemcpyHostToDevice);
+
+    scale<<<1, N>>>(d, 2.0f, N);
+
+    (void)hipMemcpy(h, d, N * sizeof(float), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i) {
+        float expected = (float)(i * 2);
+        if (h[i] != expected) ok = 0;
+    }
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/device-pgo/README.md b/compiler-rt/test/profile/device-pgo/README.md
new file mode 100644
index 0000000000000..4338c637abe49
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/README.md
@@ -0,0 +1,125 @@
+# HIP device PGO / code coverage: standalone build & test recipe
+
+This directory provides a CMake-based recipe to build and exercise HIP device
+profile-guided optimization (PGO) and source-based code coverage **outside
+TheRock**, using only an `llvm-project` checkout plus a ROCm runtime.
+
+It builds, in one configure:
+
+- the host toolchain (`clang`, `clang++`, `lld`, `llvm-profdata`, `llvm-cov`)
+  and the lit-lite test utilities (`FileCheck`, `not`);
+- the host ROCm drain runtime `clang_rt.profile_rocm` (opt-in,
+  `COMPILER_RT_BUILD_PROFILE_ROCM=ON`). It runs the upstream host-shadow drain on
+  all platforms; on **Linux** `InstrProfilingPlatformROCm.cpp` additionally runs a
+  supplemental HSA-introspection pass (with content-dedup) to collect device code
+  objects that have no host shadow (e.g. device-linked/RCCL kernels);
+- the **amdgcn device** profile runtime `libclang_rt.profile.a` (the baremetal
+  profile subset that provides `__llvm_profile_instrument_gpu` and the
+  `__llvm_profile_sections` bounds table), built for the `amdgcn-amd-amdhsa`
+  runtime target with LLVM libc for amdgcn.
+
+## Why a separate library
+
+Upstream relands HIP offload PGO runtime support as the **opt-in**
+`clang_rt.profile_rocm` (llvm#201606), a `/MD` superset of `clang_rt.profile`;
+the base library stays unchanged. The driver links `clang_rt.profile_rocm`
+ahead of `clang_rt.profile` for HIP host links when profiling is requested
+(see `clang/lib/Driver/ToolChains/{Linux,MSVC}.cpp`). This recipe just turns
+the option on and builds the matching amdgcn device runtime.
+
+## Prerequisites
+
+- A ROCm installation (for `libamdhip64` and, on Linux, `libhsa-runtime64`),
+  e.g. `/opt/rocm`. Export `ROCM_PATH`.
+- An AMD GPU visible to the runtime for the *run* step (the build step does
+  not need a GPU). `amdgpu-arch` should list your device(s).
+- Ninja, a host C/C++ compiler, and Python 3.
+
+## Build
+
+```bash
+export ROCM_PATH=/opt/rocm
+./build.sh                 # builds into <repo>/build/device-pgo
+# or: ./build.sh /path/to/builddir
+```
+
+Key outputs under the build dir:
+
+```
+bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+```
+
+See `toolchain-cache.cmake` for the exact CMake variables, including the
+`LLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa"` split.
+
+## Run the tests
+
+The lit-lite runner (`../run_gpu_tests.py`) compiles each `.hip` test with the
+just-built toolchain, runs it on the GPU, and pipes output through `FileCheck`.
+It auto-detects features (`multi-device` via `amdgpu-arch`) so tests that need
+two visible GPUs are skipped on single-GPU hosts.
+
+```bash
+python3 ../run_gpu_tests.py \
+    --toolchain-bin "$PWD/<builddir>/bin" \
+    --hip-lib-path "$ROCM_PATH/lib" \
+    ../GPU ../AMDGPU
+```
+
+`--toolchain-bin` must be an **absolute** path (the runner executes each RUN
+line from a temp directory). With the toolchain's `amdgpu-arch`/`offload-arch`
+on hand, `--offload-arch=native` resolves automatically and the `multi-device`
+feature is enabled when 2+ GPUs are visible (so multi-GPU tests run on a
+multi-GPU host and are skipped otherwise). On a multi-gfx90a host this suite is
+15 passed, 0 failed.
+
+### Coverage notes / known gaps
+
+- Quantitative device-counter correctness (`instrprof-hip-counter-correctness`),
+  multi-process offline accumulation (`instrprof-hip-multi-process-merge`) and
+  explicit-collect idempotency (`instrprof-hip-collect-after`) pin exact device
+  counter values, so a dedup or drain regression that drops/doubles a section is
+  caught.
+- `LLVM_PROFILE_FILE=...%m` on-the-fly merge-pooling does **not** accumulate
+  *device* counters today (the device profraw is rewritten per process rather
+  than merged in place); multi-process accumulation must go through
+  `llvm-profdata merge` of distinct per-process files.
+- There is no in-tree test that drains a code object with **no** host shadow in
+  isolation (the pure device-linked/RCCL case the HSA pass uniquely handles): it
+  requires a real device-side library build (the profile runtime linked into the
+  device image), which is not expressible in the lit-lite harness via the clang
+  driver. The dedup tests do prove the HSA pass finds and dedups the same code
+  objects the host-shadow pass drains; validating the no-host-shadow drain needs
+  an actual RCCL-style binary in downstream CI.
+
+## Manual workflow (for reference)
+
+```bash
+CLANG=<builddir>/bin/clang++
+# 1. Instrumented build (host + device).
+$CLANG -O2 -fprofile-instr-generate -fcoverage-mapping \
+    --offload-arch=gfx1100 -xhip app.hip -o app
+
+# 2. Run. Produces a host .profraw and a device
+#    <name>.amdgcn-amd-amdhsa.<arch>.profraw drained by clang_rt.profile_rocm.
+LLVM_PROFILE_FILE='app-%p.profraw' ./app
+
+# 3. Merge (device profiles are merged per GPU arch).
+<builddir>/bin/llvm-profdata merge -o app.profdata app-*.profraw
+
+# 4. Coverage report (device).
+<builddir>/bin/llvm-cov show ./app -instr-profile=app.profdata
+```
+
+## Notes / environment-specific knobs
+
+- `--offload-arch` must match your GPU; the amdgcn device runtime is target
+  generic but the app's device code is per-arch. The build installs
+  `offload-arch` (and the `amdgpu-arch` alias) into `<builddir>/bin`, so
+  `--offload-arch=native` works without a system ROCm `amdgpu-arch`.
+- The amdgcn runtime target requires LLVM libc for amdgcn; if your environment
+  cannot build it, drop `libc` from
+  `RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES` only if your headers are
+  otherwise provided.
diff --git a/compiler-rt/test/profile/device-pgo/build.sh b/compiler-rt/test/profile/device-pgo/build.sh
new file mode 100755
index 0000000000000..edf90f42fb8c1
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/build.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Standalone (non-TheRock) build of the toolchain + host/device runtimes used by
+# the HIP device-PGO / code-coverage tests. See toolchain-cache.cmake and
+# README.md for details.
+#
+#   ./build.sh [BUILD_DIR]
+#
+# Env knobs:
+#   LLVM_SRC   path to the llvm-project checkout (default: repo root inferred
+#              from this script's location)
+#   JOBS       parallelism for ninja (default: nproc)
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# .../compiler-rt/test/profile/device-pgo -> repo root is four levels up.
+LLVM_SRC="${LLVM_SRC:-$(cd "${SCRIPT_DIR}/../../../.." && pwd)}"
+BUILD_DIR="${1:-${LLVM_SRC}/build/device-pgo}"
+JOBS="${JOBS:-$(nproc)}"
+
+echo "llvm-project source : ${LLVM_SRC}"
+echo "build directory     : ${BUILD_DIR}"
+echo "parallel jobs       : ${JOBS}"
+
+cmake -G Ninja \
+  -S "${LLVM_SRC}/llvm" \
+  -B "${BUILD_DIR}" \
+  -C "${SCRIPT_DIR}/toolchain-cache.cmake"
+
+# The 'clang' target also produces the clang++ symlink. The offload toolchain
+# tools (clang-offload-bundler, clang-linker-wrapper, llvm-link,
+# llvm-offload-binary) and offload-arch (also installed as amdgpu-arch) are
+# needed to compile/link a HIP program and to resolve --offload-arch=native /
+# the multi-device test feature. 'runtimes' builds both the host (default) and
+# amdgcn device runtime targets.
+ninja -C "${BUILD_DIR}" -j "${JOBS}" \
+  clang lld \
+  clang-offload-bundler clang-linker-wrapper llvm-link llvm-offload-binary \
+  offload-arch \
+  llvm-profdata llvm-cov FileCheck not \
+  runtimes
+
+cat <<EOF
+
+Build complete.
+
+Toolchain bin : ${BUILD_DIR}/bin
+Run the GPU tests with, e.g.:
+
+  python3 ${SCRIPT_DIR}/../run_gpu_tests.py \\
+      --toolchain-bin ${BUILD_DIR}/bin \\
+      --hip-lib-path \${ROCM_PATH:-/opt/rocm}/lib \\
+      ${SCRIPT_DIR}/../GPU ${SCRIPT_DIR}/../AMDGPU
+
+(--toolchain-bin must be an absolute path; the runner executes RUN lines from a
+temp dir. See README.md for more.)
+EOF
diff --git a/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
new file mode 100644
index 0000000000000..f48656c66d82d
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
@@ -0,0 +1,55 @@
+# CMake cache for a standalone (non-TheRock) build of everything needed to
+# compile, run, and FileCheck the HIP device-PGO / code-coverage tests under
+# compiler-rt/test/profile/{GPU,AMDGPU}.
+#
+# It produces, in a single configure:
+#   * the host toolchain: clang, clang++, lld, llvm-profdata, llvm-cov, plus the
+#     test utilities FileCheck and not (LLVM_INSTALL_UTILS);
+#   * the host ROCm drain runtime clang_rt.profile_rocm (opt-in, links the
+#     sanitizer interception object libs -- hence COMPILER_RT_BUILD_SANITIZERS);
+#   * the amdgcn device profile runtime libclang_rt.profile.a (the baremetal
+#     profile subset providing __llvm_profile_instrument_gpu and the
+#     __llvm_profile_sections bounds table), built for the amdgcn-amd-amdhsa
+#     runtime target via compiler-rt/cmake/caches/AMDGPU.cmake. Building the
+#     device runtime requires LLVM libc for amdgcn, so libc is enabled for that
+#     runtime target.
+#
+# Usage (see ./build.sh for a wrapper):
+#   cmake -G Ninja -S llvm -B build/device-pgo \
+#         -C compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
+#   ninja -C build/device-pgo clang lld clang-offload-bundler \
+#         clang-linker-wrapper llvm-link llvm-offload-binary offload-arch \
+#         llvm-profdata llvm-cov FileCheck not runtimes
+#
+# Outputs (under build/device-pgo):
+#   bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+#   lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+#   lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+
+set(CMAKE_BUILD_TYPE Release CACHE STRING "")
+
+set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "")
+set(LLVM_ENABLE_RUNTIMES "compiler-rt" CACHE STRING "")
+set(LLVM_TARGETS_TO_BUILD "host;AMDGPU" CACHE STRING "")
+set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "")
+set(LLVM_ENABLE_ASSERTIONS ON CACHE BOOL "")
+
+set(CLANG_DEFAULT_LINKER "lld" CACHE STRING "")
+set(CLANG_DEFAULT_RTLIB "compiler-rt" CACHE STRING "")
+
+# Make FileCheck / not available in the install/bin tree for the lit-lite runner.
+set(LLVM_INSTALL_UTILS ON CACHE BOOL "")
+
+# Build host (default) and device (amdgcn) runtimes in one tree.
+set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa" CACHE STRING "")
+
+# Host runtimes: turn on the opt-in ROCm host drain library. It pulls in the
+# sanitizer interception object libs, so sanitizers must be built too.
+set(RUNTIMES_default_COMPILER_RT_BUILD_PROFILE_ROCM ON CACHE BOOL "")
+set(RUNTIMES_default_COMPILER_RT_BUILD_SANITIZERS ON CACHE BOOL "")
+
+# Device runtime: the amdgcn baremetal profile subset, built with LLVM libc for
+# amdgcn (freestanding C headers).
+set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES
+  "${CMAKE_SOURCE_DIR}/../compiler-rt/cmake/caches/AMDGPU.cmake" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc" CACHE STRING "")
diff --git a/compiler-rt/test/profile/run_gpu_tests.py b/compiler-rt/test/profile/run_gpu_tests.py
new file mode 100644
index 0000000000000..27563219ba0ad
--- /dev/null
+++ b/compiler-rt/test/profile/run_gpu_tests.py
@@ -0,0 +1,408 @@
+#!/usr/bin/env python3
+"""Minimal lit-style runner for the HIP device-PGO tests.
+
+The compiler-rt profile lit suite (and llvm-lit / FileCheck) is not part of the
+installed ROCm artifact, but the toolchain, the amdgcn device profile runtime,
+and the HIP runtime are. This runner executes the
+``compiler-rt/test/profile/{GPU,AMDGPU}/*.hip`` tests directly against an
+installed toolchain on a real GPU runner, interpreting just the slice of lit
+markup those tests use:
+
+  - ``// REQUIRES:`` / ``// UNSUPPORTED:`` boolean feature gating,
+  - ``// RUN:`` lines (with ``\\`` continuations) and the fixed substitution set
+    (%clang, %s, %t[.*], %amdgpu_arch, %hip_lib_path, %run, %%),
+  - delegation to ``FileCheck`` / ``not`` (real binaries if present on PATH,
+    otherwise shims backed by the ``filecheck`` PyPI package and a tiny
+    exit-code inverter).
+
+Each RUN line is executed via ``bash -e -o pipefail -c`` so pipes, redirection
+and globbing behave as under lit. A test passes iff all its RUN lines exit 0.
+"""
+
+import argparse
+import os
+import re
+import shutil
+import stat
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+
+# --- feature detection ------------------------------------------------------
+
+
+def _count_visible_gpus(toolchain_bin):
+    """Number of GPUs actually visible to the runtime, or 0 if unknown.
+
+    Uses the toolchain's ``amdgpu-arch`` (one line per visible device). Unlike
+    the KFD topology under ``/sys/class/kfd`` this reflects what HIP/ROCr really
+    exposes -- it honours ``ROCR_VISIBLE_DEVICES`` / ``HIP_VISIBLE_DEVICES`` and
+    container device limits, so it matches what a test's ``hipGetDeviceCount``
+    will see. It is also portable: Windows has no ``/dev/kfd``, but does ship
+    ``amdgpu-arch``.
+    """
+    if not toolchain_bin:
+        return 0
+    tb = Path(toolchain_bin)
+    exe = next(
+        (str(tb / c) for c in ("amdgpu-arch", "amdgpu-arch.exe") if (tb / c).exists()),
+        None,
+    )
+    if exe is None:
+        return 0
+    try:
+        proc = subprocess.run(exe, capture_output=True, text=True, timeout=60)
+    except (OSError, subprocess.SubprocessError):
+        return 0
+    if proc.returncode != 0:
+        return 0
+    return sum(1 for line in proc.stdout.splitlines() if line.strip())
+
+
+def detect_features(toolchain_bin=None, force=None):
+    """Return the set of lit features available on this runner.
+
+    hip/amdgpu are assumed present (this runner only ever drives GPU tests on a
+    runner that has the toolchain + HIP). ``multi-device`` is derived from the
+    number of GPUs the runtime actually exposes (>= 2), via ``amdgpu-arch``.
+    """
+    features = {"hip", "amdgpu"}
+    if sys.platform.startswith("linux"):
+        features.add("linux")
+    elif sys.platform.startswith("win"):
+        features.add("windows")
+
+    if _count_visible_gpus(toolchain_bin) >= 2:
+        features.add("multi-device")
+
+    if force:
+        for f in force:
+            features.add(f)
+    return features
+
+
+# --- boolean expression evaluation (REQUIRES / UNSUPPORTED) ------------------
+
+_TOKEN_RE = re.compile(r"\s*(\(|\)|\|\||&&|!|[\w.+-]+)\s*")
+
+
+def _clause_to_py(clause):
+    out = []
+    for tok in _TOKEN_RE.findall(clause):
+        if tok == "||":
+            out.append(" or ")
+        elif tok == "&&":
+            out.append(" and ")
+        elif tok == "!":
+            out.append(" not ")
+        elif tok in ("(", ")"):
+            out.append(tok)
+        elif tok == "true":
+            out.append("True")
+        elif tok == "false":
+            out.append("False")
+        else:
+            out.append("(%r in FEATURES)" % tok)
+    return "".join(out) or "True"
+
+
+def eval_requires(expr, features):
+    """All comma-separated clauses must be true."""
+    return all(
+        eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+        for c in expr.split(",")
+        if c.strip()
+    )
+
+
+def eval_unsupported(expr, features):
+    """Unsupported if any comma-separated clause is true."""
+    return any(
+        eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+        for c in expr.split(",")
+        if c.strip()
+    )
+
+
+# --- test parsing -----------------------------------------------------------
+
+_DIRECTIVE_RE = re.compile(r"(?://|#)\s*(RUN|REQUIRES|UNSUPPORTED):\s?(.*)")
+
+
+def parse_test(path):
+    """Return (run_lines, requires, unsupported) for a test file."""
+    runs, requires, unsupported = [], [], []
+    cont = None
+    for raw in Path(path).read_text(errors="replace").splitlines():
+        m = _DIRECTIVE_RE.search(raw)
+        if cont is not None:
+            # Continuation of a previous RUN line.
+            text = raw
+            cm = re.search(r"(?://|#)\s*RUN:\s?(.*)", raw)
+            if cm:
+                text = cm.group(1)
+            cont += " " + text.strip()
+            if cont.rstrip().endswith("\\"):
+                cont = cont.rstrip()[:-1]
+            else:
+                runs.append(cont)
+                cont = None
+            continue
+        if not m:
+            continue
+        kind, body = m.group(1), m.group(2)
+        if kind == "REQUIRES":
+            requires.append(body.strip())
+        elif kind == "UNSUPPORTED":
+            unsupported.append(body.strip())
+        elif kind == "RUN":
+            if body.rstrip().endswith("\\"):
+                cont = body.rstrip()[:-1]
+            else:
+                runs.append(body)
+    return runs, requires, unsupported
+
+
+# --- substitutions ----------------------------------------------------------
+
+
+def make_substitutions(clang, clangxx, src, tprefix, arch, hip_lib_path):
+    # Order matters: longer / more specific tokens first; %% resolved last.
+    return [
+        ("%clangxx", clangxx),
+        ("%clang", clang),
+        ("%amdgpu_arch", arch),
+        ("%hip_lib_path", hip_lib_path),
+        ("%run ", ""),
+        ("%s", str(src)),
+        ("%t", tprefix),
+        ("%%", "%"),
+    ]
+
+
+def apply_substitutions(line, subs):
+    for token, value in subs:
+        line = line.replace(token, value)
+    return line
+
+
+# --- tool shims (FileCheck / not) -------------------------------------------
+
+
+def ensure_tools(toolchain_bin, workdir):
+    """Build a PATH that resolves clang/llvm-*, FileCheck and not.
+
+    Prefers real binaries under toolchain_bin; falls back to shims for FileCheck
+    (PyPI ``filecheck``) and ``not`` (exit-code inverter).
+    """
+    shim_dir = workdir / "shims"
+    shim_dir.mkdir(parents=True, exist_ok=True)
+    path = os.pathsep.join(
+        [str(toolchain_bin), str(shim_dir), os.environ.get("PATH", "")]
+    )
+
+    def have(tool):
+        # File-based check (shutil.which is quirky across OSes / Git Bash). The
+        # shims are extensionless bash scripts, which Git Bash resolves via the
+        # shebang, so a real binary is anything matching tool or tool.exe.
+        tb = Path(toolchain_bin)
+        return (tb / tool).exists() or (tb / (tool + ".exe")).exists()
+
+    def write_shim(name, body):
+        p = shim_dir / name
+        p.write_text(body)
+        p.chmod(p.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
+
+    if not have("FileCheck"):
+        write_shim(
+            "FileCheck",
+            "#!/usr/bin/env bash\n"
+            'if command -v filecheck >/dev/null 2>&1; then exec filecheck "$@"; fi\n'
+            'exec python3 -m filecheck "$@"\n',
+        )
+    if not have("not"):
+        write_shim(
+            "not",
+            "#!/usr/bin/env bash\n"
+            'if [ "$1" = "--crash" ]; then shift; "$@"; ec=$?; '
+            "[ $ec -ge 128 ] && exit 0 || exit 1; fi\n"
+            '"$@"; ec=$?; [ $ec -eq 0 ] && exit 1 || exit 0\n',
+        )
+    return path
+
+
+# --- execution --------------------------------------------------------------
+
+
+def run_one(path, args, features, base_env):
+    runs, requires, unsupported = parse_test(path)
+
+    for expr in requires:
+        if not eval_requires(expr, features):
+            return "UNSUPPORTED", "missing requirement: %s" % expr
+    for expr in unsupported:
+        if eval_unsupported(expr, features):
+            return "UNSUPPORTED", "unsupported: %s" % expr
+    if not runs:
+        return "UNSUPPORTED", "no RUN lines"
+
+    workdir = Path(tempfile.mkdtemp(prefix="profgpu-"))
+    tprefix = str(workdir / "t")
+    subs = make_substitutions(
+        args.clang,
+        args.clangxx,
+        Path(path).resolve(),
+        tprefix,
+        args.amdgpu_arch,
+        args.hip_lib_path,
+    )
+
+    if args.dry_run:
+        print("# %s" % path)
+        for line in runs:
+            print("    " + apply_substitutions(line, subs).strip())
+        return "DRYRUN", ""
+
+    env = dict(base_env)
+    env["PATH"] = ensure_tools(Path(args.toolchain_bin), workdir)
+    timeout = args.timeout if args.timeout and args.timeout > 0 else None
+    for line in runs:
+        cmd = apply_substitutions(line, subs).strip()
+        try:
+            proc = subprocess.run(
+                ["bash", "-e", "-o", "pipefail", "-c", cmd],
+                cwd=str(workdir),
+                env=env,
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+        except subprocess.TimeoutExpired as e:
+            out = e.stdout or ""
+            err = e.stderr or ""
+            if isinstance(out, bytes):
+                out = out.decode("utf-8", "replace")
+            if isinstance(err, bytes):
+                err = err.decode("utf-8", "replace")
+            detail = "RUN timed out after %gs: %s\n%s%s" % (
+                timeout,
+                cmd,
+                out,
+                err,
+            )
+            if not args.keep:
+                shutil.rmtree(workdir, ignore_errors=True)
+            return "FAIL", detail
+        if proc.returncode != 0:
+            detail = "RUN failed (rc=%d): %s\n%s%s" % (
+                proc.returncode,
+                cmd,
+                proc.stdout,
+                proc.stderr,
+            )
+            if not args.keep:
+                shutil.rmtree(workdir, ignore_errors=True)
+            return "FAIL", detail
+    if not args.keep:
+        shutil.rmtree(workdir, ignore_errors=True)
+    return "PASS", ""
+
+
+def discover(paths):
+    tests = []
+    for p in paths:
+        p = Path(p)
+        if p.is_dir():
+            tests.extend(sorted(str(x) for x in p.rglob("*.hip")))
+        elif p.is_file():
+            tests.append(str(p))
+    return tests
+
+
+def main():
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("tests", nargs="+", help="Test files or directories")
+    ap.add_argument(
+        "--toolchain-bin", required=False, help="Directory with clang and llvm-* tools"
+    )
+    ap.add_argument("--hip-lib-path", default="", help="Directory with libamdhip64")
+    ap.add_argument("--amdgpu-arch", default="native")
+    ap.add_argument("--clang", help="Override clang path")
+    ap.add_argument("--clangxx", help="Override clang++ path")
+    ap.add_argument(
+        "--feature",
+        action="append",
+        default=[],
+        help="Force-enable an extra lit feature",
+    )
+    ap.add_argument(
+        "--dry-run",
+        action="store_true",
+        help="Print resolved RUN lines without executing",
+    )
+    ap.add_argument("--keep", action="store_true", help="Keep per-test temp dirs")
+    ap.add_argument(
+        "--timeout",
+        type=float,
+        default=600,
+        help="Per-RUN-line timeout in seconds (<=0 disables); "
+        "guards against a hung GPU/compiler wedging the run",
+    )
+    args = ap.parse_args()
+
+    if not args.dry_run and not args.toolchain_bin:
+        ap.error("--toolchain-bin is required unless --dry-run is given")
+
+    if args.toolchain_bin:
+        binp = Path(args.toolchain_bin)
+        args.clang = args.clang or str(binp / "clang")
+        args.clangxx = args.clangxx or str(binp / "clang++")
+    else:
+        args.clang = args.clang or "clang"
+        args.clangxx = args.clangxx or "clang++"
+
+    features = detect_features(args.toolchain_bin, args.feature)
+    print("# features: %s" % ", ".join(sorted(features)))
+
+    base_env = dict(os.environ)
+    if args.toolchain_bin:
+        lib_dirs = [
+            str(Path(args.toolchain_bin).parent / "lib"),  # toolchain libs
+        ]
+        if args.hip_lib_path:
+            lib_dirs.append(args.hip_lib_path)
+        existing = base_env.get("LD_LIBRARY_PATH", "")
+        base_env["LD_LIBRARY_PATH"] = os.pathsep.join(
+            [d for d in lib_dirs if d] + ([existing] if existing else [])
+        )
+
+    tests = discover(args.tests)
+    if not tests:
+        print("error: no tests found", file=sys.stderr)
+        return 2
+
+    results = {"PASS": [], "FAIL": [], "UNSUPPORTED": [], "DRYRUN": []}
+    for t in tests:
+        status, detail = run_one(t, args, features, base_env)
+        results[status].append(t)
+        if status == "FAIL":
+            print("FAIL: %s" % t)
+            print(detail)
+        elif status in ("PASS", "UNSUPPORTED"):
+            print("%s: %s" % (status, t))
+
+    print(
+        "\n# summary: %d passed, %d failed, %d unsupported (of %d)"
+        % (
+            len(results["PASS"]),
+            len(results["FAIL"]),
+            len(results["UNSUPPORTED"]),
+            len(tests),
+        )
+    )
+    return 1 if results["FAIL"] else 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())

>From c72c3119c25f70a9d0b62c5a7a40bc9b9382b80f Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Mon, 15 Jun 2026 16:00:53 -0500
Subject: [PATCH 2/7] [PGO][HIP] Grow HSA drain dedup table dynamically

The (data, counters, names) dedup table used to compose the host-shadow and
HSA-introspection device drains was a fixed 256-entry static array, and entries
past the cap were silently dropped. In non-RDC mode the entry count scales like
num_code_objects * num_agents, so the cap could be exceeded, dropping a tuple
from the dedup set and risking a section being drained twice (double-counted).

Replace it with a realloc-backed array that doubles on demand (initial cap 64),
matching the existing growth idiom in this file (growPtrArray and the shadow
arrays). On allocation failure the existing table is kept and recording is
skipped; the worst case is one duplicate profraw record, never a crash.

Addresses review feedback on PR #203056.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 .../profile/InstrProfilingPlatformROCm.cpp    | 29 ++++++++++++++-----
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index b1db1d8a74041..f39fee37fa328 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -1378,9 +1378,14 @@ struct ProfBoundsTuple {
 };
 } // namespace
 
-#define PROF_MAX_SEEN_BOUNDS 256
-static ProfBoundsTuple SeenBounds[PROF_MAX_SEEN_BOUNDS];
+/* Grown on demand (doubling) rather than fixed-cap: in non-RDC mode the entry
+ * count scales like num_code_objects * num_agents, so any fixed cap could be
+ * exceeded and silently lose dedup coverage (double-counting drained sections).
+ * Starts at PROF_SEEN_BOUNDS_INIT_CAP. */
+#define PROF_SEEN_BOUNDS_INIT_CAP 64
+static ProfBoundsTuple *SeenBounds = nullptr;
 static int NumSeenBounds = 0;
+static int CapSeenBounds = 0;
 
 /* Pure check: has this bounds tuple already been drained? Does not mutate
  * state, so a transient failure does not permanently suppress retries. */
@@ -1399,12 +1404,22 @@ static void profRecordDrainedBounds(const void *D, const void *C,
                                     const void *N) {
   if (profBoundsAlreadyDrained(D, C, N))
     return;
-  if (NumSeenBounds < PROF_MAX_SEEN_BOUNDS) {
-    SeenBounds[NumSeenBounds].data = D;
-    SeenBounds[NumSeenBounds].cnts = C;
-    SeenBounds[NumSeenBounds].names = N;
-    NumSeenBounds++;
+  if (NumSeenBounds == CapSeenBounds) {
+    int NewCap = CapSeenBounds ? CapSeenBounds * 2 : PROF_SEEN_BOUNDS_INIT_CAP;
+    ProfBoundsTuple *New =
+        (ProfBoundsTuple *)realloc(SeenBounds, NewCap * sizeof(*New));
+    /* Best-effort: on OOM keep the existing table and skip recording. The
+     * worst case is that this one section is drained again later (a duplicate
+     * profraw record), never a crash. */
+    if (!New)
+      return;
+    SeenBounds = New;
+    CapSeenBounds = NewCap;
   }
+  SeenBounds[NumSeenBounds].data = D;
+  SeenBounds[NumSeenBounds].cnts = C;
+  SeenBounds[NumSeenBounds].names = N;
+  NumSeenBounds++;
 }
 
 #define PROF_MAX_GPU_AGENTS 64

>From 4a5bf40a3830698b8fe9d98a449fc2409471dfcb Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Tue, 16 Jun 2026 04:33:52 -0500
Subject: [PATCH 3/7] [PGO][HIP] Drop device profile RT link from the legacy
 lld path

Remove the device-profile-runtime forwarding added to
AMDGCN::Linker::constructLldCommand (and its VirtualFileSystem.h include).

HIP defaults to the new offload driver (UseNewOffloadingDriver is true whenever
an offload kind is active), so the device link goes through
clang-linker-wrapper / LinkerWrapper::ConstructJob, which forwards -fprofile*
and links the device profile runtime via addProfileRTLibs. constructLldCommand
is only reached under --no-offload-new-driver, the legacy path that is being
deprecated. Per review on PR #203056, drop this so the PR carries no driver
changes; device PGO relies on the default (new-driver) link path.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 clang/lib/Driver/ToolChains/HIPAMD.cpp | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 45e71ac802a89..84664bcddbb94 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,7 +19,6 @@
 #include "clang/Options/Options.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
-#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"
 
 using namespace clang::driver;
@@ -143,25 +142,6 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
 
   LldArgs.push_back("--no-whole-archive");
 
-  // With PGO/coverage instrumentation, instrumented device code references the
-  // device profile runtime (__llvm_profile_instrument_gpu and the
-  // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
-  // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
-  // but HIP using the traditional offload path (e.g. on Windows, which does not
-  // route device linking through clang-linker-wrapper) reaches the device link
-  // here instead. Forward the static device profile runtime to this lld device
-  // link so the runtime is pulled in regardless of offload-driver/host OS. The
-  // archive is arch-suffixed, so pass its full path rather than a -l name.
-  if (ToolChain::needsProfileRT(Args)) {
-    std::string ProfileRT =
-        TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
-    // Use the ToolChain VFS (matches the new-offload-driver path in
-    // Clang.cpp) so overlay/virtual filesystems used by the driver are
-    // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
-    if (TC.getVFS().exists(ProfileRT))
-      LldArgs.push_back(Args.MakeArgString(ProfileRT));
-  }
-
   const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
   C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
                                          Lld, LldArgs, Inputs, Output));

>From f1ba931df8cf5d842be63ab685cbe6df4784c222 Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Wed, 17 Jun 2026 02:37:45 -0500
Subject: [PATCH 4/7] [PGO][HIP] Split HSA device drain into its own file

Move the Linux-only supplemental HSA-introspection drain out of
InstrProfilingPlatformROCm.cpp into InstrProfilingPlatformROCmHSA.cpp,
with a private InstrProfilingPlatformROCmInternal.h declaring the shared
__prof_rocm interface (HIP helpers, processDeviceOffloadPrf, UniqueFree,
and the drained-bounds dedup hooks).

The new file is guarded by __linux__ && !_WIN32 and compiles to an empty
TU elsewhere; there is no Windows HSA path (Windows uses only the
host-shadow HIP drain). No behavior change; wired into CMake and the
Bazel overlay.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 compiler-rt/lib/profile/CMakeLists.txt        |   4 +-
 .../profile/InstrProfilingPlatformROCm.cpp    | 617 +-----------------
 .../profile/InstrProfilingPlatformROCmHSA.cpp | 582 +++++++++++++++++
 .../InstrProfilingPlatformROCmInternal.h      |  74 +++
 .../compiler-rt/BUILD.bazel                   |   1 +
 5 files changed, 687 insertions(+), 591 deletions(-)
 create mode 100644 compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
 create mode 100644 compiler-rt/lib/profile/InstrProfilingPlatformROCmInternal.h

diff --git a/compiler-rt/lib/profile/CMakeLists.txt b/compiler-rt/lib/profile/CMakeLists.txt
index b41843ad555b2..cdb17ce7d189c 100644
--- a/compiler-rt/lib/profile/CMakeLists.txt
+++ b/compiler-rt/lib/profile/CMakeLists.txt
@@ -233,7 +233,9 @@ if(COMPILER_RT_BUILD_PROFILE_ROCM AND NOT COMPILER_RT_PROFILE_BAREMETAL
    AND TARGET RTSanitizerCommon.${COMPILER_RT_DEFAULT_TARGET_ARCH}
    AND TARGET RTSanitizerCommonLibc.${COMPILER_RT_DEFAULT_TARGET_ARCH})
 
-  set(PROFILE_ROCM_SOURCES ${PROFILE_SOURCES} InstrProfilingPlatformROCm.cpp)
+  set(PROFILE_ROCM_SOURCES ${PROFILE_SOURCES}
+    InstrProfilingPlatformROCm.cpp
+    InstrProfilingPlatformROCmHSA.cpp)
 
   # Enables the device-collection call in InstrProfilingFile.c.
   set(PROFILE_ROCM_FLAGS ${EXTRA_FLAGS} -DCOMPILER_RT_BUILD_PROFILE_ROCM=1)
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index f39fee37fa328..faedf66ba89a3 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -33,6 +33,13 @@ extern "C" {
 #include <pthread.h>
 #endif
 
+#include "InstrProfilingPlatformROCmInternal.h"
+
+// Shared helpers below are defined as members of __prof_rocm (so the Linux HSA
+// drain in InstrProfilingPlatformROCmHSA.cpp can call them); this directive
+// lets the rest of this file keep calling them unqualified.
+using namespace __prof_rocm;
+
 /* Serialize one-time HIP loader resolution and DynamicModules mutations.
  * Inline to avoid a sanitizer_common dependency. */
 #ifdef _WIN32
@@ -62,20 +69,11 @@ static void unlockDynamicModules(void) {
 }
 #endif
 
-struct OffloadSectionShadowGroup;
-static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
-                                   const OffloadSectionShadowGroup *Sections);
-
-#if defined(__linux__) && !defined(_WIN32)
-// Record a drained section-bounds tuple so the supplemental HSA-introspection
-// pass (Linux only) skips any code object the host-shadow path already
-// drained. Defined alongside the HSA drain below; forward-declared here so
-// processDeviceOffloadPrf can register every successful host-shadow drain.
-static void profRecordDrainedBounds(const void *Data, const void *Counters,
-                                    const void *Names);
-#endif
+// processDeviceOffloadPrf(), the dedup-recording profRecordDrainedBounds(), and
+// the OffloadSectionShadowGroup forward declaration all come from
+// InstrProfilingPlatformROCmInternal.h.
 
-static int isVerboseMode() {
+int __prof_rocm::isVerboseMode() {
   static int IsVerbose = -1;
   if (IsVerbose == -1)
     IsVerbose = getenv("LLVM_PROFILE_VERBOSE") != nullptr;
@@ -274,7 +272,7 @@ static BOOL CALLBACK ensureHipLoadedCb(PINIT_ONCE, PVOID, PVOID *) {
 }
 #endif
 
-static void ensureHipLoaded(void) {
+void __prof_rocm::ensureHipLoaded(void) {
 #ifdef _WIN32
   InitOnceExecuteOnce(&HipLoadedOnce, ensureHipLoadedCb, NULL, NULL);
 #else
@@ -282,6 +280,10 @@ static void ensureHipLoaded(void) {
 #endif
 }
 
+// Accessor for the HSA drain: true once the loaded HIP runtime exposes
+// hipMemcpy. Kept here so pHipMemcpy stays file-private to this TU.
+int __prof_rocm::hipMemcpyAvailable() { return pHipMemcpy != nullptr; }
+
 /* -------------------------------------------------------------------------- */
 /*  Public wrappers that forward to the loaded HIP symbols                   */
 /* -------------------------------------------------------------------------- */
@@ -304,7 +306,7 @@ static int hipMemcpy(void *dest, const void *src, size_t len,
 
 /* Device section symbols must be registered with CLR first; otherwise
  * hipMemcpy may take a CPU path and crash. */
-static int memcpyDeviceToHost(void *Dst, const void *Src, size_t Size) {
+int __prof_rocm::memcpyDeviceToHost(void *Dst, const void *Src, size_t Size) {
   return hipMemcpy(Dst, Src, Size, 2 /* DToH */);
 }
 
@@ -740,28 +742,8 @@ __llvm_profile_offload_register_section_shadow_variable(void *ptr) {
   ++Group->NumSections;
 }
 
-namespace {
-
-// free()-based scope guard. Use .release() to transfer ownership.
-struct UniqueFree {
-  void *Ptr;
-  explicit UniqueFree(void *P = nullptr) : Ptr(P) {}
-  ~UniqueFree() { free(Ptr); }
-  UniqueFree(const UniqueFree &) = delete;
-  UniqueFree &operator=(const UniqueFree &) = delete;
-  char *get() const { return static_cast<char *>(Ptr); }
-  void reset(void *P) {
-    free(Ptr);
-    Ptr = P;
-  }
-  void *release() {
-    void *P = Ptr;
-    Ptr = nullptr;
-    return P;
-  }
-};
-
-} // namespace
+// UniqueFree (free()-based scope guard) lives in
+// InstrProfilingPlatformROCmInternal.h so the HSA drain can share it.
 
 static int getRegisteredSectionBounds(void *Shadow, void **DevicePtr,
                                       size_t *Size) {
@@ -796,8 +778,9 @@ hasCompleteSectionShadows(const OffloadSectionShadowGroup *Sections) {
   return 1;
 }
 
-static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
-                                   const OffloadSectionShadowGroup *Sections) {
+int __prof_rocm::processDeviceOffloadPrf(
+    void *DeviceOffloadPrf, const char *Target,
+    const OffloadSectionShadowGroup *Sections) {
   __llvm_profile_gpu_sections HostSections;
 
   if (hipMemcpy(&HostSections, DeviceOffloadPrf, sizeof(HostSections),
@@ -1163,557 +1146,11 @@ static int isHipAvailable(void) {
   return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
 }
 
-/* ========================================================================== */
-/*  Supplemental HSA-introspection drain (Linux only)                         */
-/*                                                                            */
-/*  The host-shadow drain above only sees device code objects registered      */
-/*  host-side (__hipRegisterVar shadows) or loaded through an intercepted */
-/*  hipModuleLoad* call. Device code linked by the offload device linker with */
-/*  no host-side shadow -- e.g. RCCL, whose many device functions are glued */
-/*  into a single kernel with no source module -- is invisible to it. This */
-/*  pass walks every GPU agent's loaded executables via HSA, finds each */
-/*  __llvm_profile_sections table directly on the device, and drains the ones */
-/*  the host-shadow pass did not already handle (deduped by the device */
-/*  section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
-/*  copy/relocate/write so the on-disk profraw layout is identical.           */
-/* ========================================================================== */
-#if defined(__linux__) && !defined(_WIN32)
-
-/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
- * build time, so mirror just the handful of HSA declarations the drain needs.
- * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
-typedef uint32_t prof_hsa_status_t;
-#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
-#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
-
-typedef struct {
-  uint64_t handle;
-} prof_hsa_agent_t;
-typedef struct {
-  uint64_t handle;
-} prof_hsa_executable_t;
-typedef struct {
-  uint64_t handle;
-} prof_hsa_executable_symbol_t;
-
-typedef uint32_t prof_hsa_agent_info_t;
-#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
-#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
-
-typedef uint32_t prof_hsa_device_type_t;
-#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
-
-typedef uint32_t prof_hsa_symbol_kind_t;
-#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
-
-typedef uint32_t prof_hsa_executable_symbol_info_t;
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
-  ((prof_hsa_executable_symbol_info_t)0)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
-  ((prof_hsa_executable_symbol_info_t)1)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
-  ((prof_hsa_executable_symbol_info_t)2)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
-  ((prof_hsa_executable_symbol_info_t)21)
-
-#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
-
-typedef uint32_t prof_hsa_loader_storage_type_t;
-
-typedef struct {
-  prof_hsa_agent_t agent;
-  prof_hsa_executable_t executable;
-  prof_hsa_loader_storage_type_t code_object_storage_type;
-  const void *code_object_storage_base;
-  size_t code_object_storage_size;
-  size_t code_object_storage_offset;
-  const void *segment_base;
-  size_t segment_size;
-} prof_hsa_loader_segment_descriptor_t;
-
-typedef prof_hsa_status_t (*hsa_init_ty)(void);
-typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
-    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
-typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
-                                                   prof_hsa_agent_info_t,
-                                                   void *);
-typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
-    prof_hsa_executable_t, prof_hsa_agent_t,
-    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
-                          prof_hsa_executable_symbol_t, void *),
-    void *);
-typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
-    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
-typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
-                                                                     uint16_t,
-                                                                     size_t,
-                                                                     void *);
-typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
-    prof_hsa_loader_segment_descriptor_t *, size_t *);
-
-/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
- * query_segment_descriptors is used; query_host_address keeps the offset. */
-typedef struct {
-  void *query_host_address;
-  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
-} prof_hsa_loader_pfn_t;
-
-static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
-static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
-static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
-static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
-static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
-
-/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
- * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
- * the fully-written p* function pointers (published before the release store
- * of HsaRuntimeState=1 below). */
-static int HsaRuntimeState = 0;
-
-static int setHsaRuntimeState(int S) {
-  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
-  return S > 0 ? 0 : -1;
-}
-
-/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
- * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
- * resolved by the shared ensureHipLoaded() above. */
-static int loadHsaRuntimePointers(void) {
-  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
-  if (State)
-    return State > 0 ? 0 : -1;
-
-  if (!__interception::DynamicLoaderAvailable()) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "Dynamic library loading not available - "
-                      "HSA device profiling disabled\n");
-    return setHsaRuntimeState(-1);
-  }
-
-  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
-  if (!Hsa)
-    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
-  if (!Hsa) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
-                      "HSA device profiling disabled\n");
-    return setHsaRuntimeState(-1);
-  }
-
-  hsa_init_ty pHsaInit =
-      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
-  hsa_system_get_major_extension_table_ty pGetExtTable =
-      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
-          Hsa, "hsa_system_get_major_extension_table");
-  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
-      Hsa, "hsa_iterate_agents");
-  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
-      Hsa, "hsa_agent_get_info");
-  pHsaExecIterAgentSyms =
-      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
-          Hsa, "hsa_executable_iterate_agent_symbols");
-  pHsaSymGetInfo =
-      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
-          Hsa, "hsa_executable_symbol_get_info");
-
-  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
-      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
-    PROF_WARN("%s",
-              "required HSA symbols missing - HSA device profiling disabled\n");
-    return setHsaRuntimeState(-1);
-  }
-
-  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
-   * rather than from the library constructor, so merely loading the
-   * instrumented library does not initialize HSA in the process -- which would
-   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
-   * the parent (see the constructor note at the end of the HSA block). In the
-   * common case the drain runs from the profile write path while HSA is still
-   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
-   * this simply re-initializes HSA (the process is exiting anyway). */
-  prof_hsa_status_t St = pHsaInit();
-  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
-    if (isVerboseMode())
-      PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St);
-    return setHsaRuntimeState(-1);
-  }
-
-  prof_hsa_loader_pfn_t LoaderApi;
-  __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi));
-  St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi),
-                    &LoaderApi);
-  if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) {
-    PROF_WARN("AMD loader extension unavailable (0x%x) - "
-              "HSA device profiling disabled\n",
-              St);
-    return setHsaRuntimeState(-1);
-  }
-  pQuerySegDescs = LoaderApi.query_segment_descriptors;
-
-  /* The device-to-host copies go through the shared HIP loader. */
-  ensureHipLoaded();
-  if (!pHipMemcpy) {
-    PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n");
-    return setHsaRuntimeState(-1);
-  }
-
-  if (isVerboseMode())
-    PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n");
-  return setHsaRuntimeState(1);
-}
-
-/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */
-static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
-
-/* Dedup of drained section-bounds tuples, shared with the host-shadow path
- * (processDeviceOffloadPrf records here on every successful drain). A single
- * linked device code object exposes one __llvm_profile_sections, but the same
- * bounds may be seen via multiple agents, so each unique counter set is
- * drained exactly once across both paths. */
-namespace {
-struct ProfBoundsTuple {
-  const void *data;
-  const void *cnts;
-  const void *names;
-};
-} // namespace
-
-/* Grown on demand (doubling) rather than fixed-cap: in non-RDC mode the entry
- * count scales like num_code_objects * num_agents, so any fixed cap could be
- * exceeded and silently lose dedup coverage (double-counting drained sections).
- * Starts at PROF_SEEN_BOUNDS_INIT_CAP. */
-#define PROF_SEEN_BOUNDS_INIT_CAP 64
-static ProfBoundsTuple *SeenBounds = nullptr;
-static int NumSeenBounds = 0;
-static int CapSeenBounds = 0;
-
-/* Pure check: has this bounds tuple already been drained? Does not mutate
- * state, so a transient failure does not permanently suppress retries. */
-static int profBoundsAlreadyDrained(const void *D, const void *C,
-                                    const void *N) {
-  for (int i = 0; i < NumSeenBounds; ++i)
-    if (SeenBounds[i].data == D && SeenBounds[i].cnts == C &&
-        SeenBounds[i].names == N)
-      return 1;
-  return 0;
-}
-
-/* Record a drained bounds tuple. Idempotent. Called after a successful drain
- * (either path) so a failed attempt stays retryable. */
-static void profRecordDrainedBounds(const void *D, const void *C,
-                                    const void *N) {
-  if (profBoundsAlreadyDrained(D, C, N))
-    return;
-  if (NumSeenBounds == CapSeenBounds) {
-    int NewCap = CapSeenBounds ? CapSeenBounds * 2 : PROF_SEEN_BOUNDS_INIT_CAP;
-    ProfBoundsTuple *New =
-        (ProfBoundsTuple *)realloc(SeenBounds, NewCap * sizeof(*New));
-    /* Best-effort: on OOM keep the existing table and skip recording. The
-     * worst case is that this one section is drained again later (a duplicate
-     * profraw record), never a crash. */
-    if (!New)
-      return;
-    SeenBounds = New;
-    CapSeenBounds = NewCap;
-  }
-  SeenBounds[NumSeenBounds].data = D;
-  SeenBounds[NumSeenBounds].cnts = C;
-  SeenBounds[NumSeenBounds].names = N;
-  NumSeenBounds++;
-}
-
-#define PROF_MAX_GPU_AGENTS 64
-
-namespace {
-struct GpuAgent {
-  prof_hsa_agent_t agent;
-  char arch[64];
-};
-
-struct WalkState {
-  GpuAgent agents[PROF_MAX_GPU_AGENTS];
-  int num_agents;
-  int total_found;
-  int total_drained;
-};
-
-/* Per (agent, executable) symbol-iteration state. */
-struct SymbolState {
-  const char *arch;
-  int found;
-  int drained;
-};
-} // namespace
-
-/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable,
- * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an
- * earlier agent) already handled the same bounds. */
-static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
-                                  prof_hsa_executable_symbol_t Sym,
-                                  void *Data) {
-  SymbolState *S = (SymbolState *)Data;
-
-  prof_hsa_symbol_kind_t Kind;
-  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) !=
-          PROF_HSA_STATUS_SUCCESS ||
-      Kind != PROF_HSA_SYMBOL_KIND_VARIABLE)
-    return PROF_HSA_STATUS_SUCCESS;
-
-  uint32_t NameLen = 0;
-  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
-                     &NameLen) != PROF_HSA_STATUS_SUCCESS ||
-      NameLen != sizeof(ProfileSectionsSymbol) - 1)
-    return PROF_HSA_STATUS_SUCCESS;
-
-  char NameBuf[64];
-  if (NameLen + 1 > sizeof(NameBuf))
-    return PROF_HSA_STATUS_SUCCESS;
-  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) !=
-      PROF_HSA_STATUS_SUCCESS)
-    return PROF_HSA_STATUS_SUCCESS;
-  NameBuf[NameLen] = '\0';
-
-  if (strcmp(NameBuf, ProfileSectionsSymbol) != 0)
-    return PROF_HSA_STATUS_SUCCESS;
-
-  uint64_t Addr = 0;
-  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
-                     &Addr) != PROF_HSA_STATUS_SUCCESS ||
-      Addr == 0) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n");
-    return PROF_HSA_STATUS_SUCCESS;
-  }
-
-  S->found++;
-
-  // Read the bounds table first to dedup (and detect empty sections) before
-  // the full copy/relocate done by processDeviceOffloadPrf.
-  __llvm_profile_gpu_sections Sec;
-  if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) {
-    PROF_WARN("%s", "failed to copy device bounds table\n");
-    return PROF_HSA_STATUS_SUCCESS;
-  }
-  if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart,
-                               Sec.NamesStart)) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "device bounds already drained, skipping\n");
-    return PROF_HSA_STATUS_SUCCESS;
-  }
-
-  size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart;
-  size_t CntsBytes =
-      (const char *)Sec.CountersStop - (const char *)Sec.CountersStart;
-  if (DataBytes == 0 || CntsBytes == 0) {
-    // Empty code object: nothing to write. Mark seen so we don't revisit it.
-    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
-    return PROF_HSA_STATUS_SUCCESS;
-  }
-
-  // Generate a collision-free target. Multiple distinct device code objects on
-  // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
-  static int DrainIndex = 0;
-  char Target[96];
-  if (DrainIndex == 0)
-    snprintf(Target, sizeof(Target), "%s", S->arch);
-  else
-    snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
-
-  // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
-  // Record the bounds (and advance the target index) only on success so a
-  // transient error stays retryable on a later agent or collect call.
-  int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
-  if (Rc == 0) {
-    S->drained++;
-    DrainIndex++;
-    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
-  }
-
-  return PROF_HSA_STATUS_SUCCESS;
-}
-
-static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
-  prof_hsa_device_type_t DevType;
-  if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) !=
-          PROF_HSA_STATUS_SUCCESS ||
-      DevType != PROF_HSA_DEVICE_TYPE_GPU)
-    return PROF_HSA_STATUS_SUCCESS;
-
-  WalkState *W = (WalkState *)Data;
-  if (W->num_agents >= PROF_MAX_GPU_AGENTS)
-    return PROF_HSA_STATUS_SUCCESS;
-
-  GpuAgent &GA = W->agents[W->num_agents++];
-  GA.agent = Agent;
-  char Name[64];
-  __builtin_memset(Name, 0, sizeof(Name));
-  pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name);
-  size_t N = strnlen(Name, sizeof(GA.arch) - 1);
-  __builtin_memcpy(GA.arch, Name, N);
-  GA.arch[N] = '\0';
-  if (!GA.arch[0])
-    strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1);
-
-  if (isVerboseMode())
-    PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch);
-  return PROF_HSA_STATUS_SUCCESS;
-}
-
-/* Reentrancy guard and "drained data at least once" latch. The collect hook
- * may run more than once (an explicit early __llvm_profile_write_file plus the
- * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
- * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
- * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
- * stay retryable so a later call can still pick up code objects loaded later.
- * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
- * destructor) from corrupting the global SeenBounds table. Both flags use
- * acquire/release atomics. */
-static int HsaDrainInProgress = 0;
-static int HsaDrainCompleted = 0;
-
-static int drainDevicesViaHsa(void) {
-  if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE))
-    return 0;
-
-  int Expected = 0;
-  if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1,
-                                   /*weak=*/0, __ATOMIC_ACQ_REL,
-                                   __ATOMIC_ACQUIRE))
-    return 0;
-
-  struct InProgressGuard {
-    ~InProgressGuard() {
-      __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE);
-    }
-  } _Guard;
-
-  if (loadHsaRuntimePointers() != 0)
-    return 0; /* Runtime unavailable: stay retryable. */
-
-  WalkState W;
-  __builtin_memset(&W, 0, sizeof(W));
-  prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W);
-  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
-    PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St);
-    return -1;
-  }
-  if (W.num_agents == 0) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n");
-    return 0;
-  }
-
-  /* query_segment_descriptors ships in every loader-extension version and is
-   * more permissive than iterate_executables on ROCm. It yields the loaded
-   * (agent, executable) pairs directly. */
-  size_t NumSegs = 0;
-  St = pQuerySegDescs(nullptr, &NumSegs);
-  if (St != PROF_HSA_STATUS_SUCCESS) {
-    PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St);
-    return -1;
-  }
-  if (NumSegs == 0) {
-    if (isVerboseMode())
-      PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n");
-    return 0;
-  }
-
-  prof_hsa_loader_segment_descriptor_t *Segs =
-      (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs));
-  if (!Segs) {
-    PROF_ERR("%s\n", "failed to allocate segment descriptor array");
-    return -1;
-  }
-  UniqueFree SegsOwner(Segs);
-
-  St = pQuerySegDescs(Segs, &NumSegs);
-  if (St != PROF_HSA_STATUS_SUCCESS) {
-    PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St);
-    return -1;
-  }
-
-  if (isVerboseMode())
-    PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs);
-
-  /* Walk unique (agent, executable) pairs. */
-  enum { kMaxPairs = 512 };
-  uint64_t SeenAgents[kMaxPairs];
-  uint64_t SeenExecs[kMaxPairs];
-  int NumPairs = 0;
-  int IterFailures = 0;
-
-  for (size_t i = 0; i < NumSegs; ++i) {
-    if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0)
-      continue;
-
-    int Seen = 0;
-    for (int j = 0; j < NumPairs; ++j)
-      if (SeenAgents[j] == Segs[i].agent.handle &&
-          SeenExecs[j] == Segs[i].executable.handle) {
-        Seen = 1;
-        break;
-      }
-    if (Seen)
-      continue;
-    if (NumPairs < kMaxPairs) {
-      SeenAgents[NumPairs] = Segs[i].agent.handle;
-      SeenExecs[NumPairs] = Segs[i].executable.handle;
-      NumPairs++;
-    }
-
-    const char *Arch = nullptr;
-    for (int k = 0; k < W.num_agents; ++k)
-      if (W.agents[k].agent.handle == Segs[i].agent.handle) {
-        Arch = W.agents[k].arch;
-        break;
-      }
-    if (!Arch)
-      continue; /* not a GPU agent we collected */
-
-    SymbolState S;
-    __builtin_memset(&S, 0, sizeof(S));
-    S.arch = Arch;
-    if (isVerboseMode())
-      PROF_NOTE("walking executable 0x%llx on %s\n",
-                (unsigned long long)Segs[i].executable.handle, Arch);
-    prof_hsa_status_t IterSt =
-        pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S);
-    if (IterSt != PROF_HSA_STATUS_SUCCESS &&
-        IterSt != PROF_HSA_STATUS_INFO_BREAK) {
-      PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx "
-                "failed (0x%x)\n",
-                (unsigned long long)Segs[i].executable.handle, IterSt);
-      IterFailures++;
-    }
-    W.total_found += S.found;
-    W.total_drained += S.drained;
-  }
-
-  if (isVerboseMode())
-    PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d "
-              "iter-failures=%d\n",
-              W.num_agents, NumPairs, W.total_found, W.total_drained,
-              IterFailures);
-
-  /* Latch only when we actually drained data. Deliberately do NOT latch the
-   * "walked everything but found nothing new" case: an early collect call can
-   * run before any kernel launch, and latching it would suppress the real
-   * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
-  if (W.total_drained > 0)
-    __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
-  return (IterFailures > 0) ? -1 : 0;
-}
-
-/* NOTE: deliberately no library constructor that calls hsa_init() here.
- * Bringing HSA up merely because the instrumented library was loaded poisons
- * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
- * HIP/HSA uninitialized in the parent and only touch HIP inside forked
- * children. A parent that has already hsa_init()'d makes those children crash
- * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
- * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
- * rationale there. */
-
-#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */
+/* The supplemental HSA-introspection drain (Linux only) lives in
+ * InstrProfilingPlatformROCmHSA.cpp. It reuses processDeviceOffloadPrf() and
+ * shares the drained-bounds dedup with the host-shadow path above via
+ * profRecordDrainedBounds(); both are declared in
+ * InstrProfilingPlatformROCmInternal.h. There is no Windows counterpart. */
 
 /* -------------------------------------------------------------------------- */
 /*  Collect device-side profile data                                          */
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
new file mode 100644
index 0000000000000..92a7e4fab60ad
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
@@ -0,0 +1,582 @@
+//===- InstrProfilingPlatformROCmHSA.cpp - ROCm HSA device drain ---------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Supplemental HSA-introspection drain (Linux only).
+//
+// The host-shadow drain in InstrProfilingPlatformROCm.cpp only sees device
+// code objects registered host-side (__hipRegisterVar shadows) or loaded
+// through an intercepted hipModuleLoad* call. Device code linked by the offload
+// device linker with no host-side shadow -- e.g. RCCL, whose many device
+// functions are glued into a single kernel with no source module -- is
+// invisible to it. This pass walks every GPU agent's loaded executables via
+// HSA, finds each __llvm_profile_sections table directly on the device, and
+// drains the ones the host-shadow pass did not already handle (deduped by the
+// device section-bounds tuple). It reuses processDeviceOffloadPrf() for the
+// copy/relocate/write so the on-disk profraw layout is identical.
+//
+// There is deliberately no Windows counterpart: HSA introspection is Linux-only
+// and Windows relies entirely on the host-shadow HIP drain. On any non-Linux
+// target this file compiles to an empty translation unit.
+//
+//===----------------------------------------------------------------------===//
+
+#if defined(__linux__) && !defined(_WIN32)
+
+extern "C" {
+#include "InstrProfiling.h"
+#include "InstrProfilingInternal.h"
+#include "InstrProfilingPort.h"
+}
+
+#include "InstrProfilingPlatformROCmInternal.h"
+#include "interception/interception.h"
+// C library headers (not <cstdio> etc.): clang_rt.profile is built with
+// -nostdinc++ and avoids the C++ standard library (see profile/CMakeLists.txt).
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+using namespace __prof_rocm;
+
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+  return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+  if (State)
+    return State > 0 ? 0 : -1;
+
+  if (!__interception::DynamicLoaderAvailable()) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "Dynamic library loading not available - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+  if (!Hsa)
+    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+  if (!Hsa) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  hsa_init_ty pHsaInit =
+      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+  hsa_system_get_major_extension_table_ty pGetExtTable =
+      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+          Hsa, "hsa_system_get_major_extension_table");
+  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+      Hsa, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+      Hsa, "hsa_agent_get_info");
+  pHsaExecIterAgentSyms =
+      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_iterate_agent_symbols");
+  pHsaSymGetInfo =
+      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_symbol_get_info");
+
+  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+    PROF_WARN("%s",
+              "required HSA symbols missing - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+   * rather than from the library constructor, so merely loading the
+   * instrumented library does not initialize HSA in the process -- which would
+   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+   * the parent (see the constructor note at the end of the HSA block). In the
+   * common case the drain runs from the profile write path while HSA is still
+   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+   * this simply re-initializes HSA (the process is exiting anyway). */
+  prof_hsa_status_t St = pHsaInit();
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    if (isVerboseMode())
+      PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St);
+    return setHsaRuntimeState(-1);
+  }
+
+  prof_hsa_loader_pfn_t LoaderApi;
+  __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi));
+  St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi),
+                    &LoaderApi);
+  if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) {
+    PROF_WARN("AMD loader extension unavailable (0x%x) - "
+              "HSA device profiling disabled\n",
+              St);
+    return setHsaRuntimeState(-1);
+  }
+  pQuerySegDescs = LoaderApi.query_segment_descriptors;
+
+  /* The device-to-host copies go through the shared HIP loader. */
+  ensureHipLoaded();
+  if (!hipMemcpyAvailable()) {
+    PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n");
+  return setHsaRuntimeState(1);
+}
+
+/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */
+static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
+
+/* Dedup of drained section-bounds tuples, shared with the host-shadow path
+ * (processDeviceOffloadPrf records here on every successful drain). A single
+ * linked device code object exposes one __llvm_profile_sections, but the same
+ * bounds may be seen via multiple agents, so each unique counter set is
+ * drained exactly once across both paths. */
+namespace {
+struct ProfBoundsTuple {
+  const void *data;
+  const void *cnts;
+  const void *names;
+};
+} // namespace
+
+/* Grown on demand (doubling) rather than fixed-cap: in non-RDC mode the entry
+ * count scales like num_code_objects * num_agents, so any fixed cap could be
+ * exceeded and silently lose dedup coverage (double-counting drained sections).
+ * Starts at PROF_SEEN_BOUNDS_INIT_CAP. */
+#define PROF_SEEN_BOUNDS_INIT_CAP 64
+static ProfBoundsTuple *SeenBounds = nullptr;
+static int NumSeenBounds = 0;
+static int CapSeenBounds = 0;
+
+/* Pure check: has this bounds tuple already been drained? Does not mutate
+ * state, so a transient failure does not permanently suppress retries. */
+static int profBoundsAlreadyDrained(const void *D, const void *C,
+                                    const void *N) {
+  for (int i = 0; i < NumSeenBounds; ++i)
+    if (SeenBounds[i].data == D && SeenBounds[i].cnts == C &&
+        SeenBounds[i].names == N)
+      return 1;
+  return 0;
+}
+
+/* Record a drained bounds tuple. Idempotent. Called after a successful drain
+ * (either path) so a failed attempt stays retryable. */
+void __prof_rocm::profRecordDrainedBounds(const void *D, const void *C,
+                                          const void *N) {
+  if (profBoundsAlreadyDrained(D, C, N))
+    return;
+  if (NumSeenBounds == CapSeenBounds) {
+    int NewCap = CapSeenBounds ? CapSeenBounds * 2 : PROF_SEEN_BOUNDS_INIT_CAP;
+    ProfBoundsTuple *New =
+        (ProfBoundsTuple *)realloc(SeenBounds, NewCap * sizeof(*New));
+    /* Best-effort: on OOM keep the existing table and skip recording. The
+     * worst case is that this one section is drained again later (a duplicate
+     * profraw record), never a crash. */
+    if (!New)
+      return;
+    SeenBounds = New;
+    CapSeenBounds = NewCap;
+  }
+  SeenBounds[NumSeenBounds].data = D;
+  SeenBounds[NumSeenBounds].cnts = C;
+  SeenBounds[NumSeenBounds].names = N;
+  NumSeenBounds++;
+}
+
+#define PROF_MAX_GPU_AGENTS 64
+
+namespace {
+struct GpuAgent {
+  prof_hsa_agent_t agent;
+  char arch[64];
+};
+
+struct WalkState {
+  GpuAgent agents[PROF_MAX_GPU_AGENTS];
+  int num_agents;
+  int total_found;
+  int total_drained;
+};
+
+/* Per (agent, executable) symbol-iteration state. */
+struct SymbolState {
+  const char *arch;
+  int found;
+  int drained;
+};
+} // namespace
+
+/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable,
+ * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an
+ * earlier agent) already handled the same bounds. */
+static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
+                                  prof_hsa_executable_symbol_t Sym,
+                                  void *Data) {
+  SymbolState *S = (SymbolState *)Data;
+
+  prof_hsa_symbol_kind_t Kind;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      Kind != PROF_HSA_SYMBOL_KIND_VARIABLE)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint32_t NameLen = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
+                     &NameLen) != PROF_HSA_STATUS_SUCCESS ||
+      NameLen != sizeof(ProfileSectionsSymbol) - 1)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  char NameBuf[64];
+  if (NameLen + 1 > sizeof(NameBuf))
+    return PROF_HSA_STATUS_SUCCESS;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) !=
+      PROF_HSA_STATUS_SUCCESS)
+    return PROF_HSA_STATUS_SUCCESS;
+  NameBuf[NameLen] = '\0';
+
+  if (strcmp(NameBuf, ProfileSectionsSymbol) != 0)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint64_t Addr = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+                     &Addr) != PROF_HSA_STATUS_SUCCESS ||
+      Addr == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  S->found++;
+
+  // Read the bounds table first to dedup (and detect empty sections) before
+  // the full copy/relocate done by processDeviceOffloadPrf.
+  __llvm_profile_gpu_sections Sec;
+  if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) {
+    PROF_WARN("%s", "failed to copy device bounds table\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+  if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart,
+                               Sec.NamesStart)) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "device bounds already drained, skipping\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart;
+  size_t CntsBytes =
+      (const char *)Sec.CountersStop - (const char *)Sec.CountersStart;
+  if (DataBytes == 0 || CntsBytes == 0) {
+    // Empty code object: nothing to write. Mark seen so we don't revisit it.
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  // Generate a collision-free target. Multiple distinct device code objects on
+  // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
+  static int DrainIndex = 0;
+  char Target[96];
+  if (DrainIndex == 0)
+    snprintf(Target, sizeof(Target), "%s", S->arch);
+  else
+    snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
+
+  // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
+  // Record the bounds (and advance the target index) only on success so a
+  // transient error stays retryable on a later agent or collect call.
+  int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
+  if (Rc == 0) {
+    S->drained++;
+    DrainIndex++;
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+  }
+
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
+  prof_hsa_device_type_t DevType;
+  if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      DevType != PROF_HSA_DEVICE_TYPE_GPU)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  WalkState *W = (WalkState *)Data;
+  if (W->num_agents >= PROF_MAX_GPU_AGENTS)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  GpuAgent &GA = W->agents[W->num_agents++];
+  GA.agent = Agent;
+  char Name[64];
+  __builtin_memset(Name, 0, sizeof(Name));
+  pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name);
+  size_t N = strnlen(Name, sizeof(GA.arch) - 1);
+  __builtin_memcpy(GA.arch, Name, N);
+  GA.arch[N] = '\0';
+  if (!GA.arch[0])
+    strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1);
+
+  if (isVerboseMode())
+    PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch);
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+/* Reentrancy guard and "drained data at least once" latch. The collect hook
+ * may run more than once (an explicit early __llvm_profile_write_file plus the
+ * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
+ * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
+ * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
+ * stay retryable so a later call can still pick up code objects loaded later.
+ * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
+ * destructor) from corrupting the global SeenBounds table. Both flags use
+ * acquire/release atomics. */
+static int HsaDrainInProgress = 0;
+static int HsaDrainCompleted = 0;
+
+int __prof_rocm::drainDevicesViaHsa(void) {
+  if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE))
+    return 0;
+
+  int Expected = 0;
+  if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1,
+                                   /*weak=*/0, __ATOMIC_ACQ_REL,
+                                   __ATOMIC_ACQUIRE))
+    return 0;
+
+  struct InProgressGuard {
+    ~InProgressGuard() {
+      __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE);
+    }
+  } _Guard;
+
+  if (loadHsaRuntimePointers() != 0)
+    return 0; /* Runtime unavailable: stay retryable. */
+
+  WalkState W;
+  __builtin_memset(&W, 0, sizeof(W));
+  prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W);
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St);
+    return -1;
+  }
+  if (W.num_agents == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  /* query_segment_descriptors ships in every loader-extension version and is
+   * more permissive than iterate_executables on ROCm. It yields the loaded
+   * (agent, executable) pairs directly. */
+  size_t NumSegs = 0;
+  St = pQuerySegDescs(nullptr, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St);
+    return -1;
+  }
+  if (NumSegs == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  prof_hsa_loader_segment_descriptor_t *Segs =
+      (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs));
+  if (!Segs) {
+    PROF_ERR("%s\n", "failed to allocate segment descriptor array");
+    return -1;
+  }
+  UniqueFree SegsOwner(Segs);
+
+  St = pQuerySegDescs(Segs, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St);
+    return -1;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs);
+
+  /* Walk unique (agent, executable) pairs. */
+  enum { kMaxPairs = 512 };
+  uint64_t SeenAgents[kMaxPairs];
+  uint64_t SeenExecs[kMaxPairs];
+  int NumPairs = 0;
+  int IterFailures = 0;
+
+  for (size_t i = 0; i < NumSegs; ++i) {
+    if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0)
+      continue;
+
+    int Seen = 0;
+    for (int j = 0; j < NumPairs; ++j)
+      if (SeenAgents[j] == Segs[i].agent.handle &&
+          SeenExecs[j] == Segs[i].executable.handle) {
+        Seen = 1;
+        break;
+      }
+    if (Seen)
+      continue;
+    if (NumPairs < kMaxPairs) {
+      SeenAgents[NumPairs] = Segs[i].agent.handle;
+      SeenExecs[NumPairs] = Segs[i].executable.handle;
+      NumPairs++;
+    }
+
+    const char *Arch = nullptr;
+    for (int k = 0; k < W.num_agents; ++k)
+      if (W.agents[k].agent.handle == Segs[i].agent.handle) {
+        Arch = W.agents[k].arch;
+        break;
+      }
+    if (!Arch)
+      continue; /* not a GPU agent we collected */
+
+    SymbolState S;
+    __builtin_memset(&S, 0, sizeof(S));
+    S.arch = Arch;
+    if (isVerboseMode())
+      PROF_NOTE("walking executable 0x%llx on %s\n",
+                (unsigned long long)Segs[i].executable.handle, Arch);
+    prof_hsa_status_t IterSt =
+        pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S);
+    if (IterSt != PROF_HSA_STATUS_SUCCESS &&
+        IterSt != PROF_HSA_STATUS_INFO_BREAK) {
+      PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx "
+                "failed (0x%x)\n",
+                (unsigned long long)Segs[i].executable.handle, IterSt);
+      IterFailures++;
+    }
+    W.total_found += S.found;
+    W.total_drained += S.drained;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d "
+              "iter-failures=%d\n",
+              W.num_agents, NumPairs, W.total_found, W.total_drained,
+              IterFailures);
+
+  /* Latch only when we actually drained data. Deliberately do NOT latch the
+   * "walked everything but found nothing new" case: an early collect call can
+   * run before any kernel launch, and latching it would suppress the real
+   * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
+  if (W.total_drained > 0)
+    __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
+  return (IterFailures > 0) ? -1 : 0;
+}
+
+/* NOTE: deliberately no library constructor that calls hsa_init() here.
+ * Bringing HSA up merely because the instrumented library was loaded poisons
+ * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
+ * HIP/HSA uninitialized in the parent and only touch HIP inside forked
+ * children. A parent that has already hsa_init()'d makes those children crash
+ * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
+ * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
+ * rationale there. */
+
+#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCmInternal.h b/compiler-rt/lib/profile/InstrProfilingPlatformROCmInternal.h
new file mode 100644
index 0000000000000..4d5d46f433c3e
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCmInternal.h
@@ -0,0 +1,74 @@
+//===- InstrProfilingPlatformROCmInternal.h - ROCm shared interface -------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Private interface shared between the ROCm host-shadow drain
+// (InstrProfilingPlatformROCm.cpp) and the Linux-only supplemental
+// HSA-introspection drain (InstrProfilingPlatformROCmHSA.cpp). Not a public
+// runtime header; everything lives in the __prof_rocm namespace with
+// archive-internal linkage.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef PROFILE_INSTRPROFILINGPLATFORMROCMINTERNAL_H
+#define PROFILE_INSTRPROFILINGPLATFORMROCMINTERNAL_H
+
+#include <stddef.h>
+#include <stdlib.h>
+
+// Defined at global scope in InstrProfilingPlatformROCm.cpp. Forward-declared
+// here (not redefined) so the HSA drain can name it in prototypes; the HSA path
+// only ever passes a null group, so it never needs the full definition.
+struct OffloadSectionShadowGroup;
+
+namespace __prof_rocm {
+
+// free()-based scope guard. Use .release() to transfer ownership.
+struct UniqueFree {
+  void *Ptr;
+  explicit UniqueFree(void *P = nullptr) : Ptr(P) {}
+  ~UniqueFree() { free(Ptr); }
+  UniqueFree(const UniqueFree &) = delete;
+  UniqueFree &operator=(const UniqueFree &) = delete;
+  char *get() const { return static_cast<char *>(Ptr); }
+  void reset(void *P) {
+    free(Ptr);
+    Ptr = P;
+  }
+  void *release() {
+    void *P = Ptr;
+    Ptr = nullptr;
+    return P;
+  }
+};
+
+// HIP/host-shadow helpers defined in InstrProfilingPlatformROCm.cpp and reused
+// by the HSA drain.
+int isVerboseMode();
+void ensureHipLoaded();
+// True once the loaded HIP runtime exposes hipMemcpy (device-to-host copies).
+int hipMemcpyAvailable();
+int memcpyDeviceToHost(void *Dst, const void *Src, size_t Size);
+int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
+                            const ::OffloadSectionShadowGroup *Sections);
+
+#if defined(__linux__) && !defined(_WIN32)
+// Implemented in InstrProfilingPlatformROCmHSA.cpp.
+
+// Record a drained section-bounds tuple so the supplemental HSA pass skips any
+// code object the host-shadow path already drained.
+void profRecordDrainedBounds(const void *Data, const void *Counters,
+                             const void *Names);
+
+// Walk every GPU agent's loaded executables via HSA and drain each
+// __llvm_profile_sections table the host-shadow pass did not already handle.
+int drainDevicesViaHsa(void);
+#endif
+
+} // namespace __prof_rocm
+
+#endif // PROFILE_INSTRPROFILINGPLATFORMROCMINTERNAL_H
diff --git a/utils/bazel/llvm-project-overlay/compiler-rt/BUILD.bazel b/utils/bazel/llvm-project-overlay/compiler-rt/BUILD.bazel
index ff4b381abe064..0c5e0af4cb483 100644
--- a/utils/bazel/llvm-project-overlay/compiler-rt/BUILD.bazel
+++ b/utils/bazel/llvm-project-overlay/compiler-rt/BUILD.bazel
@@ -33,6 +33,7 @@ WIN32_ONLY_FILES = [
 
 PROFILE_ROCM_FILES = [
     "lib/profile/InstrProfilingPlatformROCm.cpp",
+    "lib/profile/InstrProfilingPlatformROCmHSA.cpp",
 ]
 
 cc_library(

>From 5ec6d5ba79dff01e03d9c474aea7cac42d12ddae Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Wed, 17 Jun 2026 04:57:51 -0500
Subject: [PATCH 5/7] [PGO][HIP] Tidy HSA drain comments and fix device profraw
 target collision

Trim the verbose, AI-flavored comments in the supplemental HSA drain down to
concise intent comments, and clarify that the shared dedup table relies on
device collection being single-threaded (the acquire/release atomics only guard
the drain flags, not SeenBounds).

Fix a device-profraw filename collision: the host-shadow path and the HSA pass
could both emit the bare `arch` target for two distinct code objects on the same
arch, truncating each other's file. HSA-drained objects now use a separate
`.hsaN` suffix space, which cannot collide with the host path's `arch`/`arch.<i>`
names. HSA idempotency is already provided by the HsaDrainCompleted latch, so the
host-shadow path keeps its stable per-shadow names (overwrite-on-repeat), which
explicit-collect idempotency depends on.

Also drop the unused InstrProfilingInternal.h include from both ROCm TUs and
silence misc-use-internal-linkage on the INTERCEPTOR-generated real_* pointers
(which must keep external linkage).

Validated on 2x MI210 (gfx90a): run_gpu_tests.py GPU+AMDGPU = 15 passed.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 .../profile/InstrProfilingPlatformROCm.cpp    |  12 +-
 .../profile/InstrProfilingPlatformROCmHSA.cpp | 137 +++++++-----------
 2 files changed, 63 insertions(+), 86 deletions(-)

diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index faedf66ba89a3..e58fa06fe6fbb 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -8,7 +8,6 @@
 
 extern "C" {
 #include "InstrProfiling.h"
-#include "InstrProfilingInternal.h"
 #include "InstrProfilingPort.h"
 }
 
@@ -1201,8 +1200,10 @@ extern "C" int __llvm_profile_hip_collect_device_data(void) {
           PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
                     ArchName);
         for (int i = 0; i < NumShadowVariables; ++i) {
-          /* RDC-mode multi-shadow drains need a distinct profraw per TU;
-           * single-TU programs keep the bare arch target. */
+          /* Stable name per shadow so a repeated drain (explicit collect plus
+           * the atexit drain) overwrites its own profraw rather than emitting a
+           * second one: bare arch for a single TU, arch.<i> for RDC multi-TU.
+           */
           const char *Target = ArchName;
           char TargetWithIdx[64];
           if (NumShadowVariables > 1) {
@@ -1298,6 +1299,10 @@ static int recordHipMultiDeviceLaunchResult(int Rc,
   return Rc;
 }
 
+// The INTERCEPTOR macro defines a `real_<func>` trampoline pointer that the
+// interception runtime must see with external linkage, so it cannot be made
+// static or anonymous as misc-use-internal-linkage would otherwise suggest.
+// NOLINTBEGIN(misc-use-internal-linkage)
 INTERCEPTOR(int, hipLaunchKernel, const void *Function, HipDim3 GridDim,
             HipDim3 BlockDim, void **Args, size_t SharedMemBytes,
             HipStream Stream) {
@@ -1428,6 +1433,7 @@ INTERCEPTOR(int, hipModuleUnload, void *module) {
   __llvm_profile_offload_unregister_dynamic_module(module);
   return REAL(hipModuleUnload)(module);
 }
+// NOLINTEND(misc-use-internal-linkage)
 
 __attribute__((constructor)) static void installHipInterceptors() {
   /* Avoid interception unless the HIP runtime is already loaded. */
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
index 92a7e4fab60ad..433f6661c0448 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
@@ -8,20 +8,15 @@
 //
 // Supplemental HSA-introspection drain (Linux only).
 //
-// The host-shadow drain in InstrProfilingPlatformROCm.cpp only sees device
-// code objects registered host-side (__hipRegisterVar shadows) or loaded
-// through an intercepted hipModuleLoad* call. Device code linked by the offload
-// device linker with no host-side shadow -- e.g. RCCL, whose many device
-// functions are glued into a single kernel with no source module -- is
+// The host-shadow drain in InstrProfilingPlatformROCm.cpp only sees device code
+// objects with a host-side shadow (__hipRegisterVar) or an intercepted
+// hipModuleLoad*. Device-linked code with no host shadow (e.g. RCCL) is
 // invisible to it. This pass walks every GPU agent's loaded executables via
-// HSA, finds each __llvm_profile_sections table directly on the device, and
-// drains the ones the host-shadow pass did not already handle (deduped by the
-// device section-bounds tuple). It reuses processDeviceOffloadPrf() for the
-// copy/relocate/write so the on-disk profraw layout is identical.
+// HSA, finds each __llvm_profile_sections table on the device, and drains the
+// ones the host-shadow pass missed (deduped by the section-bounds tuple). It
+// reuses processDeviceOffloadPrf() so the profraw layout is identical.
 //
-// There is deliberately no Windows counterpart: HSA introspection is Linux-only
-// and Windows relies entirely on the host-shadow HIP drain. On any non-Linux
-// target this file compiles to an empty translation unit.
+// HSA introspection is Linux-only; on any other target this is an empty TU.
 //
 //===----------------------------------------------------------------------===//
 
@@ -29,14 +24,12 @@
 
 extern "C" {
 #include "InstrProfiling.h"
-#include "InstrProfilingInternal.h"
 #include "InstrProfilingPort.h"
 }
 
 #include "InstrProfilingPlatformROCmInternal.h"
 #include "interception/interception.h"
-// C library headers (not <cstdio> etc.): clang_rt.profile is built with
-// -nostdinc++ and avoids the C++ standard library (see profile/CMakeLists.txt).
+// C (not C++) headers: clang_rt.profile is built -nostdinc++.
 #include <stddef.h>
 #include <stdint.h>
 #include <stdio.h>
@@ -45,9 +38,9 @@ extern "C" {
 
 using namespace __prof_rocm;
 
-/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
- * build time, so mirror just the handful of HSA declarations the drain needs.
- * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+/* Minimal HSA type/enum stubs: compiler-rt cannot depend on ROCm headers, so
+ * mirror the few declarations the drain needs. Values match hsa/hsa.h and
+ * hsa/hsa_ven_amd_loader.h. */
 typedef uint32_t prof_hsa_status_t;
 #define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
 #define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
@@ -117,8 +110,8 @@ typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
 typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
     prof_hsa_loader_segment_descriptor_t *, size_t *);
 
-/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
- * query_segment_descriptors is used; query_host_address keeps the offset. */
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t; query_host_address only
+ * pads the offset to query_segment_descriptors. */
 typedef struct {
   void *query_host_address;
   hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
@@ -130,10 +123,8 @@ static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
 static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
 static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
 
-/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
- * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
- * the fully-written p* function pointers (published before the release store
- * of HsaRuntimeState=1 below). */
+/* 0 = not attempted, 1 = ready, -1 = unavailable. Acquire/release atomics: a
+ * thread observing HsaRuntimeState==1 also sees the published p* pointers. */
 static int HsaRuntimeState = 0;
 
 static int setHsaRuntimeState(int S) {
@@ -141,9 +132,8 @@ static int setHsaRuntimeState(int S) {
   return S > 0 ? 0 : -1;
 }
 
-/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
- * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
- * resolved by the shared ensureHipLoaded() above. */
+/* Resolve HSA entry points and the AMD loader extension once, and confirm HIP's
+ * hipMemcpy is reachable for the device-to-host copies. */
 static int loadHsaRuntimePointers(void) {
   int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
   if (State)
@@ -189,14 +179,8 @@ static int loadHsaRuntimePointers(void) {
     return setHsaRuntimeState(-1);
   }
 
-  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
-   * rather than from the library constructor, so merely loading the
-   * instrumented library does not initialize HSA in the process -- which would
-   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
-   * the parent (see the constructor note at the end of the HSA block). In the
-   * common case the drain runs from the profile write path while HSA is still
-   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
-   * this simply re-initializes HSA (the process is exiting anyway). */
+  /* Bring HSA up lazily on the first drain (idempotent, refcounted), never from
+   * a library constructor -- see the fork-safety note at end of file. */
   prof_hsa_status_t St = pHsaInit();
   if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
     if (isVerboseMode())
@@ -232,10 +216,8 @@ static int loadHsaRuntimePointers(void) {
 static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
 
 /* Dedup of drained section-bounds tuples, shared with the host-shadow path
- * (processDeviceOffloadPrf records here on every successful drain). A single
- * linked device code object exposes one __llvm_profile_sections, but the same
- * bounds may be seen via multiple agents, so each unique counter set is
- * drained exactly once across both paths. */
+ * (processDeviceOffloadPrf records here on every successful drain) so each
+ * unique counter set is drained exactly once across both paths. */
 namespace {
 struct ProfBoundsTuple {
   const void *data;
@@ -244,17 +226,14 @@ struct ProfBoundsTuple {
 };
 } // namespace
 
-/* Grown on demand (doubling) rather than fixed-cap: in non-RDC mode the entry
- * count scales like num_code_objects * num_agents, so any fixed cap could be
- * exceeded and silently lose dedup coverage (double-counting drained sections).
- * Starts at PROF_SEEN_BOUNDS_INIT_CAP. */
+/* Grown on demand (doubling) rather than fixed-cap: a fixed cap could be
+ * exceeded and silently lose dedup coverage, double-counting sections. */
 #define PROF_SEEN_BOUNDS_INIT_CAP 64
 static ProfBoundsTuple *SeenBounds = nullptr;
 static int NumSeenBounds = 0;
 static int CapSeenBounds = 0;
 
-/* Pure check: has this bounds tuple already been drained? Does not mutate
- * state, so a transient failure does not permanently suppress retries. */
+/* Has this bounds tuple already been drained? Pure check, no state mutation. */
 static int profBoundsAlreadyDrained(const void *D, const void *C,
                                     const void *N) {
   for (int i = 0; i < NumSeenBounds; ++i)
@@ -264,8 +243,8 @@ static int profBoundsAlreadyDrained(const void *D, const void *C,
   return 0;
 }
 
-/* Record a drained bounds tuple. Idempotent. Called after a successful drain
- * (either path) so a failed attempt stays retryable. */
+/* Record a drained bounds tuple. Idempotent; call only after a successful drain
+ * so a failed attempt stays retryable. */
 void __prof_rocm::profRecordDrainedBounds(const void *D, const void *C,
                                           const void *N) {
   if (profBoundsAlreadyDrained(D, C, N))
@@ -274,9 +253,8 @@ void __prof_rocm::profRecordDrainedBounds(const void *D, const void *C,
     int NewCap = CapSeenBounds ? CapSeenBounds * 2 : PROF_SEEN_BOUNDS_INIT_CAP;
     ProfBoundsTuple *New =
         (ProfBoundsTuple *)realloc(SeenBounds, NewCap * sizeof(*New));
-    /* Best-effort: on OOM keep the existing table and skip recording. The
-     * worst case is that this one section is drained again later (a duplicate
-     * profraw record), never a crash. */
+    /* On OOM, keep the old table and skip recording: worst case this section is
+     * drained again later (a duplicate record), never a crash. */
     if (!New)
       return;
     SeenBounds = New;
@@ -376,20 +354,17 @@ static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
     return PROF_HSA_STATUS_SUCCESS;
   }
 
-  // Generate a collision-free target. Multiple distinct device code objects on
-  // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
+  // Name HSA-drained objects in their own ".hsaN" suffix space so they never
+  // collide with the host-shadow path's "arch"/"arch.<i>" filenames. The drain
+  // latch (HsaDrainCompleted) already prevents re-draining an object, so a
+  // plain per-drain counter is enough for uniqueness.
   static int DrainIndex = 0;
   char Target[96];
-  if (DrainIndex == 0)
-    snprintf(Target, sizeof(Target), "%s", S->arch);
-  else
-    snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
+  snprintf(Target, sizeof(Target), "%s.hsa%d", S->arch, DrainIndex);
 
-  // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
-  // Record the bounds (and advance the target index) only on success so a
+  // Record the bounds (and advance the index) only on a successful write so a
   // transient error stays retryable on a later agent or collect call.
-  int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
-  if (Rc == 0) {
+  if (processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr) == 0) {
     S->drained++;
     DrainIndex++;
     profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
@@ -425,15 +400,14 @@ static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
   return PROF_HSA_STATUS_SUCCESS;
 }
 
-/* Reentrancy guard and "drained data at least once" latch. The collect hook
- * may run more than once (an explicit early __llvm_profile_write_file plus the
- * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
- * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
- * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
- * stay retryable so a later call can still pick up code objects loaded later.
- * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
- * destructor) from corrupting the global SeenBounds table. Both flags use
- * acquire/release atomics. */
+/* Reentrancy guard and "drained at least once" latch (both acquire/release).
+ * The collect hook may run more than once (an early __llvm_profile_write_file
+ * plus the exit write): a successful walk latches HsaDrainCompleted so we never
+ * re-emit duplicate .profraw files, while no-op outcomes stay retryable for a
+ * later call. HsaDrainInProgress serializes reentrant HSA walks (e.g. from a
+ * library destructor); note it does not guard against a host-shadow
+ * processDeviceOffloadPrf() on another thread mutating SeenBounds concurrently
+ * -- the dedup table relies on device collection being single-threaded. */
 static int HsaDrainInProgress = 0;
 static int HsaDrainCompleted = 0;
 
@@ -469,8 +443,8 @@ int __prof_rocm::drainDevicesViaHsa(void) {
     return 0;
   }
 
-  /* query_segment_descriptors ships in every loader-extension version and is
-   * more permissive than iterate_executables on ROCm. It yields the loaded
+  /* query_segment_descriptors ships in every loader-extension version, is more
+   * permissive than iterate_executables on ROCm, and yields the loaded
    * (agent, executable) pairs directly. */
   size_t NumSegs = 0;
   St = pQuerySegDescs(nullptr, &NumSegs);
@@ -561,22 +535,19 @@ int __prof_rocm::drainDevicesViaHsa(void) {
               W.num_agents, NumPairs, W.total_found, W.total_drained,
               IterFailures);
 
-  /* Latch only when we actually drained data. Deliberately do NOT latch the
-   * "walked everything but found nothing new" case: an early collect call can
-   * run before any kernel launch, and latching it would suppress the real
-   * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
+  /* Latch only when we actually drained data. A "found nothing new" walk is
+   * deliberately not latched: an early collect can precede any kernel launch,
+   * and latching it would suppress the real exit-time drain. No-op walks are
+   * cheap to repeat. */
   if (W.total_drained > 0)
     __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
   return (IterFailures > 0) ? -1 : 0;
 }
 
-/* NOTE: deliberately no library constructor that calls hsa_init() here.
- * Bringing HSA up merely because the instrumented library was loaded poisons
- * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
- * HIP/HSA uninitialized in the parent and only touch HIP inside forked
- * children. A parent that has already hsa_init()'d makes those children crash
- * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
- * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
- * rationale there. */
+/* Fork-safety: deliberately no library constructor calling hsa_init(). HSA
+ * state is invalid across fork(), so initializing it just because the
+ * instrumented library loaded would crash fork-based callers (e.g. RCCL's unit
+ * tests) that keep HIP/HSA uninitialized in the parent and only touch HIP in
+ * forked children. HSA is instead brought up lazily during the drain. */
 
 #endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */

>From bd8e98f6a346cda7bb464c0b7210dd5fad775c11 Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Wed, 17 Jun 2026 05:31:46 -0500
Subject: [PATCH 6/7] [PGO][HIP] Extract host-shadow drain into a helper

Factor the host-shadow drain body out of __llvm_profile_hip_collect_device_data
into collectHostShadowData(). The collect entry point now just gates the
host-shadow pass and runs the supplemental HSA drain, flattening the previous
four-level nesting and removing the whole-block reindentation. No behavior
change.

Validated on 2x MI210 (gfx90a): run_gpu_tests.py GPU+AMDGPU = 15 passed.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 .../profile/InstrProfilingPlatformROCm.cpp    | 143 +++++++++---------
 1 file changed, 74 insertions(+), 69 deletions(-)

diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index e58fa06fe6fbb..ba70625e98fa0 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -1155,87 +1155,92 @@ static int isHipAvailable(void) {
 /*  Collect device-side profile data                                          */
 /* -------------------------------------------------------------------------- */
 
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
+/* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows) and
+ * intercepted dynamic modules. The caller gates this on
+ * (NumShadowVariables || NumDynamicModules) && isHipAvailable(); pure
+ * device-linked programs (RCCL) are handled by the supplemental HSA pass. */
+static int collectHostShadowData(void) {
   int Ret = 0;
 
-  /* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows)
-   * and intercepted dynamic modules. Only meaningful when something registered
-   * host-side; skipped entirely for pure device-linked programs (RCCL), which
-   * the supplemental HSA pass below handles. */
-  if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable()) {
-    /* Shadow variables (static-linked kernels): drain from every device. */
-    if (NumShadowVariables > 0) {
-      int OrigDevice = -1;
-      hipGetDevice(&OrigDevice);
-
-      for (int Dev = 0; Dev < NumDevices; ++Dev) {
-        if (!shouldCollectDevice(Dev)) {
-          if (isVerboseMode())
-            PROF_NOTE("Skipping unused device %d\n", Dev);
-          continue;
-        }
+  /* Shadow variables (static-linked kernels): drain from every device. */
+  if (NumShadowVariables > 0) {
+    int OrigDevice = -1;
+    hipGetDevice(&OrigDevice);
+
+    for (int Dev = 0; Dev < NumDevices; ++Dev) {
+      if (!shouldCollectDevice(Dev)) {
+        if (isVerboseMode())
+          PROF_NOTE("Skipping unused device %d\n", Dev);
+        continue;
+      }
 #if defined(__linux__) && !defined(_WIN32)
-        /* When no kernel launch was tracked at all, shouldCollectDevice()
-         * falls back to collect-all, which can fault/hang reading a
-         * non-resident device's sections on a multi-GPU host (e.g. a program
-         * that never launches, collects before its first launch, or launches
-         * only via an untracked API). On Linux the supplemental HSA drain
-         * below covers those cases safely -- it walks only code objects
-         * actually resident on each agent -- so skip the host-shadow pass
-         * entirely rather than take the unsafe fallback. */
-        if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) {
-          if (isVerboseMode())
-            PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n",
-                      Dev);
-          continue;
-        }
+      /* When no kernel launch was tracked at all, shouldCollectDevice() falls
+       * back to collect-all, which can fault/hang reading a non-resident
+       * device's sections on a multi-GPU host (e.g. a program that never
+       * launches, collects before its first launch, or launches only via an
+       * untracked API). On Linux the supplemental HSA drain covers those cases
+       * safely -- it walks only code objects actually resident on each agent --
+       * so skip the host-shadow pass rather than take the unsafe fallback. */
+      if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) {
+        if (isVerboseMode())
+          PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n",
+                    Dev);
+        continue;
+      }
 #endif
-        if (hipSetDevice(Dev) != 0) {
-          if (isVerboseMode())
-            PROF_NOTE("Failed to set device %d, skipping\n", Dev);
-          continue;
-        }
-        const char *ArchName = getDeviceArchName(Dev);
+      if (hipSetDevice(Dev) != 0) {
         if (isVerboseMode())
-          PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
-                    ArchName);
-        for (int i = 0; i < NumShadowVariables; ++i) {
-          /* Stable name per shadow so a repeated drain (explicit collect plus
-           * the atexit drain) overwrites its own profraw rather than emitting a
-           * second one: bare arch for a single TU, arch.<i> for RDC multi-TU.
-           */
-          const char *Target = ArchName;
-          char TargetWithIdx[64];
-          if (NumShadowVariables > 1) {
-            snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName,
-                     i);
-            Target = TargetWithIdx;
-          }
-          if (processShadowVariable(i, Target) != 0)
-            Ret = -1;
+          PROF_NOTE("Failed to set device %d, skipping\n", Dev);
+        continue;
+      }
+      const char *ArchName = getDeviceArchName(Dev);
+      if (isVerboseMode())
+        PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
+                  ArchName);
+      for (int i = 0; i < NumShadowVariables; ++i) {
+        /* Stable name per shadow so a repeated drain (explicit collect plus the
+         * atexit drain) overwrites its own profraw rather than emitting a
+         * second one: bare arch for a single TU, arch.<i> for RDC multi-TU. */
+        const char *Target = ArchName;
+        char TargetWithIdx[64];
+        if (NumShadowVariables > 1) {
+          snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, i);
+          Target = TargetWithIdx;
         }
+        if (processShadowVariable(i, Target) != 0)
+          Ret = -1;
       }
-
-      if (OrigDevice >= 0)
-        hipSetDevice(OrigDevice);
     }
 
-    /* Warn about unprocessed TUs; skip cleared slots (already drained). */
-    lockDynamicModules();
-    for (int i = 0; i < NumDynamicModules; ++i) {
-      OffloadDynamicModuleInfo *MI = &DynamicModules[i];
-      if (!MI->ModulePtr)
-        continue;
-      for (int t = 0; t < MI->NumTUs; ++t) {
-        if (!MI->TUs[t].Processed) {
-          PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
-                    MI->ModulePtr, t);
-          Ret = -1;
-        }
+    if (OrigDevice >= 0)
+      hipSetDevice(OrigDevice);
+  }
+
+  /* Warn about unprocessed TUs; skip cleared slots (already drained). */
+  lockDynamicModules();
+  for (int i = 0; i < NumDynamicModules; ++i) {
+    OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+    if (!MI->ModulePtr)
+      continue;
+    for (int t = 0; t < MI->NumTUs; ++t) {
+      if (!MI->TUs[t].Processed) {
+        PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
+                  MI->ModulePtr, t);
+        Ret = -1;
       }
     }
-    unlockDynamicModules();
   }
+  unlockDynamicModules();
+
+  return Ret;
+}
+
+extern "C" int __llvm_profile_hip_collect_device_data(void) {
+  int Ret = 0;
+
+  if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable() &&
+      collectHostShadowData() != 0)
+    Ret = -1;
 
 #if defined(__linux__) && !defined(_WIN32)
   /* Supplemental HSA-introspection drain: catches device code objects with no

>From fe13b8bcbf52ab54d3d44aba35ba4939d38a565c Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Wed, 17 Jun 2026 10:43:32 -0500
Subject: [PATCH 7/7] [PGO][HIP] Isolate mirrored HSA decls and verify them
 against real headers

HSA is dlopened, not linked, so the drain hand-mirrors the few HSA
declarations it needs. Move that block out of InstrProfilingPlatformROCmHSA.cpp
into a dedicated InstrProfilingPlatformROCmHSADefs.h to confine the maintenance
surface, mirroring offload's dynamic_hsa/ approach.

Add an opt-in build-time drift guard: when the real ROCm headers are available
(find_package(hsa-runtime64) in CMakeLists.txt defines PROFILE_VERIFY_HSA_ABI
and adds the include dir), static_asserts cross-check every mirrored enum value
and the loader-segment-descriptor / pfn-table layout against hsa/hsa.h and
hsa/hsa_ven_amd_loader.h. It is never a build requirement -- on hosts without
ROCm the checks are skipped and the mirror stands alone. Linux only.

Validated on 2x MI210 (gfx90a): the HSA TU built with -DPROFILE_VERIFY_HSA_ABI
against ROCm 7.1 headers (asserts pass), run_gpu_tests.py GPU+AMDGPU = 15 passed.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 compiler-rt/lib/profile/CMakeLists.txt        |  22 +++
 .../profile/InstrProfilingPlatformROCmHSA.cpp | 147 ++++++++----------
 .../InstrProfilingPlatformROCmHSADefs.h       | 105 +++++++++++++
 3 files changed, 196 insertions(+), 78 deletions(-)
 create mode 100644 compiler-rt/lib/profile/InstrProfilingPlatformROCmHSADefs.h

diff --git a/compiler-rt/lib/profile/CMakeLists.txt b/compiler-rt/lib/profile/CMakeLists.txt
index cdb17ce7d189c..59cf523eef63a 100644
--- a/compiler-rt/lib/profile/CMakeLists.txt
+++ b/compiler-rt/lib/profile/CMakeLists.txt
@@ -245,6 +245,28 @@ if(COMPILER_RT_BUILD_PROFILE_ROCM AND NOT COMPILER_RT_PROFILE_BAREMETAL
   append_list_if(COMPILER_RT_HAS_FNO_EXCEPTIONS_FLAG -fno-exceptions
                  PROFILE_ROCM_FLAGS)
 
+  # Optional build-time verification of the mirrored HSA ABI in
+  # InstrProfilingPlatformROCmHSA.cpp. HSA is dlopened (never linked), so the
+  # declarations are hand-mirrored; when the real ROCm headers happen to be
+  # available, compile the static_assert cross-checks against them. This is
+  # never a build requirement -- if the package is absent, the checks are simply
+  # skipped. Linux only, matching the supplemental HSA drain.
+  if("${CMAKE_SYSTEM_NAME}" STREQUAL "Linux")
+    find_package(hsa-runtime64 QUIET HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
+    if(hsa-runtime64_FOUND)
+      get_target_property(_profile_hsa_inc hsa-runtime64::hsa-runtime64
+                          INTERFACE_INCLUDE_DIRECTORIES)
+      if(_profile_hsa_inc)
+        message(STATUS "clang_rt.profile_rocm: verifying HSA ABI against "
+                       "${_profile_hsa_inc}")
+        list(APPEND PROFILE_ROCM_FLAGS -DPROFILE_VERIFY_HSA_ABI=1)
+        foreach(_inc ${_profile_hsa_inc})
+          list(APPEND PROFILE_ROCM_FLAGS "-isystem${_inc}")
+        endforeach()
+      endif()
+    endif()
+  endif()
+
   # The interceptor path needs sanitizer_common symbols; merge the same object
   # libs as clang_rt.cfi so the archive stays self-contained.
   set(PROFILE_ROCM_OBJECT_LIBS
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
index 433f6661c0448..8a269f1ca1470 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSA.cpp
@@ -38,84 +38,75 @@ extern "C" {
 
 using namespace __prof_rocm;
 
-/* Minimal HSA type/enum stubs: compiler-rt cannot depend on ROCm headers, so
- * mirror the few declarations the drain needs. Values match hsa/hsa.h and
- * hsa/hsa_ven_amd_loader.h. */
-typedef uint32_t prof_hsa_status_t;
-#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
-#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
-
-typedef struct {
-  uint64_t handle;
-} prof_hsa_agent_t;
-typedef struct {
-  uint64_t handle;
-} prof_hsa_executable_t;
-typedef struct {
-  uint64_t handle;
-} prof_hsa_executable_symbol_t;
-
-typedef uint32_t prof_hsa_agent_info_t;
-#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
-#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
-
-typedef uint32_t prof_hsa_device_type_t;
-#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
-
-typedef uint32_t prof_hsa_symbol_kind_t;
-#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
-
-typedef uint32_t prof_hsa_executable_symbol_info_t;
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
-  ((prof_hsa_executable_symbol_info_t)0)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
-  ((prof_hsa_executable_symbol_info_t)1)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
-  ((prof_hsa_executable_symbol_info_t)2)
-#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
-  ((prof_hsa_executable_symbol_info_t)21)
-
-#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
-
-typedef uint32_t prof_hsa_loader_storage_type_t;
-
-typedef struct {
-  prof_hsa_agent_t agent;
-  prof_hsa_executable_t executable;
-  prof_hsa_loader_storage_type_t code_object_storage_type;
-  const void *code_object_storage_base;
-  size_t code_object_storage_size;
-  size_t code_object_storage_offset;
-  const void *segment_base;
-  size_t segment_size;
-} prof_hsa_loader_segment_descriptor_t;
-
-typedef prof_hsa_status_t (*hsa_init_ty)(void);
-typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
-    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
-typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
-                                                   prof_hsa_agent_info_t,
-                                                   void *);
-typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
-    prof_hsa_executable_t, prof_hsa_agent_t,
-    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
-                          prof_hsa_executable_symbol_t, void *),
-    void *);
-typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
-    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
-typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
-                                                                     uint16_t,
-                                                                     size_t,
-                                                                     void *);
-typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
-    prof_hsa_loader_segment_descriptor_t *, size_t *);
-
-/* First two members of hsa_ven_amd_loader_1_00_pfn_t; query_host_address only
- * pads the offset to query_segment_descriptors. */
-typedef struct {
-  void *query_host_address;
-  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
-} prof_hsa_loader_pfn_t;
+// Mirrored HSA declarations the drain needs (dlopen'd, not linked). See the
+// header for the rationale; the values are HSA's stable C ABI.
+#include "InstrProfilingPlatformROCmHSADefs.h"
+
+#ifdef PROFILE_VERIFY_HSA_ABI
+// When the real ROCm headers are available at build time (developer installs
+// and the downstream GPU CI), assert the mirror above still matches them. This
+// is never required to build -- on hosts without ROCm the macro is undefined
+// and the mirror stands alone (see find_package(hsa-runtime64) in
+// CMakeLists.txt).
+#include <hsa/hsa.h>
+#include <hsa/hsa_ven_amd_loader.h>
+
+static_assert(PROF_HSA_STATUS_SUCCESS == HSA_STATUS_SUCCESS, "HSA ABI drift");
+static_assert(PROF_HSA_STATUS_INFO_BREAK == HSA_STATUS_INFO_BREAK,
+              "HSA ABI drift");
+static_assert(PROF_HSA_AGENT_INFO_NAME == HSA_AGENT_INFO_NAME, "HSA ABI drift");
+static_assert(PROF_HSA_AGENT_INFO_DEVICE == HSA_AGENT_INFO_DEVICE,
+              "HSA ABI drift");
+static_assert(PROF_HSA_DEVICE_TYPE_GPU == HSA_DEVICE_TYPE_GPU, "HSA ABI drift");
+static_assert(PROF_HSA_SYMBOL_KIND_VARIABLE == HSA_SYMBOL_KIND_VARIABLE,
+              "HSA ABI drift");
+static_assert(PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE ==
+                  HSA_EXECUTABLE_SYMBOL_INFO_TYPE,
+              "HSA ABI drift");
+static_assert(PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH ==
+                  HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
+              "HSA ABI drift");
+static_assert(PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME ==
+                  HSA_EXECUTABLE_SYMBOL_INFO_NAME,
+              "HSA ABI drift");
+static_assert(PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS ==
+                  HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+              "HSA ABI drift");
+static_assert(PROF_HSA_EXTENSION_AMD_LOADER == HSA_EXTENSION_AMD_LOADER,
+              "HSA ABI drift");
+
+static_assert(sizeof(prof_hsa_agent_t) == sizeof(hsa_agent_t), "HSA ABI drift");
+static_assert(sizeof(prof_hsa_executable_t) == sizeof(hsa_executable_t),
+              "HSA ABI drift");
+static_assert(sizeof(prof_hsa_executable_symbol_t) ==
+                  sizeof(hsa_executable_symbol_t),
+              "HSA ABI drift");
+
+static_assert(sizeof(prof_hsa_loader_segment_descriptor_t) ==
+                  sizeof(hsa_ven_amd_loader_segment_descriptor_t),
+              "HSA ABI drift");
+static_assert(offsetof(prof_hsa_loader_segment_descriptor_t, agent) ==
+                  offsetof(hsa_ven_amd_loader_segment_descriptor_t, agent),
+              "HSA ABI drift");
+static_assert(offsetof(prof_hsa_loader_segment_descriptor_t, executable) ==
+                  offsetof(hsa_ven_amd_loader_segment_descriptor_t, executable),
+              "HSA ABI drift");
+static_assert(offsetof(prof_hsa_loader_segment_descriptor_t, segment_base) ==
+                  offsetof(hsa_ven_amd_loader_segment_descriptor_t,
+                           segment_base),
+              "HSA ABI drift");
+static_assert(offsetof(prof_hsa_loader_segment_descriptor_t, segment_size) ==
+                  offsetof(hsa_ven_amd_loader_segment_descriptor_t,
+                           segment_size),
+              "HSA ABI drift");
+
+// We fetch the loader pfn table by raw layout, so query_segment_descriptors
+// must sit at the same offset as in the real table.
+static_assert(offsetof(prof_hsa_loader_pfn_t, query_segment_descriptors) ==
+                  offsetof(hsa_ven_amd_loader_1_00_pfn_t,
+                           hsa_ven_amd_loader_query_segment_descriptors),
+              "HSA ABI drift");
+#endif // PROFILE_VERIFY_HSA_ABI
 
 static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
 static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSADefs.h b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSADefs.h
new file mode 100644
index 0000000000000..17bb96277e445
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCmHSADefs.h
@@ -0,0 +1,105 @@
+//===- InstrProfilingPlatformROCmHSADefs.h - mirrored HSA decls ----------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Minimal HSA type/enum/function-pointer declarations used by the Linux-only
+// supplemental HSA drain (InstrProfilingPlatformROCmHSA.cpp). compiler-rt
+// cannot depend on the ROCm headers at build time, and the runtime dlopens
+// libhsa-runtime64.so rather than linking it, so the handful of declarations
+// the drain needs are mirrored here under a prof_hsa_* prefix.
+//
+// Values mirror hsa/hsa.h and hsa/hsa_ven_amd_loader.h. These are part of HSA's
+// stable, versioned C ABI (libhsa-runtime64.so.1), so they do not shift. When
+// the real ROCm headers are available at build time the mirror is cross-checked
+// against them with static_asserts (see PROFILE_VERIFY_HSA_ABI in
+// InstrProfilingPlatformROCmHSA.cpp); update both together if HSA ever changes.
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef PROFILE_INSTRPROFILINGPLATFORMROCMHSADEFS_H
+#define PROFILE_INSTRPROFILINGPLATFORMROCMHSADEFS_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t; query_host_address only
+ * pads the offset to query_segment_descriptors. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+#endif // PROFILE_INSTRPROFILINGPLATFORMROCMHSADEFS_H