[clang] [compiler-rt] [PGO][HIP] HSA-introspection device profile drain + GPU PGO tests (PR #203056)

Sat Jun 13 14:53:47 PDT 2026

https://github.com/lfmeadow updated https://github.com/llvm/llvm-project/pull/203056

>From 575f97951812585229c35ed82ff9bb3ce5c5519e Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Sat, 13 Jun 2026 11:48:37 -0500
Subject: [PATCH] [PGO][HIP] Add supplemental HSA-introspection device drain
 and GPU PGO tests

The host-shadow device-profile drain (InstrProfilingPlatformROCm.cpp) can only
collect device counters for kernels that registered a host-side shadow via
__hipRegisterVar. Device-linked programs (e.g. RCCL) link the instrumented code
object directly into the device image with no host shadow, so their counters are
never drained.

Add a supplemental, Linux-only drain that introspects the loaded code objects via
the HSA runtime: it walks each GPU agent, enumerates only the code objects
actually resident there, reads each one's __llvm_profile_sections, and routes
them through the existing processDeviceOffloadPrf() path so the emitted profraw
layout is identical. A small content-dedup set keyed on the
(data, counters, names) device-pointer triple ensures a section already drained
by the host-shadow pass is not drained again, so the two passes compose without
double-counting. HSA is brought up lazily from the drain (never from a library
constructor) to avoid poisoning fork-based callers.

Because the HSA walk only ever touches resident code objects, it also makes the
host-shadow pass's collect-all fallback unnecessary on Linux: when no kernel
launch was tracked (a program that never launches, collects before its first
launch, or launches only via an untracked API), the host-shadow pass is skipped
and the HSA drain covers it safely instead of faulting/hanging on a non-resident
device on a multi-GPU host.

Also add a GPU-executed test suite (compiler-rt/test/profile/{GPU,AMDGPU}/*.hip)
and a dependency-free "lit-lite" runner (run_gpu_tests.py) so the device drain
can be exercised on a real AMD GPU runner: basic/coverage/pgo-use, multi-kernel,
device-branching, multi-GPU and non-default-device drain + dedup, early-collect /
no-kernel edges, RDC vs non-RDC __llvm_profile_sections, fork-safety (the RCCL
parent-no-HIP / kernel-in-forked-child pattern), quantitative device-counter
correctness, multi-process offline accumulation, and explicit-collect
idempotency. A standalone device-pgo/ build helper reproduces the toolchain
locally.

Link the device profile runtime on both the new-offload-driver
(LinkerWrapper::ConstructJob) and traditional (HIPAMD constructLldCommand) link
paths so instrumented device images resolve the runtime symbols.

Co-authored-by: Cursor <cursoragent at cursor.com>
---
 clang/lib/Driver/ToolChains/Clang.cpp         |  15 +
 clang/lib/Driver/ToolChains/HIPAMD.cpp        |  20 +
 .../profile/InstrProfilingPlatformROCm.cpp    | 682 ++++++++++++++++--
 .../test/profile/AMDGPU/device-basic.hip      |  67 ++
 .../profile/AMDGPU/device-early-collect.hip   |  68 ++
 .../test/profile/AMDGPU/device-no-kernel.hip  |  44 ++
 .../test/profile/AMDGPU/device-symbols.hip    |  42 ++
 .../test/profile/AMDGPU/lit.local.cfg.py      |   4 +
 .../test/profile/GPU/instrprof-hip-basic.hip  |  51 ++
 .../GPU/instrprof-hip-collect-after.hip       |  63 ++
 .../GPU/instrprof-hip-counter-correctness.hip |  56 ++
 .../profile/GPU/instrprof-hip-coverage.hip    |  51 ++
 .../GPU/instrprof-hip-device-branching.hip    |  67 ++
 .../profile/GPU/instrprof-hip-fork-safety.hip |  61 ++
 .../profile/GPU/instrprof-hip-multi-gpu.hip   |  57 ++
 .../GPU/instrprof-hip-multi-process-merge.hip |  63 ++
 .../GPU/instrprof-hip-multiple-kernels.hip    |  58 ++
 .../GPU/instrprof-hip-nondefault-device.hip   |  60 ++
 .../profile/GPU/instrprof-hip-pgo-use.hip     |  63 ++
 compiler-rt/test/profile/device-pgo/README.md | 125 ++++
 compiler-rt/test/profile/device-pgo/build.sh  |  56 ++
 .../profile/device-pgo/toolchain-cache.cmake  |  55 ++
 compiler-rt/test/profile/run_gpu_tests.py     | 380 ++++++++++
 23 files changed, 2156 insertions(+), 52 deletions(-)
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-basic.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-early-collect.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/device-symbols.hip
 create mode 100644 compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
 create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
 create mode 100644 compiler-rt/test/profile/device-pgo/README.md
 create mode 100755 compiler-rt/test/profile/device-pgo/build.sh
 create mode 100644 compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
 create mode 100644 compiler-rt/test/profile/run_gpu_tests.py

diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index c2ac478d84929..3b8bc46820af6 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
           (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX()))
         LinkerArgs.emplace_back("-lompdevice");
 
+      // With PGO/coverage instrumentation, GPU device code references the
+      // device profile runtime (__llvm_profile_instrument_gpu and the
+      // __llvm_profile_sections bounds table emitted by
+      // InstrProfilingPlatformGPU). The offload device link does not otherwise
+      // pull it in, so forward the static device profile runtime to the GPU
+      // device linker. The archive is arch-suffixed, so pass its full path
+      // rather than a -l name.
+      if (ToolChain::needsProfileRT(Args) &&
+          (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) {
+        std::string ProfileRT =
+            TC->getCompilerRT(Args, "profile", ToolChain::FT_Static);
+        if (TC->getVFS().exists(ProfileRT))
+          LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT));
+      }
+
       // For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree
       // SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these
       // flags to normal compilation.
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 01cb23d0aa230..1bd4e073b4e27 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
 #include "clang/Options/Options.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
 #include "llvm/TargetParser/TargetParser.h"
 
 using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
 
   LldArgs.push_back("--no-whole-archive");
 
+  // With PGO/coverage instrumentation, instrumented device code references the
+  // device profile runtime (__llvm_profile_instrument_gpu and the
+  // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+  // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+  // but HIP using the traditional offload path (e.g. on Windows, which does not
+  // route device linking through clang-linker-wrapper) reaches the device link
+  // here instead. Forward the static device profile runtime to this lld device
+  // link so the runtime is pulled in regardless of offload-driver/host OS. The
+  // archive is arch-suffixed, so pass its full path rather than a -l name.
+  if (ToolChain::needsProfileRT(Args)) {
+    std::string ProfileRT =
+        TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+    // Use the ToolChain VFS (matches the new-offload-driver path in
+    // Clang.cpp) so overlay/virtual filesystems used by the driver are
+    // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+    if (TC.getVFS().exists(ProfileRT))
+      LldArgs.push_back(Args.MakeArgString(ProfileRT));
+  }
+
   const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
   C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
                                          Lld, LldArgs, Inputs, Output));
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
 static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
                                    const OffloadSectionShadowGroup *Sections);
 
+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+                                    const void *Names);
+#endif
+
 static int isVerboseMode() {
   static int IsVerbose = -1;
   if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
 
   if (ret != 0) {
     PROF_ERR("%s\n", "failed to write device profile using shared API");
-  } else if (isVerboseMode()) {
-    PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+  } else {
+#if defined(__linux__) && !defined(_WIN32)
+    // Dedup against the supplemental HSA pass: this section is now drained, so
+    // the HSA walk must not drain the same device code object again.
+    profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+    if (isVerboseMode())
+      PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
   }
 
   return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
   return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
 }
 
-/* -------------------------------------------------------------------------- */
-/*  Collect device-side profile data                                          */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/*  Supplemental HSA-introspection drain (Linux only)                         */
+/*                                                                            */
+/*  The host-shadow drain above only sees device code objects registered      */
+/*  host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/*  hipModuleLoad* call. Device code linked by the offload device linker with */
+/*  no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/*  into a single kernel with no source module -- is invisible to it. This */
+/*  pass walks every GPU agent's loaded executables via HSA, finds each */
+/*  __llvm_profile_sections table directly on the device, and drains the ones */
+/*  the host-shadow pass did not already handle (deduped by the device */
+/*  section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/*  copy/relocate/write so the on-disk profraw layout is identical.           */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)
 
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
-  if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+  uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+  uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE                                   \
+  ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH                            \
+  ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME                                   \
+  ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS                       \
+  ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+  prof_hsa_agent_t agent;
+  prof_hsa_executable_t executable;
+  prof_hsa_loader_storage_type_t code_object_storage_type;
+  const void *code_object_storage_base;
+  size_t code_object_storage_size;
+  size_t code_object_storage_offset;
+  const void *segment_base;
+  size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+    prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+                                                   prof_hsa_agent_info_t,
+                                                   void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+    prof_hsa_executable_t, prof_hsa_agent_t,
+    prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+                          prof_hsa_executable_symbol_t, void *),
+    void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+    prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+                                                                     uint16_t,
+                                                                     size_t,
+                                                                     void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+    prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+  void *query_host_address;
+  hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+  __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+  return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+  int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+  if (State)
+    return State > 0 ? 0 : -1;
+
+  if (!__interception::DynamicLoaderAvailable()) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "Dynamic library loading not available - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+  if (!Hsa)
+    Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+  if (!Hsa) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+                      "HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  hsa_init_ty pHsaInit =
+      (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+  hsa_system_get_major_extension_table_ty pGetExtTable =
+      (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+          Hsa, "hsa_system_get_major_extension_table");
+  pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+      Hsa, "hsa_iterate_agents");
+  pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+      Hsa, "hsa_agent_get_info");
+  pHsaExecIterAgentSyms =
+      (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_iterate_agent_symbols");
+  pHsaSymGetInfo =
+      (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+          Hsa, "hsa_executable_symbol_get_info");
+
+  if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+      !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+    PROF_WARN("%s",
+              "required HSA symbols missing - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+   * rather than from the library constructor, so merely loading the
+   * instrumented library does not initialize HSA in the process -- which would
+   * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+   * the parent (see the constructor note at the end of the HSA block). In the
+   * common case the drain runs from the profile write path while HSA is still
+   * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+   * this simply re-initializes HSA (the process is exiting anyway). */
+  prof_hsa_status_t St = pHsaInit();
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    if (isVerboseMode())
+      PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St);
+    return setHsaRuntimeState(-1);
+  }
+
+  prof_hsa_loader_pfn_t LoaderApi;
+  __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi));
+  St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi),
+                    &LoaderApi);
+  if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) {
+    PROF_WARN("AMD loader extension unavailable (0x%x) - "
+              "HSA device profiling disabled\n",
+              St);
+    return setHsaRuntimeState(-1);
+  }
+  pQuerySegDescs = LoaderApi.query_segment_descriptors;
+
+  /* The device-to-host copies go through the shared HIP loader. */
+  ensureHipLoaded();
+  if (!pHipMemcpy) {
+    PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n");
+    return setHsaRuntimeState(-1);
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n");
+  return setHsaRuntimeState(1);
+}
+
+/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */
+static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
+
+/* Dedup of drained section-bounds tuples, shared with the host-shadow path
+ * (processDeviceOffloadPrf records here on every successful drain). A single
+ * linked device code object exposes one __llvm_profile_sections, but the same
+ * bounds may be seen via multiple agents, so each unique counter set is
+ * drained exactly once across both paths. */
+namespace {
+struct ProfBoundsTuple {
+  const void *data;
+  const void *cnts;
+  const void *names;
+};
+} // namespace
+
+#define PROF_MAX_SEEN_BOUNDS 256
+static ProfBoundsTuple SeenBounds[PROF_MAX_SEEN_BOUNDS];
+static int NumSeenBounds = 0;
+
+/* Pure check: has this bounds tuple already been drained? Does not mutate
+ * state, so a transient failure does not permanently suppress retries. */
+static int profBoundsAlreadyDrained(const void *D, const void *C,
+                                    const void *N) {
+  for (int i = 0; i < NumSeenBounds; ++i)
+    if (SeenBounds[i].data == D && SeenBounds[i].cnts == C &&
+        SeenBounds[i].names == N)
+      return 1;
+  return 0;
+}
+
+/* Record a drained bounds tuple. Idempotent. Called after a successful drain
+ * (either path) so a failed attempt stays retryable. */
+static void profRecordDrainedBounds(const void *D, const void *C,
+                                    const void *N) {
+  if (profBoundsAlreadyDrained(D, C, N))
+    return;
+  if (NumSeenBounds < PROF_MAX_SEEN_BOUNDS) {
+    SeenBounds[NumSeenBounds].data = D;
+    SeenBounds[NumSeenBounds].cnts = C;
+    SeenBounds[NumSeenBounds].names = N;
+    NumSeenBounds++;
+  }
+}
+
+#define PROF_MAX_GPU_AGENTS 64
+
+namespace {
+struct GpuAgent {
+  prof_hsa_agent_t agent;
+  char arch[64];
+};
+
+struct WalkState {
+  GpuAgent agents[PROF_MAX_GPU_AGENTS];
+  int num_agents;
+  int total_found;
+  int total_drained;
+};
+
+/* Per (agent, executable) symbol-iteration state. */
+struct SymbolState {
+  const char *arch;
+  int found;
+  int drained;
+};
+} // namespace
+
+/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable,
+ * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an
+ * earlier agent) already handled the same bounds. */
+static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
+                                  prof_hsa_executable_symbol_t Sym,
+                                  void *Data) {
+  SymbolState *S = (SymbolState *)Data;
+
+  prof_hsa_symbol_kind_t Kind;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      Kind != PROF_HSA_SYMBOL_KIND_VARIABLE)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint32_t NameLen = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
+                     &NameLen) != PROF_HSA_STATUS_SUCCESS ||
+      NameLen != sizeof(ProfileSectionsSymbol) - 1)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  char NameBuf[64];
+  if (NameLen + 1 > sizeof(NameBuf))
+    return PROF_HSA_STATUS_SUCCESS;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) !=
+      PROF_HSA_STATUS_SUCCESS)
+    return PROF_HSA_STATUS_SUCCESS;
+  NameBuf[NameLen] = '\0';
+
+  if (strcmp(NameBuf, ProfileSectionsSymbol) != 0)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  uint64_t Addr = 0;
+  if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+                     &Addr) != PROF_HSA_STATUS_SUCCESS ||
+      Addr == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  S->found++;
+
+  // Read the bounds table first to dedup (and detect empty sections) before
+  // the full copy/relocate done by processDeviceOffloadPrf.
+  __llvm_profile_gpu_sections Sec;
+  if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) {
+    PROF_WARN("%s", "failed to copy device bounds table\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+  if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart,
+                               Sec.NamesStart)) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "device bounds already drained, skipping\n");
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart;
+  size_t CntsBytes =
+      (const char *)Sec.CountersStop - (const char *)Sec.CountersStart;
+  if (DataBytes == 0 || CntsBytes == 0) {
+    // Empty code object: nothing to write. Mark seen so we don't revisit it.
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+    return PROF_HSA_STATUS_SUCCESS;
+  }
+
+  // Generate a collision-free target. Multiple distinct device code objects on
+  // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
+  static int DrainIndex = 0;
+  char Target[96];
+  if (DrainIndex == 0)
+    snprintf(Target, sizeof(Target), "%s", S->arch);
+  else
+    snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
+
+  // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
+  // Record the bounds (and advance the target index) only on success so a
+  // transient error stays retryable on a later agent or collect call.
+  int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
+  if (Rc == 0) {
+    S->drained++;
+    DrainIndex++;
+    profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+  }
+
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
+  prof_hsa_device_type_t DevType;
+  if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) !=
+          PROF_HSA_STATUS_SUCCESS ||
+      DevType != PROF_HSA_DEVICE_TYPE_GPU)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  WalkState *W = (WalkState *)Data;
+  if (W->num_agents >= PROF_MAX_GPU_AGENTS)
+    return PROF_HSA_STATUS_SUCCESS;
+
+  GpuAgent &GA = W->agents[W->num_agents++];
+  GA.agent = Agent;
+  char Name[64];
+  __builtin_memset(Name, 0, sizeof(Name));
+  pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name);
+  size_t N = strnlen(Name, sizeof(GA.arch) - 1);
+  __builtin_memcpy(GA.arch, Name, N);
+  GA.arch[N] = '\0';
+  if (!GA.arch[0])
+    strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1);
+
+  if (isVerboseMode())
+    PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch);
+  return PROF_HSA_STATUS_SUCCESS;
+}
+
+/* Reentrancy guard and "drained data at least once" latch. The collect hook
+ * may run more than once (an explicit early __llvm_profile_write_file plus the
+ * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
+ * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
+ * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
+ * stay retryable so a later call can still pick up code objects loaded later.
+ * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
+ * destructor) from corrupting the global SeenBounds table. Both flags use
+ * acquire/release atomics. */
+static int HsaDrainInProgress = 0;
+static int HsaDrainCompleted = 0;
+
+static int drainDevicesViaHsa(void) {
+  if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE))
     return 0;
 
-  if (!isHipAvailable())
+  int Expected = 0;
+  if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1,
+                                   /*weak=*/0, __ATOMIC_ACQ_REL,
+                                   __ATOMIC_ACQUIRE))
     return 0;
 
-  int Ret = 0;
+  struct InProgressGuard {
+    ~InProgressGuard() {
+      __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE);
+    }
+  } _Guard;
 
-  /* Shadow variables (static-linked kernels): drain from every device. */
-  if (NumShadowVariables > 0) {
-    int OrigDevice = -1;
-    hipGetDevice(&OrigDevice);
+  if (loadHsaRuntimePointers() != 0)
+    return 0; /* Runtime unavailable: stay retryable. */
 
-    for (int Dev = 0; Dev < NumDevices; ++Dev) {
-      if (!shouldCollectDevice(Dev)) {
-        if (isVerboseMode())
-          PROF_NOTE("Skipping unused device %d\n", Dev);
-        continue;
+  WalkState W;
+  __builtin_memset(&W, 0, sizeof(W));
+  prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W);
+  if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+    PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St);
+    return -1;
+  }
+  if (W.num_agents == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  /* query_segment_descriptors ships in every loader-extension version and is
+   * more permissive than iterate_executables on ROCm. It yields the loaded
+   * (agent, executable) pairs directly. */
+  size_t NumSegs = 0;
+  St = pQuerySegDescs(nullptr, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St);
+    return -1;
+  }
+  if (NumSegs == 0) {
+    if (isVerboseMode())
+      PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n");
+    return 0;
+  }
+
+  prof_hsa_loader_segment_descriptor_t *Segs =
+      (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs));
+  if (!Segs) {
+    PROF_ERR("%s\n", "failed to allocate segment descriptor array");
+    return -1;
+  }
+  UniqueFree SegsOwner(Segs);
+
+  St = pQuerySegDescs(Segs, &NumSegs);
+  if (St != PROF_HSA_STATUS_SUCCESS) {
+    PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St);
+    return -1;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs);
+
+  /* Walk unique (agent, executable) pairs. */
+  enum { kMaxPairs = 512 };
+  uint64_t SeenAgents[kMaxPairs];
+  uint64_t SeenExecs[kMaxPairs];
+  int NumPairs = 0;
+  int IterFailures = 0;
+
+  for (size_t i = 0; i < NumSegs; ++i) {
+    if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0)
+      continue;
+
+    int Seen = 0;
+    for (int j = 0; j < NumPairs; ++j)
+      if (SeenAgents[j] == Segs[i].agent.handle &&
+          SeenExecs[j] == Segs[i].executable.handle) {
+        Seen = 1;
+        break;
       }
-      if (hipSetDevice(Dev) != 0) {
-        if (isVerboseMode())
-          PROF_NOTE("Failed to set device %d, skipping\n", Dev);
-        continue;
+    if (Seen)
+      continue;
+    if (NumPairs < kMaxPairs) {
+      SeenAgents[NumPairs] = Segs[i].agent.handle;
+      SeenExecs[NumPairs] = Segs[i].executable.handle;
+      NumPairs++;
+    }
+
+    const char *Arch = nullptr;
+    for (int k = 0; k < W.num_agents; ++k)
+      if (W.agents[k].agent.handle == Segs[i].agent.handle) {
+        Arch = W.agents[k].arch;
+        break;
       }
-      const char *ArchName = getDeviceArchName(Dev);
-      if (isVerboseMode())
-        PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
-                  ArchName);
-      for (int i = 0; i < NumShadowVariables; ++i) {
-        /* RDC-mode multi-shadow drains need a distinct profraw per TU;
-         * single-TU programs keep the bare arch target. */
-        const char *Target = ArchName;
-        char TargetWithIdx[64];
-        if (NumShadowVariables > 1) {
-          snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, i);
-          Target = TargetWithIdx;
+    if (!Arch)
+      continue; /* not a GPU agent we collected */
+
+    SymbolState S;
+    __builtin_memset(&S, 0, sizeof(S));
+    S.arch = Arch;
+    if (isVerboseMode())
+      PROF_NOTE("walking executable 0x%llx on %s\n",
+                (unsigned long long)Segs[i].executable.handle, Arch);
+    prof_hsa_status_t IterSt =
+        pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S);
+    if (IterSt != PROF_HSA_STATUS_SUCCESS &&
+        IterSt != PROF_HSA_STATUS_INFO_BREAK) {
+      PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx "
+                "failed (0x%x)\n",
+                (unsigned long long)Segs[i].executable.handle, IterSt);
+      IterFailures++;
+    }
+    W.total_found += S.found;
+    W.total_drained += S.drained;
+  }
+
+  if (isVerboseMode())
+    PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d "
+              "iter-failures=%d\n",
+              W.num_agents, NumPairs, W.total_found, W.total_drained,
+              IterFailures);
+
+  /* Latch only when we actually drained data. Deliberately do NOT latch the
+   * "walked everything but found nothing new" case: an early collect call can
+   * run before any kernel launch, and latching it would suppress the real
+   * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
+  if (W.total_drained > 0)
+    __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
+  return (IterFailures > 0) ? -1 : 0;
+}
+
+/* NOTE: deliberately no library constructor that calls hsa_init() here.
+ * Bringing HSA up merely because the instrumented library was loaded poisons
+ * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
+ * HIP/HSA uninitialized in the parent and only touch HIP inside forked
+ * children. A parent that has already hsa_init()'d makes those children crash
+ * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
+ * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
+ * rationale there. */
+
+#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */
+
+/* -------------------------------------------------------------------------- */
+/*  Collect device-side profile data                                          */
+/* -------------------------------------------------------------------------- */
+
+extern "C" int __llvm_profile_hip_collect_device_data(void) {
+  int Ret = 0;
+
+  /* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows)
+   * and intercepted dynamic modules. Only meaningful when something registered
+   * host-side; skipped entirely for pure device-linked programs (RCCL), which
+   * the supplemental HSA pass below handles. */
+  if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable()) {
+    /* Shadow variables (static-linked kernels): drain from every device. */
+    if (NumShadowVariables > 0) {
+      int OrigDevice = -1;
+      hipGetDevice(&OrigDevice);
+
+      for (int Dev = 0; Dev < NumDevices; ++Dev) {
+        if (!shouldCollectDevice(Dev)) {
+          if (isVerboseMode())
+            PROF_NOTE("Skipping unused device %d\n", Dev);
+          continue;
+        }
+#if defined(__linux__) && !defined(_WIN32)
+        /* When no kernel launch was tracked at all, shouldCollectDevice()
+         * falls back to collect-all, which can fault/hang reading a
+         * non-resident device's sections on a multi-GPU host (e.g. a program
+         * that never launches, collects before its first launch, or launches
+         * only via an untracked API). On Linux the supplemental HSA drain
+         * below covers those cases safely -- it walks only code objects
+         * actually resident on each agent -- so skip the host-shadow pass
+         * entirely rather than take the unsafe fallback. */
+        if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) {
+          if (isVerboseMode())
+            PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n",
+                      Dev);
+          continue;
+        }
+#endif
+        if (hipSetDevice(Dev) != 0) {
+          if (isVerboseMode())
+            PROF_NOTE("Failed to set device %d, skipping\n", Dev);
+          continue;
+        }
+        const char *ArchName = getDeviceArchName(Dev);
+        if (isVerboseMode())
+          PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
+                    ArchName);
+        for (int i = 0; i < NumShadowVariables; ++i) {
+          /* RDC-mode multi-shadow drains need a distinct profraw per TU;
+           * single-TU programs keep the bare arch target. */
+          const char *Target = ArchName;
+          char TargetWithIdx[64];
+          if (NumShadowVariables > 1) {
+            snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName,
+                     i);
+            Target = TargetWithIdx;
+          }
+          if (processShadowVariable(i, Target) != 0)
+            Ret = -1;
         }
-        if (processShadowVariable(i, Target) != 0)
-          Ret = -1;
       }
-    }
 
-    if (OrigDevice >= 0)
-      hipSetDevice(OrigDevice);
-  }
+      if (OrigDevice >= 0)
+        hipSetDevice(OrigDevice);
+    }
 
-  /* Warn about unprocessed TUs; skip cleared slots (already drained). */
-  lockDynamicModules();
-  for (int i = 0; i < NumDynamicModules; ++i) {
-    OffloadDynamicModuleInfo *MI = &DynamicModules[i];
-    if (!MI->ModulePtr)
-      continue;
-    for (int t = 0; t < MI->NumTUs; ++t) {
-      if (!MI->TUs[t].Processed) {
-        PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
-                  MI->ModulePtr, t);
-        Ret = -1;
+    /* Warn about unprocessed TUs; skip cleared slots (already drained). */
+    lockDynamicModules();
+    for (int i = 0; i < NumDynamicModules; ++i) {
+      OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+      if (!MI->ModulePtr)
+        continue;
+      for (int t = 0; t < MI->NumTUs; ++t) {
+        if (!MI->TUs[t].Processed) {
+          PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
+                    MI->ModulePtr, t);
+          Ret = -1;
+        }
       }
     }
+    unlockDynamicModules();
   }
-  unlockDynamicModules();
+
+#if defined(__linux__) && !defined(_WIN32)
+  /* Supplemental HSA-introspection drain: catches device code objects with no
+   * host-side shadow (e.g. RCCL device-linked kernels). Runs after the
+   * host-shadow drain so already-drained sections are deduped out, and runs
+   * even when there are no host shadows at all (the common RCCL case). */
+  if (drainDevicesViaHsa() != 0)
+    Ret = -1;
+#endif
 
   if (Ret != 0)
     PROF_WARN("%s\n", "failed to collect device profile data");
diff --git a/compiler-rt/test/profile/AMDGPU/device-basic.hip b/compiler-rt/test/profile/AMDGPU/device-basic.hip
new file mode 100644
index 0000000000000..4fcf044802240
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-basic.hip
@@ -0,0 +1,67 @@
+// Basic HIP device PGO drain end-to-end: a host + device .profraw are written
+// at exit (the device one arch-prefixed), they merge, the merged profile
+// contains the device kernel's counters, and llvm-cov reports device-side
+// coverage. Covers both non-RDC and RDC device compiles.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/a.out
+// A device profraw (arch-prefixed) must have been drained alongside the host one.
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN:   | FileCheck --check-prefix=FUNCS %s
+// Confirm the embedded device image is extractable (failure here is the real
+// cause of any downstream llvm-cov failure, so let it propagate).
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-cov report %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   -instr-profile=%t.dir/a.profdata 2>&1 | FileCheck --check-prefix=COV %s
+
+// --- RDC ---
+// RUN: rm -f %t.dir/*.profraw
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/b.out
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/b.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/b.profdata \
+// RUN:   | FileCheck --check-prefix=FUNCS %s
+
+#include <hip/hip_runtime.h>
+
+__global__ void addk(int *p) {
+  if (*p > 0)
+    *p += 1;
+  else
+    *p -= 1;
+}
+
+int main() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  int h = 5;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  addk<<<1, 1>>>(d);
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+  return h > 0 ? 0 : 1;
+}
+
+// The merged profile contains both the host main and the device kernel,
+// proving the device counters were drained and merged.
+// FUNCS-DAG: addk
+// FUNCS-DAG: main
+
+// COV: TOTAL
diff --git a/compiler-rt/test/profile/AMDGPU/device-early-collect.hip b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
new file mode 100644
index 0000000000000..3e2c6e84e26c2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
@@ -0,0 +1,68 @@
+// M1 regression: calling __llvm_profile_hip_collect_device_data() before any
+// kernel has been launched must not poison the later atexit drain.  The early
+// call sees "no instrumented code object loaded yet" (a transient no-op) and
+// must not latch the drain as completed; otherwise the post-launch atexit
+// pass produces no device .profraw and we silently lose device counters.
+//
+// REQUIRES: hip, amdgpu
+// Guards the Linux introspection drain's DrainCompleted latch; the Windows
+// host-shadow drain has no such latch (it tracks per-TU Processed flags).
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %t.dir/a.out
+// Both the host profraw and at least one device profraw (gfx-prefixed) must
+// have been produced, despite the early collection attempt.
+// RUN: ls %t.dir/host.*.profraw
+// RUN: ls %t.dir/gfx*.profraw
+// And the merged profile must contain the device kernel that was launched
+// *after* the early collect.
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+
+#include <hip/hip_runtime.h>
+
+// Declared by libclang_rt.profile-<host arch>.a; we call it directly to
+// simulate any caller that drains device counters at an arbitrary point in
+// the program lifetime (e.g. a per-iteration profile dump).
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void post_collect_kernel(int *p) {
+  if (*p > 0)
+    *p += 1;
+  else
+    *p -= 1;
+}
+
+int main() {
+  // (1) Early collection -- runs before any kernel launch.  The drainer
+  //     finds either no GPU agents, no loaded segments, or no instrumented
+  //     bounds table, and returns 0 without latching DrainCompleted.
+  (void)__llvm_profile_hip_collect_device_data();
+
+  // (2) Now launch a kernel.  HIP loads the device code object that carries
+  //     the __llvm_profile_sections bounds table, executes our kernel, and
+  //     populates the device-side counters.
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  int h = 5;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  post_collect_kernel<<<1, 1>>>(d);
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  // (3) Exit normally.  The atexit drain runs and -- because step (1) did
+  //     not latch DrainCompleted -- it walks the (now loaded) code object,
+  //     finds __llvm_profile_sections, and emits the device .profraw.
+  return h > 0 ? 0 : 1;
+}
+
+// CHECK-DAG: post_collect_kernel
+// CHECK-DAG: main
diff --git a/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
new file mode 100644
index 0000000000000..a154308d725d8
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
@@ -0,0 +1,44 @@
+// Independence / robustness: an instrumented HIP program that never launches a
+// kernel still writes its host .profraw, and the device drain is a clean no-op
+// (no crash, no spurious device .profraw). We assert the no-op condition
+// directly via the runtime's verbose log rather than rely on HIP lazy-loading
+// to leave the device code object unloaded -- the loader may load it for
+// other reasons (e.g. eager registration), and in that case the drain
+// legitimately walks it and reports zero instrumented sections / zero
+// drained. Either outcome is correct.
+//
+// REQUIRES: hip, amdgpu
+// The terminal conditions checked below ("no GPU agents", "no loaded
+// segments", "drained=0") are Linux HSA-drain strings with no Windows analog.
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 \
+// RUN:   %t.dir/a.out 2> %t.dir/verbose.log
+// RUN: ls %t.dir/host.*.profraw
+// No arch-prefixed device profraw should have been produced.
+// RUN: not ls %t.dir/gfx*.profraw
+// The drain must have run; one of these three terminal conditions must hold:
+//   - no GPU agents enumerated (test host has /dev/kfd but no usable agent)
+//   - no loaded code object segments at exit
+//   - the walk completed and drained=0 (no instrumented kernel was launched
+//     so the device code object either wasn't loaded or its bounds were
+//     empty/already drained)
+// RUN: FileCheck --input-file=%t.dir/verbose.log %s
+// CHECK: {{no GPU agents present|no loaded segments|drained=0}}
+
+#include <hip/hip_runtime.h>
+
+// Defined but never launched.
+__global__ void unused(int *p) { *p += 1; }
+
+int main() {
+  int n = 0;
+  (void)hipGetDeviceCount(&n);
+  return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/device-symbols.hip b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
new file mode 100644
index 0000000000000..f12283b7da636
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
@@ -0,0 +1,42 @@
+// The decoupled drain reads only the canonical __llvm_profile_sections bounds
+// table provided by the device profile runtime (InstrProfilingPlatformGPU.c),
+// since clang no longer emits a per-TU struct. Assert that symbol is present
+// in the device ELF's dynamic symbol table (protected visibility) for both
+// non-RDC and RDC device compiles. This is the contract the drainer depends on.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// Extraction failure here would make the readelf invocation succeed against
+// an empty/missing file; surface it instead of hiding it behind `|| true`.
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   | FileCheck %s
+
+// --- RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: llvm-objdump --offloading %t.dir/b.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/b.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN:   | FileCheck %s
+
+// CHECK: PROTECTED {{.*}} __llvm_profile_sections
+
+#include <hip/hip_runtime.h>
+
+__global__ void k(int *p) { *p += 1; }
+
+int main() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 2;
+  k<<<1, 1>>>(d);
+  (void)hipFree(d);
+  return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
new file mode 100644
index 0000000000000..5148dd6b9e2f2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
@@ -0,0 +1,4 @@
+# Device-profile drain tests: require an AMD GPU (and, implicitly, the amdgcn
+# device profile runtime in the resource directory and a ROCm/HIP install).
+if "amdgpu" not in config.available_features:
+    config.unsupported = True
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
new file mode 100644
index 0000000000000..8cbe7c970052c
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
@@ -0,0 +1,51 @@
+// Test basic HIP PGO instrumentation and profile collection.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: ls %t.dir/prof.profraw
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// PROF: _Z6squarePiPKii
+// PROF: main
+// PROF: Functions shown: 2
+// PROF: Total functions: 2
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void square(int *out, const int *in, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        out[idx] = in[idx] * in[idx];
+}
+
+int main() {
+    constexpr int N = 64;
+    int h_in[N], h_out[N];
+    for (int i = 0; i < N; ++i) h_in[i] = i;
+
+    int *d_in, *d_out;
+    (void)hipMalloc(&d_in, N * sizeof(int));
+    (void)hipMalloc(&d_out, N * sizeof(int));
+    (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+
+    square<<<1, N>>>(d_out, d_in, N);
+
+    (void)hipMemcpy(h_out, d_out, N * sizeof(int), hipMemcpyDeviceToHost);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h_out[i] != i * i) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    (void)hipFree(d_in);
+    (void)hipFree(d_out);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
new file mode 100644
index 0000000000000..5a2393f8dcc47
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
@@ -0,0 +1,63 @@
+// Explicit-collect idempotency: a program that calls
+// __llvm_profile_hip_collect_device_data() itself *after* a launch (e.g. a
+// periodic profile dump) and then also exits normally must not double-count the
+// device counters. The explicit drain and the atexit drain write the same
+// arch-named device profraw, so the merged profile must reflect a single launch
+// (function count 64, even-branch 32), not two. Complements device-early-collect
+// (which covers a collect *before* the first launch).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// A single launch of 64 threads, drained twice (explicit + atexit), must still
+// merge to exactly one launch's worth of counts.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+
+  // Explicit mid-program collect after the launch. The atexit drain runs again
+  // at exit; together they must not double the device counters.
+  (void)__llvm_profile_hip_collect_device_data();
+
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
new file mode 100644
index 0000000000000..c2bfc9ac9dc66
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
@@ -0,0 +1,56 @@
+// Quantitative device-counter correctness: the drained device profile must carry
+// the *exact* per-region execution counts produced by the kernel, not merely
+// "some counts are present". A single launch of 64 threads over classify() must
+// record a function entry count of 64 and an even-branch block count of 32
+// (idx % 2 == 0 holds for exactly half of idx in [0, 64)). This pins the drain +
+// dedup path against silent under/over-counting (e.g. a dedup bug that dropped
+// or doubled a section would change these numbers).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// The device kernel ran with exactly 64 threads, all of which entered the
+// function; the even branch was taken 32 times and the early-return path 0.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
new file mode 100644
index 0000000000000..a867c30f0edfb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
@@ -0,0 +1,51 @@
+// Test HIP coverage mapping produces source-level coverage for host code.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-cov report %t -instr-profile=%t.profdata 2>&1 \
+// RUN:   | FileCheck %s --check-prefix=REPORT
+//
+// REPORT: instrprof-hip-coverage.hip
+// No coverage column should be fully uncovered. Anchor on a non-digit before
+// the "0.00%" so this does not spuriously match e.g. "80.00%".
+// REPORT-NOT: {{[^.0-9]0[.]00%}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int gpu_abs(int x) {
+    return x < 0 ? -x : x;
+}
+
+__global__ void abs_kernel(int *data, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] = gpu_abs(data[idx]);
+}
+
+int main() {
+    constexpr int N = 16;
+    int h[N];
+    for (int i = 0; i < N; ++i)
+        h[i] = (i % 2 == 0) ? i : -i;
+
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+    (void)hipMemcpy(d, h, N * sizeof(int), hipMemcpyHostToDevice);
+    abs_kernel<<<1, N>>>(d, N);
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != i) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
new file mode 100644
index 0000000000000..a24b28ec9af0a
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
@@ -0,0 +1,67 @@
+// Test that device-side branching is captured in profile counters.
+// Exercises the classify-style pattern where different branches are taken
+// by different threads, verifying that counter values reflect actual execution.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// Device functions should appear with non-zero counters. The __device__
+// classify() helper is inlined into the histogram kernel, so it does not get a
+// separate profile record; its branching is captured within the kernel's
+// counters instead.
+// PROF-DAG: _Z9histogramPKiPii
+// PROF-DAG: main
+// PROF: Total functions: 2
+// PROF: Maximum function count: {{[1-9][0-9]*}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int classify(int x) {
+    if (x > 100)    return 2;
+    else if (x > 0) return 1;
+    else            return 0;
+}
+
+__global__ void histogram(const int *input, int *bins, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n) {
+        int cls = classify(input[idx]);
+        atomicAdd(&bins[cls], 1);
+    }
+}
+
+int main() {
+    constexpr int N = 256;
+    constexpr int NBINS = 3;
+
+    int h_in[N], h_bins[NBINS] = {};
+    for (int i = 0; i < N; ++i)
+        h_in[i] = (i % 3 == 0) ? -1 : (i % 3 == 1) ? 50 : 200;
+
+    int *d_in, *d_bins;
+    (void)hipMalloc(&d_in, N * sizeof(int));
+    (void)hipMalloc(&d_bins, NBINS * sizeof(int));
+    (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+    (void)hipMemset(d_bins, 0, NBINS * sizeof(int));
+
+    histogram<<<1, N>>>(d_in, d_bins, N);
+
+    (void)hipMemcpy(h_bins, d_bins, NBINS * sizeof(int), hipMemcpyDeviceToHost);
+    printf("bins: [%d, %d, %d]\n", h_bins[0], h_bins[1], h_bins[2]);
+
+    int ok = (h_bins[0] > 0 && h_bins[1] > 0 && h_bins[2] > 0);
+    printf("%s\n", ok ? "PASS" : "FAIL");
+
+    (void)hipFree(d_in);
+    (void)hipFree(d_bins);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
new file mode 100644
index 0000000000000..c79cf568f88bc
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
@@ -0,0 +1,61 @@
+// Fork safety: loading the profile-instrumented library must NOT initialize
+// HSA in a process that itself never touches HIP and only runs device work in
+// forked children. RCCL's unit tests follow exactly this pattern -- the parent
+// deliberately keeps HIP/HSA uninitialized and launches kernels only inside
+// forked children. If the profile runtime's library constructor eagerly called
+// hsa_init(), the child would inherit invalid HSA state across fork() and crash
+// inside HSA (e.g. hsa_amd_signal_create -> SharedSignalPool::alloc). The HSA
+// drain therefore brings HSA up lazily, never from a constructor.
+//
+// REQUIRES: hip, amdgpu
+// The eager-hsa_init fork hazard and the lazy HSA drain are Linux-only.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   %run %t 2>&1 | FileCheck %s
+//
+// The forked child must complete its kernel without crashing in HSA.
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+#include <sys/wait.h>
+#include <unistd.h>
+
+__global__ void increment(int *p) { *p += 1; }
+
+static int doChildWork() {
+  int *d = nullptr;
+  if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+    return 1;
+  int h = 41;
+  (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+  increment<<<1, 1>>>(d);
+  if (hipDeviceSynchronize() != hipSuccess)
+    return 1;
+  (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+  return h == 42 ? 0 : 1;
+}
+
+int main() {
+  // The parent intentionally performs no HIP/HSA work before forking.
+  pid_t pid = fork();
+  if (pid < 0) {
+    printf("FAIL (fork failed)\n");
+    return 1;
+  }
+  if (pid == 0) {
+    // Child runs the device work; _exit avoids flushing the parent's profile
+    // handlers from the child (the RCCL test pattern).
+    _exit(doChildWork());
+  }
+
+  int status = 0;
+  (void)waitpid(pid, &status, 0);
+  int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
+  printf("%s\n", ok ? "PASS" : "FAIL");
+  return ok ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
new file mode 100644
index 0000000000000..6a99546d34bdb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
@@ -0,0 +1,57 @@
+// Test that HIP PGO works on multi-GPU systems. The kernel runs on the default
+// device, so the host-shadow drain (guarded by upstream's launch tracking)
+// collects only that device and the supplemental HSA agent-walk then finds the
+// same code object and dedups it out. The point of the test is that neither
+// pass crashes or hangs reading a non-resident device on a host with several
+// GPUs (the failure mode that the launch tracking + HSA residency walk fix).
+//
+// REQUIRES: hip, amdgpu
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain collects only the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The host-shadow pass drains the launched device, the HSA walk finds that same
+// code object and dedups it (drained=0), and the program does not crash.
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void add_one(int *data, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] += 1;
+}
+
+int main() {
+    int ndev = 0;
+    (void)hipGetDeviceCount(&ndev);
+
+    constexpr int N = 32;
+    int h_data[N];
+    for (int i = 0; i < N; ++i) h_data[i] = i;
+
+    int *d_data;
+    (void)hipMalloc(&d_data, N * sizeof(int));
+    (void)hipMemcpy(d_data, h_data, N * sizeof(int), hipMemcpyHostToDevice);
+
+    add_one<<<1, N>>>(d_data, N);
+
+    (void)hipMemcpy(h_data, d_data, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d_data);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h_data[i] != i + 1) ok = 0;
+
+    printf("%s (devices=%d)\n", ok ? "PASS" : "FAIL", ndev);
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
new file mode 100644
index 0000000000000..8cf1258a63535
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
@@ -0,0 +1,63 @@
+// Multi-process accumulation: device counters from several independent runs of an
+// instrumented HIP program must accumulate when their profraws are offline-merged
+// (llvm-profdata merge), the common real-world feedback-collection path. Three
+// runs of a 64-thread launch must sum to a function count of 192 and an
+// even-branch block count of 96 (3 x 64 / 3 x 32).
+//
+// Note: on-the-fly merge-pooling via LLVM_PROFILE_FILE=...%m does NOT currently
+// accumulate *device* counters -- the device profraw is rewritten per process
+// rather than merged in place -- so each process must write a distinct file
+// (here via %p) and the accumulation is done by llvm-profdata merge.
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN:   -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN:   -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run1.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run2.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run3.%%p.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN:   | FileCheck %s
+//
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 192
+// CHECK: Block counts: [0, 96]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+  int idx = blockIdx.x * blockDim.x + threadIdx.x;
+  if (idx >= n)
+    return;
+  if (idx % 2 == 0)
+    out[idx] = 1;
+  else
+    out[idx] = 0;
+}
+
+int main() {
+  constexpr int N = 64;
+  int *d = nullptr;
+  if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+    return 2;
+  classify<<<1, N>>>(d, N);
+  (void)hipDeviceSynchronize();
+  int h[N];
+  (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+  (void)hipFree(d);
+
+  int evens = 0;
+  for (int i = 0; i < N; ++i)
+    evens += h[i];
+  printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+  return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
new file mode 100644
index 0000000000000..0fd6185b82441
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
@@ -0,0 +1,58 @@
+// Test PGO with multiple kernel launches from a single TU.
+// Verifies that counters from all device functions are collected correctly.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN:   | FileCheck %s --check-prefix=PROF
+//
+// All three kernels plus main should be profiled.
+// PROF-DAG: _Z4fillPii
+// PROF-DAG: _Z5scalePii
+// PROF-DAG: _Z6negatePii
+// PROF-DAG: main
+// PROF: Total functions: 4
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val) {
+    data[threadIdx.x] = val;
+}
+
+__global__ void scale(int *data, int factor) {
+    data[threadIdx.x] *= factor;
+}
+
+__global__ void negate(int *data, int n) {
+    int idx = threadIdx.x;
+    if (idx < n)
+        data[idx] = -data[idx];
+}
+
+int main() {
+    constexpr int N = 16;
+    int h[N];
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+
+    fill<<<1, N>>>(d, 5);
+    scale<<<1, N>>>(d, 3);
+    negate<<<1, N>>>(d, N);
+
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != -15) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
new file mode 100644
index 0000000000000..5d3dea671047b
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
@@ -0,0 +1,60 @@
+// Test PGO when the kernel runs on a non-default device (here the program
+// selects device 1). Upstream's launch tracking records that device 1 was used,
+// so the host-shadow drain skips the other devices and collects device 1, and
+// the supplemental HSA agent-walk then finds that same code object and dedups it
+// out. This exercises both that the correct device is drained and that an
+// unused device is never read (which would fault/hang on a multi-GPU host).
+//
+// REQUIRES: hip, amdgpu, multi-device
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain only collects the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The launched device (1) is drained, the unused default device is skipped, and
+// the HSA walk finds the same code object and dedups it (drained=0).
+// CHECK: Skipping unused device 0
+// CHECK: Collecting static profile data from device 1
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] = val;
+}
+
+int main() {
+    int ndev = 0;
+    (void)hipGetDeviceCount(&ndev);
+    if (ndev < 2) {
+        printf("PASS (skipped: only %d device)\n", ndev);
+        return 0;
+    }
+
+    (void)hipSetDevice(1);
+
+    constexpr int N = 32;
+    int h[N] = {};
+    int *d;
+    (void)hipMalloc(&d, N * sizeof(int));
+    fill<<<1, N>>>(d, 99, N);
+    (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i)
+        if (h[i] != 99) ok = 0;
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
new file mode 100644
index 0000000000000..9a8a8187f8e77
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
@@ -0,0 +1,63 @@
+// Test the full PGO cycle: instrument, collect, merge, optimize.
+// Verifies that the optimized binary produces correct output and that
+// profile data is consumed without errors.
+//
+// REQUIRES: hip, amdgpu
+//
+// Step 1: Build instrumented binary.
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t.instr \
+// RUN:   -L%hip_lib_path -lamdhip64
+//
+// Step 2: Run to collect profile data.
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN:   LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t.instr 2>&1 | FileCheck %s
+//
+// Step 3: Merge profile data.
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+//
+// Step 4: Build optimized binary with profile data.
+// RUN: %clang -x hip -fprofile-instr-use=%t.profdata \
+// RUN:   --offload-arch=%amdgpu_arch %s -o %t.opt \
+// RUN:   -L%hip_lib_path -lamdhip64 -O2
+//
+// Step 5: Run optimized binary.
+// RUN: env LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN:   HIP_VISIBLE_DEVICES=0 %run %t.opt 2>&1 | FileCheck %s
+//
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void scale(float *data, float factor, int n) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    if (idx < n)
+        data[idx] *= factor;
+}
+
+int main() {
+    constexpr int N = 128;
+    float h[N];
+    for (int i = 0; i < N; ++i) h[i] = (float)i;
+
+    float *d;
+    (void)hipMalloc(&d, N * sizeof(float));
+    (void)hipMemcpy(d, h, N * sizeof(float), hipMemcpyHostToDevice);
+
+    scale<<<1, N>>>(d, 2.0f, N);
+
+    (void)hipMemcpy(h, d, N * sizeof(float), hipMemcpyDeviceToHost);
+    (void)hipFree(d);
+
+    int ok = 1;
+    for (int i = 0; i < N; ++i) {
+        float expected = (float)(i * 2);
+        if (h[i] != expected) ok = 0;
+    }
+
+    printf("%s\n", ok ? "PASS" : "FAIL");
+    return !ok;
+}
diff --git a/compiler-rt/test/profile/device-pgo/README.md b/compiler-rt/test/profile/device-pgo/README.md
new file mode 100644
index 0000000000000..4338c637abe49
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/README.md
@@ -0,0 +1,125 @@
+# HIP device PGO / code coverage: standalone build & test recipe
+
+This directory provides a CMake-based recipe to build and exercise HIP device
+profile-guided optimization (PGO) and source-based code coverage **outside
+TheRock**, using only an `llvm-project` checkout plus a ROCm runtime.
+
+It builds, in one configure:
+
+- the host toolchain (`clang`, `clang++`, `lld`, `llvm-profdata`, `llvm-cov`)
+  and the lit-lite test utilities (`FileCheck`, `not`);
+- the host ROCm drain runtime `clang_rt.profile_rocm` (opt-in,
+  `COMPILER_RT_BUILD_PROFILE_ROCM=ON`). It runs the upstream host-shadow drain on
+  all platforms; on **Linux** `InstrProfilingPlatformROCm.cpp` additionally runs a
+  supplemental HSA-introspection pass (with content-dedup) to collect device code
+  objects that have no host shadow (e.g. device-linked/RCCL kernels);
+- the **amdgcn device** profile runtime `libclang_rt.profile.a` (the baremetal
+  profile subset that provides `__llvm_profile_instrument_gpu` and the
+  `__llvm_profile_sections` bounds table), built for the `amdgcn-amd-amdhsa`
+  runtime target with LLVM libc for amdgcn.
+
+## Why a separate library
+
+Upstream relands HIP offload PGO runtime support as the **opt-in**
+`clang_rt.profile_rocm` (llvm#201606), a `/MD` superset of `clang_rt.profile`;
+the base library stays unchanged. The driver links `clang_rt.profile_rocm`
+ahead of `clang_rt.profile` for HIP host links when profiling is requested
+(see `clang/lib/Driver/ToolChains/{Linux,MSVC}.cpp`). This recipe just turns
+the option on and builds the matching amdgcn device runtime.
+
+## Prerequisites
+
+- A ROCm installation (for `libamdhip64` and, on Linux, `libhsa-runtime64`),
+  e.g. `/opt/rocm`. Export `ROCM_PATH`.
+- An AMD GPU visible to the runtime for the *run* step (the build step does
+  not need a GPU). `amdgpu-arch` should list your device(s).
+- Ninja, a host C/C++ compiler, and Python 3.
+
+## Build
+
+```bash
+export ROCM_PATH=/opt/rocm
+./build.sh                 # builds into <repo>/build/device-pgo
+# or: ./build.sh /path/to/builddir
+```
+
+Key outputs under the build dir:
+
+```
+bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+```
+
+See `toolchain-cache.cmake` for the exact CMake variables, including the
+`LLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa"` split.
+
+## Run the tests
+
+The lit-lite runner (`../run_gpu_tests.py`) compiles each `.hip` test with the
+just-built toolchain, runs it on the GPU, and pipes output through `FileCheck`.
+It auto-detects features (`multi-device` via `amdgpu-arch`) so tests that need
+two visible GPUs are skipped on single-GPU hosts.
+
+```bash
+python3 ../run_gpu_tests.py \
+    --toolchain-bin "$PWD/<builddir>/bin" \
+    --hip-lib-path "$ROCM_PATH/lib" \
+    ../GPU ../AMDGPU
+```
+
+`--toolchain-bin` must be an **absolute** path (the runner executes each RUN
+line from a temp directory). With the toolchain's `amdgpu-arch`/`offload-arch`
+on hand, `--offload-arch=native` resolves automatically and the `multi-device`
+feature is enabled when 2+ GPUs are visible (so multi-GPU tests run on a
+multi-GPU host and are skipped otherwise). On a multi-gfx90a host this suite is
+15 passed, 0 failed.
+
+### Coverage notes / known gaps
+
+- Quantitative device-counter correctness (`instrprof-hip-counter-correctness`),
+  multi-process offline accumulation (`instrprof-hip-multi-process-merge`) and
+  explicit-collect idempotency (`instrprof-hip-collect-after`) pin exact device
+  counter values, so a dedup or drain regression that drops/doubles a section is
+  caught.
+- `LLVM_PROFILE_FILE=...%m` on-the-fly merge-pooling does **not** accumulate
+  *device* counters today (the device profraw is rewritten per process rather
+  than merged in place); multi-process accumulation must go through
+  `llvm-profdata merge` of distinct per-process files.
+- There is no in-tree test that drains a code object with **no** host shadow in
+  isolation (the pure device-linked/RCCL case the HSA pass uniquely handles): it
+  requires a real device-side library build (the profile runtime linked into the
+  device image), which is not expressible in the lit-lite harness via the clang
+  driver. The dedup tests do prove the HSA pass finds and dedups the same code
+  objects the host-shadow pass drains; validating the no-host-shadow drain needs
+  an actual RCCL-style binary in downstream CI.
+
+## Manual workflow (for reference)
+
+```bash
+CLANG=<builddir>/bin/clang++
+# 1. Instrumented build (host + device).
+$CLANG -O2 -fprofile-instr-generate -fcoverage-mapping \
+    --offload-arch=gfx1100 -xhip app.hip -o app
+
+# 2. Run. Produces a host .profraw and a device
+#    <name>.amdgcn-amd-amdhsa.<arch>.profraw drained by clang_rt.profile_rocm.
+LLVM_PROFILE_FILE='app-%p.profraw' ./app
+
+# 3. Merge (device profiles are merged per GPU arch).
+<builddir>/bin/llvm-profdata merge -o app.profdata app-*.profraw
+
+# 4. Coverage report (device).
+<builddir>/bin/llvm-cov show ./app -instr-profile=app.profdata
+```
+
+## Notes / environment-specific knobs
+
+- `--offload-arch` must match your GPU; the amdgcn device runtime is target
+  generic but the app's device code is per-arch. The build installs
+  `offload-arch` (and the `amdgpu-arch` alias) into `<builddir>/bin`, so
+  `--offload-arch=native` works without a system ROCm `amdgpu-arch`.
+- The amdgcn runtime target requires LLVM libc for amdgcn; if your environment
+  cannot build it, drop `libc` from
+  `RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES` only if your headers are
+  otherwise provided.
diff --git a/compiler-rt/test/profile/device-pgo/build.sh b/compiler-rt/test/profile/device-pgo/build.sh
new file mode 100755
index 0000000000000..edf90f42fb8c1
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/build.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Standalone (non-TheRock) build of the toolchain + host/device runtimes used by
+# the HIP device-PGO / code-coverage tests. See toolchain-cache.cmake and
+# README.md for details.
+#
+#   ./build.sh [BUILD_DIR]
+#
+# Env knobs:
+#   LLVM_SRC   path to the llvm-project checkout (default: repo root inferred
+#              from this script's location)
+#   JOBS       parallelism for ninja (default: nproc)
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# .../compiler-rt/test/profile/device-pgo -> repo root is four levels up.
+LLVM_SRC="${LLVM_SRC:-$(cd "${SCRIPT_DIR}/../../../.." && pwd)}"
+BUILD_DIR="${1:-${LLVM_SRC}/build/device-pgo}"
+JOBS="${JOBS:-$(nproc)}"
+
+echo "llvm-project source : ${LLVM_SRC}"
+echo "build directory     : ${BUILD_DIR}"
+echo "parallel jobs       : ${JOBS}"
+
+cmake -G Ninja \
+  -S "${LLVM_SRC}/llvm" \
+  -B "${BUILD_DIR}" \
+  -C "${SCRIPT_DIR}/toolchain-cache.cmake"
+
+# The 'clang' target also produces the clang++ symlink. The offload toolchain
+# tools (clang-offload-bundler, clang-linker-wrapper, llvm-link,
+# llvm-offload-binary) and offload-arch (also installed as amdgpu-arch) are
+# needed to compile/link a HIP program and to resolve --offload-arch=native /
+# the multi-device test feature. 'runtimes' builds both the host (default) and
+# amdgcn device runtime targets.
+ninja -C "${BUILD_DIR}" -j "${JOBS}" \
+  clang lld \
+  clang-offload-bundler clang-linker-wrapper llvm-link llvm-offload-binary \
+  offload-arch \
+  llvm-profdata llvm-cov FileCheck not \
+  runtimes
+
+cat <<EOF
+
+Build complete.
+
+Toolchain bin : ${BUILD_DIR}/bin
+Run the GPU tests with, e.g.:
+
+  python3 ${SCRIPT_DIR}/../run_gpu_tests.py \\
+      --toolchain-bin ${BUILD_DIR}/bin \\
+      --hip-lib-path \${ROCM_PATH:-/opt/rocm}/lib \\
+      ${SCRIPT_DIR}/../GPU ${SCRIPT_DIR}/../AMDGPU
+
+(--toolchain-bin must be an absolute path; the runner executes RUN lines from a
+temp dir. See README.md for more.)
+EOF
diff --git a/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
new file mode 100644
index 0000000000000..f48656c66d82d
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
@@ -0,0 +1,55 @@
+# CMake cache for a standalone (non-TheRock) build of everything needed to
+# compile, run, and FileCheck the HIP device-PGO / code-coverage tests under
+# compiler-rt/test/profile/{GPU,AMDGPU}.
+#
+# It produces, in a single configure:
+#   * the host toolchain: clang, clang++, lld, llvm-profdata, llvm-cov, plus the
+#     test utilities FileCheck and not (LLVM_INSTALL_UTILS);
+#   * the host ROCm drain runtime clang_rt.profile_rocm (opt-in, links the
+#     sanitizer interception object libs -- hence COMPILER_RT_BUILD_SANITIZERS);
+#   * the amdgcn device profile runtime libclang_rt.profile.a (the baremetal
+#     profile subset providing __llvm_profile_instrument_gpu and the
+#     __llvm_profile_sections bounds table), built for the amdgcn-amd-amdhsa
+#     runtime target via compiler-rt/cmake/caches/AMDGPU.cmake. Building the
+#     device runtime requires LLVM libc for amdgcn, so libc is enabled for that
+#     runtime target.
+#
+# Usage (see ./build.sh for a wrapper):
+#   cmake -G Ninja -S llvm -B build/device-pgo \
+#         -C compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
+#   ninja -C build/device-pgo clang lld clang-offload-bundler \
+#         clang-linker-wrapper llvm-link llvm-offload-binary offload-arch \
+#         llvm-profdata llvm-cov FileCheck not runtimes
+#
+# Outputs (under build/device-pgo):
+#   bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+#   lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+#   lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+
+set(CMAKE_BUILD_TYPE Release CACHE STRING "")
+
+set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "")
+set(LLVM_ENABLE_RUNTIMES "compiler-rt" CACHE STRING "")
+set(LLVM_TARGETS_TO_BUILD "host;AMDGPU" CACHE STRING "")
+set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "")
+set(LLVM_ENABLE_ASSERTIONS ON CACHE BOOL "")
+
+set(CLANG_DEFAULT_LINKER "lld" CACHE STRING "")
+set(CLANG_DEFAULT_RTLIB "compiler-rt" CACHE STRING "")
+
+# Make FileCheck / not available in the install/bin tree for the lit-lite runner.
+set(LLVM_INSTALL_UTILS ON CACHE BOOL "")
+
+# Build host (default) and device (amdgcn) runtimes in one tree.
+set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa" CACHE STRING "")
+
+# Host runtimes: turn on the opt-in ROCm host drain library. It pulls in the
+# sanitizer interception object libs, so sanitizers must be built too.
+set(RUNTIMES_default_COMPILER_RT_BUILD_PROFILE_ROCM ON CACHE BOOL "")
+set(RUNTIMES_default_COMPILER_RT_BUILD_SANITIZERS ON CACHE BOOL "")
+
+# Device runtime: the amdgcn baremetal profile subset, built with LLVM libc for
+# amdgcn (freestanding C headers).
+set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES
+  "${CMAKE_SOURCE_DIR}/../compiler-rt/cmake/caches/AMDGPU.cmake" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc" CACHE STRING "")
diff --git a/compiler-rt/test/profile/run_gpu_tests.py b/compiler-rt/test/profile/run_gpu_tests.py
new file mode 100644
index 0000000000000..d339969ba245e
--- /dev/null
+++ b/compiler-rt/test/profile/run_gpu_tests.py
@@ -0,0 +1,380 @@
+#!/usr/bin/env python3
+"""Minimal lit-style runner for the HIP device-PGO tests.
+
+The compiler-rt profile lit suite (and llvm-lit / FileCheck) is not part of the
+installed ROCm artifact, but the toolchain, the amdgcn device profile runtime,
+and the HIP runtime are. This runner executes the
+``compiler-rt/test/profile/{GPU,AMDGPU}/*.hip`` tests directly against an
+installed toolchain on a real GPU runner, interpreting just the slice of lit
+markup those tests use:
+
+  - ``// REQUIRES:`` / ``// UNSUPPORTED:`` boolean feature gating,
+  - ``// RUN:`` lines (with ``\\`` continuations) and the fixed substitution set
+    (%clang, %s, %t[.*], %amdgpu_arch, %hip_lib_path, %run, %%),
+  - delegation to ``FileCheck`` / ``not`` (real binaries if present on PATH,
+    otherwise shims backed by the ``filecheck`` PyPI package and a tiny
+    exit-code inverter).
+
+Each RUN line is executed via ``bash -e -o pipefail -c`` so pipes, redirection
+and globbing behave as under lit. A test passes iff all its RUN lines exit 0.
+"""
+
+import argparse
+import os
+import re
+import shutil
+import stat
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+
+# --- feature detection ------------------------------------------------------
+
+
+def _count_visible_gpus(toolchain_bin):
+    """Number of GPUs actually visible to the runtime, or 0 if unknown.
+
+    Uses the toolchain's ``amdgpu-arch`` (one line per visible device). Unlike
+    the KFD topology under ``/sys/class/kfd`` this reflects what HIP/ROCr really
+    exposes -- it honours ``ROCR_VISIBLE_DEVICES`` / ``HIP_VISIBLE_DEVICES`` and
+    container device limits, so it matches what a test's ``hipGetDeviceCount``
+    will see. It is also portable: Windows has no ``/dev/kfd``, but does ship
+    ``amdgpu-arch``.
+    """
+    if not toolchain_bin:
+        return 0
+    tb = Path(toolchain_bin)
+    exe = next(
+        (str(tb / c) for c in ("amdgpu-arch", "amdgpu-arch.exe") if (tb / c).exists()),
+        None,
+    )
+    if exe is None:
+        return 0
+    try:
+        proc = subprocess.run(exe, capture_output=True, text=True, timeout=60)
+    except (OSError, subprocess.SubprocessError):
+        return 0
+    if proc.returncode != 0:
+        return 0
+    return sum(1 for line in proc.stdout.splitlines() if line.strip())
+
+
+def detect_features(toolchain_bin=None, force=None):
+    """Return the set of lit features available on this runner.
+
+    hip/amdgpu are assumed present (this runner only ever drives GPU tests on a
+    runner that has the toolchain + HIP). ``multi-device`` is derived from the
+    number of GPUs the runtime actually exposes (>= 2), via ``amdgpu-arch``.
+    """
+    features = {"hip", "amdgpu"}
+    if sys.platform.startswith("linux"):
+        features.add("linux")
+    elif sys.platform.startswith("win"):
+        features.add("windows")
+
+    if _count_visible_gpus(toolchain_bin) >= 2:
+        features.add("multi-device")
+
+    if force:
+        for f in force:
+            features.add(f)
+    return features
+
+
+# --- boolean expression evaluation (REQUIRES / UNSUPPORTED) ------------------
+
+_TOKEN_RE = re.compile(r"\s*(\(|\)|\|\||&&|!|[\w.+-]+)\s*")
+
+
+def _clause_to_py(clause):
+    out = []
+    for tok in _TOKEN_RE.findall(clause):
+        if tok == "||":
+            out.append(" or ")
+        elif tok == "&&":
+            out.append(" and ")
+        elif tok == "!":
+            out.append(" not ")
+        elif tok in ("(", ")"):
+            out.append(tok)
+        elif tok == "true":
+            out.append("True")
+        elif tok == "false":
+            out.append("False")
+        else:
+            out.append("(%r in FEATURES)" % tok)
+    return "".join(out) or "True"
+
+
+def eval_requires(expr, features):
+    """All comma-separated clauses must be true."""
+    return all(
+        eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+        for c in expr.split(",")
+        if c.strip()
+    )
+
+
+def eval_unsupported(expr, features):
+    """Unsupported if any comma-separated clause is true."""
+    return any(
+        eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+        for c in expr.split(",")
+        if c.strip()
+    )
+
+
+# --- test parsing -----------------------------------------------------------
+
+_DIRECTIVE_RE = re.compile(r"(?://|#)\s*(RUN|REQUIRES|UNSUPPORTED):\s?(.*)")
+
+
+def parse_test(path):
+    """Return (run_lines, requires, unsupported) for a test file."""
+    runs, requires, unsupported = [], [], []
+    cont = None
+    for raw in Path(path).read_text(errors="replace").splitlines():
+        m = _DIRECTIVE_RE.search(raw)
+        if cont is not None:
+            # Continuation of a previous RUN line.
+            text = raw
+            cm = re.search(r"(?://|#)\s*RUN:\s?(.*)", raw)
+            if cm:
+                text = cm.group(1)
+            cont += " " + text.strip()
+            if cont.rstrip().endswith("\\"):
+                cont = cont.rstrip()[:-1]
+            else:
+                runs.append(cont)
+                cont = None
+            continue
+        if not m:
+            continue
+        kind, body = m.group(1), m.group(2)
+        if kind == "REQUIRES":
+            requires.append(body.strip())
+        elif kind == "UNSUPPORTED":
+            unsupported.append(body.strip())
+        elif kind == "RUN":
+            if body.rstrip().endswith("\\"):
+                cont = body.rstrip()[:-1]
+            else:
+                runs.append(body)
+    return runs, requires, unsupported
+
+
+# --- substitutions ----------------------------------------------------------
+
+
+def make_substitutions(clang, clangxx, src, tprefix, arch, hip_lib_path):
+    # Order matters: longer / more specific tokens first; %% resolved last.
+    return [
+        ("%clangxx", clangxx),
+        ("%clang", clang),
+        ("%amdgpu_arch", arch),
+        ("%hip_lib_path", hip_lib_path),
+        ("%run ", ""),
+        ("%s", str(src)),
+        ("%t", tprefix),
+        ("%%", "%"),
+    ]
+
+
+def apply_substitutions(line, subs):
+    for token, value in subs:
+        line = line.replace(token, value)
+    return line
+
+
+# --- tool shims (FileCheck / not) -------------------------------------------
+
+
+def ensure_tools(toolchain_bin, workdir):
+    """Build a PATH that resolves clang/llvm-*, FileCheck and not.
+
+    Prefers real binaries under toolchain_bin; falls back to shims for FileCheck
+    (PyPI ``filecheck``) and ``not`` (exit-code inverter).
+    """
+    shim_dir = workdir / "shims"
+    shim_dir.mkdir(parents=True, exist_ok=True)
+    path = os.pathsep.join([str(toolchain_bin), str(shim_dir), os.environ.get("PATH", "")])
+
+    def have(tool):
+        # File-based check (shutil.which is quirky across OSes / Git Bash). The
+        # shims are extensionless bash scripts, which Git Bash resolves via the
+        # shebang, so a real binary is anything matching tool or tool.exe.
+        tb = Path(toolchain_bin)
+        return (tb / tool).exists() or (tb / (tool + ".exe")).exists()
+
+    def write_shim(name, body):
+        p = shim_dir / name
+        p.write_text(body)
+        p.chmod(p.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
+
+    if not have("FileCheck"):
+        write_shim(
+            "FileCheck",
+            '#!/usr/bin/env bash\n'
+            'if command -v filecheck >/dev/null 2>&1; then exec filecheck "$@"; fi\n'
+            'exec python3 -m filecheck "$@"\n',
+        )
+    if not have("not"):
+        write_shim(
+            "not",
+            "#!/usr/bin/env bash\n"
+            'if [ "$1" = "--crash" ]; then shift; "$@"; ec=$?; '
+            "[ $ec -ge 128 ] && exit 0 || exit 1; fi\n"
+            '"$@"; ec=$?; [ $ec -eq 0 ] && exit 1 || exit 0\n',
+        )
+    return path
+
+
+# --- execution --------------------------------------------------------------
+
+
+def run_one(path, args, features, base_env):
+    runs, requires, unsupported = parse_test(path)
+
+    for expr in requires:
+        if not eval_requires(expr, features):
+            return "UNSUPPORTED", "missing requirement: %s" % expr
+    for expr in unsupported:
+        if eval_unsupported(expr, features):
+            return "UNSUPPORTED", "unsupported: %s" % expr
+    if not runs:
+        return "UNSUPPORTED", "no RUN lines"
+
+    workdir = Path(tempfile.mkdtemp(prefix="profgpu-"))
+    tprefix = str(workdir / "t")
+    subs = make_substitutions(
+        args.clang, args.clangxx, Path(path).resolve(), tprefix, args.amdgpu_arch,
+        args.hip_lib_path,
+    )
+
+    if args.dry_run:
+        print("# %s" % path)
+        for line in runs:
+            print("    " + apply_substitutions(line, subs).strip())
+        return "DRYRUN", ""
+
+    env = dict(base_env)
+    env["PATH"] = ensure_tools(Path(args.toolchain_bin), workdir)
+    timeout = args.timeout if args.timeout and args.timeout > 0 else None
+    for line in runs:
+        cmd = apply_substitutions(line, subs).strip()
+        try:
+            proc = subprocess.run(
+                ["bash", "-e", "-o", "pipefail", "-c", cmd],
+                cwd=str(workdir),
+                env=env,
+                capture_output=True,
+                text=True,
+                timeout=timeout,
+            )
+        except subprocess.TimeoutExpired as e:
+            out = e.stdout or ""
+            err = e.stderr or ""
+            if isinstance(out, bytes):
+                out = out.decode("utf-8", "replace")
+            if isinstance(err, bytes):
+                err = err.decode("utf-8", "replace")
+            detail = "RUN timed out after %gs: %s\n%s%s" % (
+                timeout, cmd, out, err,
+            )
+            if not args.keep:
+                shutil.rmtree(workdir, ignore_errors=True)
+            return "FAIL", detail
+        if proc.returncode != 0:
+            detail = "RUN failed (rc=%d): %s\n%s%s" % (
+                proc.returncode, cmd, proc.stdout, proc.stderr,
+            )
+            if not args.keep:
+                shutil.rmtree(workdir, ignore_errors=True)
+            return "FAIL", detail
+    if not args.keep:
+        shutil.rmtree(workdir, ignore_errors=True)
+    return "PASS", ""
+
+
+def discover(paths):
+    tests = []
+    for p in paths:
+        p = Path(p)
+        if p.is_dir():
+            tests.extend(sorted(str(x) for x in p.rglob("*.hip")))
+        elif p.is_file():
+            tests.append(str(p))
+    return tests
+
+
+def main():
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("tests", nargs="+", help="Test files or directories")
+    ap.add_argument("--toolchain-bin", required=False,
+                    help="Directory with clang and llvm-* tools")
+    ap.add_argument("--hip-lib-path", default="", help="Directory with libamdhip64")
+    ap.add_argument("--amdgpu-arch", default="native")
+    ap.add_argument("--clang", help="Override clang path")
+    ap.add_argument("--clangxx", help="Override clang++ path")
+    ap.add_argument("--feature", action="append", default=[],
+                    help="Force-enable an extra lit feature")
+    ap.add_argument("--dry-run", action="store_true",
+                    help="Print resolved RUN lines without executing")
+    ap.add_argument("--keep", action="store_true", help="Keep per-test temp dirs")
+    ap.add_argument("--timeout", type=float, default=600,
+                    help="Per-RUN-line timeout in seconds (<=0 disables); "
+                         "guards against a hung GPU/compiler wedging the run")
+    args = ap.parse_args()
+
+    if not args.dry_run and not args.toolchain_bin:
+        ap.error("--toolchain-bin is required unless --dry-run is given")
+
+    if args.toolchain_bin:
+        binp = Path(args.toolchain_bin)
+        args.clang = args.clang or str(binp / "clang")
+        args.clangxx = args.clangxx or str(binp / "clang++")
+    else:
+        args.clang = args.clang or "clang"
+        args.clangxx = args.clangxx or "clang++"
+
+    features = detect_features(args.toolchain_bin, args.feature)
+    print("# features: %s" % ", ".join(sorted(features)))
+
+    base_env = dict(os.environ)
+    if args.toolchain_bin:
+        lib_dirs = [
+            str(Path(args.toolchain_bin).parent / "lib"),  # toolchain libs
+        ]
+        if args.hip_lib_path:
+            lib_dirs.append(args.hip_lib_path)
+        existing = base_env.get("LD_LIBRARY_PATH", "")
+        base_env["LD_LIBRARY_PATH"] = os.pathsep.join(
+            [d for d in lib_dirs if d] + ([existing] if existing else [])
+        )
+
+    tests = discover(args.tests)
+    if not tests:
+        print("error: no tests found", file=sys.stderr)
+        return 2
+
+    results = {"PASS": [], "FAIL": [], "UNSUPPORTED": [], "DRYRUN": []}
+    for t in tests:
+        status, detail = run_one(t, args, features, base_env)
+        results[status].append(t)
+        if status == "FAIL":
+            print("FAIL: %s" % t)
+            print(detail)
+        elif status in ("PASS", "UNSUPPORTED"):
+            print("%s: %s" % (status, t))
+
+    print(
+        "\n# summary: %d passed, %d failed, %d unsupported (of %d)"
+        % (len(results["PASS"]), len(results["FAIL"]),
+           len(results["UNSUPPORTED"]), len(tests))
+    )
+    return 1 if results["FAIL"] else 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())