[clang] [compiler-rt] [PGO][HIP] HSA-introspection device profile drain + GPU PGO tests (PR #203056)
Larry Meadows via cfe-commits
cfe-commits at lists.llvm.org
Sat Jun 13 14:53:47 PDT 2026
https://github.com/lfmeadow updated https://github.com/llvm/llvm-project/pull/203056
>From 575f97951812585229c35ed82ff9bb3ce5c5519e Mon Sep 17 00:00:00 2001
From: Larry Meadows <Lawrence.Meadows at amd.com>
Date: Sat, 13 Jun 2026 11:48:37 -0500
Subject: [PATCH] [PGO][HIP] Add supplemental HSA-introspection device drain
and GPU PGO tests
The host-shadow device-profile drain (InstrProfilingPlatformROCm.cpp) can only
collect device counters for kernels that registered a host-side shadow via
__hipRegisterVar. Device-linked programs (e.g. RCCL) link the instrumented code
object directly into the device image with no host shadow, so their counters are
never drained.
Add a supplemental, Linux-only drain that introspects the loaded code objects via
the HSA runtime: it walks each GPU agent, enumerates only the code objects
actually resident there, reads each one's __llvm_profile_sections, and routes
them through the existing processDeviceOffloadPrf() path so the emitted profraw
layout is identical. A small content-dedup set keyed on the
(data, counters, names) device-pointer triple ensures a section already drained
by the host-shadow pass is not drained again, so the two passes compose without
double-counting. HSA is brought up lazily from the drain (never from a library
constructor) to avoid poisoning fork-based callers.
Because the HSA walk only ever touches resident code objects, it also makes the
host-shadow pass's collect-all fallback unnecessary on Linux: when no kernel
launch was tracked (a program that never launches, collects before its first
launch, or launches only via an untracked API), the host-shadow pass is skipped
and the HSA drain covers it safely instead of faulting/hanging on a non-resident
device on a multi-GPU host.
Also add a GPU-executed test suite (compiler-rt/test/profile/{GPU,AMDGPU}/*.hip)
and a dependency-free "lit-lite" runner (run_gpu_tests.py) so the device drain
can be exercised on a real AMD GPU runner: basic/coverage/pgo-use, multi-kernel,
device-branching, multi-GPU and non-default-device drain + dedup, early-collect /
no-kernel edges, RDC vs non-RDC __llvm_profile_sections, fork-safety (the RCCL
parent-no-HIP / kernel-in-forked-child pattern), quantitative device-counter
correctness, multi-process offline accumulation, and explicit-collect
idempotency. A standalone device-pgo/ build helper reproduces the toolchain
locally.
Link the device profile runtime on both the new-offload-driver
(LinkerWrapper::ConstructJob) and traditional (HIPAMD constructLldCommand) link
paths so instrumented device images resolve the runtime symbols.
Co-authored-by: Cursor <cursoragent at cursor.com>
---
clang/lib/Driver/ToolChains/Clang.cpp | 15 +
clang/lib/Driver/ToolChains/HIPAMD.cpp | 20 +
.../profile/InstrProfilingPlatformROCm.cpp | 682 ++++++++++++++++--
.../test/profile/AMDGPU/device-basic.hip | 67 ++
.../profile/AMDGPU/device-early-collect.hip | 68 ++
.../test/profile/AMDGPU/device-no-kernel.hip | 44 ++
.../test/profile/AMDGPU/device-symbols.hip | 42 ++
.../test/profile/AMDGPU/lit.local.cfg.py | 4 +
.../test/profile/GPU/instrprof-hip-basic.hip | 51 ++
.../GPU/instrprof-hip-collect-after.hip | 63 ++
.../GPU/instrprof-hip-counter-correctness.hip | 56 ++
.../profile/GPU/instrprof-hip-coverage.hip | 51 ++
.../GPU/instrprof-hip-device-branching.hip | 67 ++
.../profile/GPU/instrprof-hip-fork-safety.hip | 61 ++
.../profile/GPU/instrprof-hip-multi-gpu.hip | 57 ++
.../GPU/instrprof-hip-multi-process-merge.hip | 63 ++
.../GPU/instrprof-hip-multiple-kernels.hip | 58 ++
.../GPU/instrprof-hip-nondefault-device.hip | 60 ++
.../profile/GPU/instrprof-hip-pgo-use.hip | 63 ++
compiler-rt/test/profile/device-pgo/README.md | 125 ++++
compiler-rt/test/profile/device-pgo/build.sh | 56 ++
.../profile/device-pgo/toolchain-cache.cmake | 55 ++
compiler-rt/test/profile/run_gpu_tests.py | 380 ++++++++++
23 files changed, 2156 insertions(+), 52 deletions(-)
create mode 100644 compiler-rt/test/profile/AMDGPU/device-basic.hip
create mode 100644 compiler-rt/test/profile/AMDGPU/device-early-collect.hip
create mode 100644 compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
create mode 100644 compiler-rt/test/profile/AMDGPU/device-symbols.hip
create mode 100644 compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
create mode 100644 compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
create mode 100644 compiler-rt/test/profile/device-pgo/README.md
create mode 100755 compiler-rt/test/profile/device-pgo/build.sh
create mode 100644 compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
create mode 100644 compiler-rt/test/profile/run_gpu_tests.py
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index c2ac478d84929..3b8bc46820af6 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
(TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX()))
LinkerArgs.emplace_back("-lompdevice");
+ // With PGO/coverage instrumentation, GPU device code references the
+ // device profile runtime (__llvm_profile_instrument_gpu and the
+ // __llvm_profile_sections bounds table emitted by
+ // InstrProfilingPlatformGPU). The offload device link does not otherwise
+ // pull it in, so forward the static device profile runtime to the GPU
+ // device linker. The archive is arch-suffixed, so pass its full path
+ // rather than a -l name.
+ if (ToolChain::needsProfileRT(Args) &&
+ (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) {
+ std::string ProfileRT =
+ TC->getCompilerRT(Args, "profile", ToolChain::FT_Static);
+ if (TC->getVFS().exists(ProfileRT))
+ LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT));
+ }
+
// For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree
// SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these
// flags to normal compilation.
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 01cb23d0aa230..1bd4e073b4e27 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
#include "clang/Options/Options.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
#include "llvm/TargetParser/TargetParser.h"
using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
LldArgs.push_back("--no-whole-archive");
+ // With PGO/coverage instrumentation, instrumented device code references the
+ // device profile runtime (__llvm_profile_instrument_gpu and the
+ // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+ // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+ // but HIP using the traditional offload path (e.g. on Windows, which does not
+ // route device linking through clang-linker-wrapper) reaches the device link
+ // here instead. Forward the static device profile runtime to this lld device
+ // link so the runtime is pulled in regardless of offload-driver/host OS. The
+ // archive is arch-suffixed, so pass its full path rather than a -l name.
+ if (ToolChain::needsProfileRT(Args)) {
+ std::string ProfileRT =
+ TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+ // Use the ToolChain VFS (matches the new-offload-driver path in
+ // Clang.cpp) so overlay/virtual filesystems used by the driver are
+ // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+ if (TC.getVFS().exists(ProfileRT))
+ LldArgs.push_back(Args.MakeArgString(ProfileRT));
+ }
+
const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
Lld, LldArgs, Inputs, Output));
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
const OffloadSectionShadowGroup *Sections);
+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+ const void *Names);
+#endif
+
static int isVerboseMode() {
static int IsVerbose = -1;
if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
if (ret != 0) {
PROF_ERR("%s\n", "failed to write device profile using shared API");
- } else if (isVerboseMode()) {
- PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+ } else {
+#if defined(__linux__) && !defined(_WIN32)
+ // Dedup against the supplemental HSA pass: this section is now drained, so
+ // the HSA walk must not drain the same device code object again.
+ profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+ if (isVerboseMode())
+ PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
}
return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
}
-/* -------------------------------------------------------------------------- */
-/* Collect device-side profile data */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/* Supplemental HSA-introspection drain (Linux only) */
+/* */
+/* The host-shadow drain above only sees device code objects registered */
+/* host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/* hipModuleLoad* call. Device code linked by the offload device linker with */
+/* no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/* into a single kernel with no source module -- is invisible to it. This */
+/* pass walks every GPU agent's loaded executables via HSA, finds each */
+/* __llvm_profile_sections table directly on the device, and drains the ones */
+/* the host-shadow pass did not already handle (deduped by the device */
+/* section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/* copy/relocate/write so the on-disk profraw layout is identical. */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
- if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+ uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+ uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+ uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE \
+ ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH \
+ ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME \
+ ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS \
+ ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+ prof_hsa_agent_t agent;
+ prof_hsa_executable_t executable;
+ prof_hsa_loader_storage_type_t code_object_storage_type;
+ const void *code_object_storage_base;
+ size_t code_object_storage_size;
+ size_t code_object_storage_offset;
+ const void *segment_base;
+ size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+ prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+ prof_hsa_agent_info_t,
+ void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+ prof_hsa_executable_t, prof_hsa_agent_t,
+ prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+ prof_hsa_executable_symbol_t, void *),
+ void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+ prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+ uint16_t,
+ size_t,
+ void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+ prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+ void *query_host_address;
+ hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+ __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+ return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+ int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+ if (State)
+ return State > 0 ? 0 : -1;
+
+ if (!__interception::DynamicLoaderAvailable()) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "Dynamic library loading not available - "
+ "HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+ if (!Hsa)
+ Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+ if (!Hsa) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+ "HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ hsa_init_ty pHsaInit =
+ (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+ hsa_system_get_major_extension_table_ty pGetExtTable =
+ (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+ Hsa, "hsa_system_get_major_extension_table");
+ pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+ Hsa, "hsa_iterate_agents");
+ pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+ Hsa, "hsa_agent_get_info");
+ pHsaExecIterAgentSyms =
+ (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+ Hsa, "hsa_executable_iterate_agent_symbols");
+ pHsaSymGetInfo =
+ (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+ Hsa, "hsa_executable_symbol_get_info");
+
+ if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+ !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+ PROF_WARN("%s",
+ "required HSA symbols missing - HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+ * rather than from the library constructor, so merely loading the
+ * instrumented library does not initialize HSA in the process -- which would
+ * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+ * the parent (see the constructor note at the end of the HSA block). In the
+ * common case the drain runs from the profile write path while HSA is still
+ * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+ * this simply re-initializes HSA (the process is exiting anyway). */
+ prof_hsa_status_t St = pHsaInit();
+ if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+ if (isVerboseMode())
+ PROF_NOTE("hsa_init failed (0x%x) - HSA device profiling disabled\n", St);
+ return setHsaRuntimeState(-1);
+ }
+
+ prof_hsa_loader_pfn_t LoaderApi;
+ __builtin_memset(&LoaderApi, 0, sizeof(LoaderApi));
+ St = pGetExtTable(PROF_HSA_EXTENSION_AMD_LOADER, 1, sizeof(LoaderApi),
+ &LoaderApi);
+ if (St != PROF_HSA_STATUS_SUCCESS || !LoaderApi.query_segment_descriptors) {
+ PROF_WARN("AMD loader extension unavailable (0x%x) - "
+ "HSA device profiling disabled\n",
+ St);
+ return setHsaRuntimeState(-1);
+ }
+ pQuerySegDescs = LoaderApi.query_segment_descriptors;
+
+ /* The device-to-host copies go through the shared HIP loader. */
+ ensureHipLoaded();
+ if (!pHipMemcpy) {
+ PROF_WARN("%s", "hipMemcpy unavailable - HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ if (isVerboseMode())
+ PROF_NOTE("%s", "HSA + HIP runtime resolved for device profiling\n");
+ return setHsaRuntimeState(1);
+}
+
+/* The canonical device bounds-table symbol from InstrProfilingPlatformGPU.c. */
+static const char ProfileSectionsSymbol[] = "__llvm_profile_sections";
+
+/* Dedup of drained section-bounds tuples, shared with the host-shadow path
+ * (processDeviceOffloadPrf records here on every successful drain). A single
+ * linked device code object exposes one __llvm_profile_sections, but the same
+ * bounds may be seen via multiple agents, so each unique counter set is
+ * drained exactly once across both paths. */
+namespace {
+struct ProfBoundsTuple {
+ const void *data;
+ const void *cnts;
+ const void *names;
+};
+} // namespace
+
+#define PROF_MAX_SEEN_BOUNDS 256
+static ProfBoundsTuple SeenBounds[PROF_MAX_SEEN_BOUNDS];
+static int NumSeenBounds = 0;
+
+/* Pure check: has this bounds tuple already been drained? Does not mutate
+ * state, so a transient failure does not permanently suppress retries. */
+static int profBoundsAlreadyDrained(const void *D, const void *C,
+ const void *N) {
+ for (int i = 0; i < NumSeenBounds; ++i)
+ if (SeenBounds[i].data == D && SeenBounds[i].cnts == C &&
+ SeenBounds[i].names == N)
+ return 1;
+ return 0;
+}
+
+/* Record a drained bounds tuple. Idempotent. Called after a successful drain
+ * (either path) so a failed attempt stays retryable. */
+static void profRecordDrainedBounds(const void *D, const void *C,
+ const void *N) {
+ if (profBoundsAlreadyDrained(D, C, N))
+ return;
+ if (NumSeenBounds < PROF_MAX_SEEN_BOUNDS) {
+ SeenBounds[NumSeenBounds].data = D;
+ SeenBounds[NumSeenBounds].cnts = C;
+ SeenBounds[NumSeenBounds].names = N;
+ NumSeenBounds++;
+ }
+}
+
+#define PROF_MAX_GPU_AGENTS 64
+
+namespace {
+struct GpuAgent {
+ prof_hsa_agent_t agent;
+ char arch[64];
+};
+
+struct WalkState {
+ GpuAgent agents[PROF_MAX_GPU_AGENTS];
+ int num_agents;
+ int total_found;
+ int total_drained;
+};
+
+/* Per (agent, executable) symbol-iteration state. */
+struct SymbolState {
+ const char *arch;
+ int found;
+ int drained;
+};
+} // namespace
+
+/* HSA per-symbol callback: when it finds a __llvm_profile_sections variable,
+ * drain it via processDeviceOffloadPrf() unless the host-shadow path (or an
+ * earlier agent) already handled the same bounds. */
+static prof_hsa_status_t onSymbol(prof_hsa_executable_t, prof_hsa_agent_t,
+ prof_hsa_executable_symbol_t Sym,
+ void *Data) {
+ SymbolState *S = (SymbolState *)Data;
+
+ prof_hsa_symbol_kind_t Kind;
+ if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE, &Kind) !=
+ PROF_HSA_STATUS_SUCCESS ||
+ Kind != PROF_HSA_SYMBOL_KIND_VARIABLE)
+ return PROF_HSA_STATUS_SUCCESS;
+
+ uint32_t NameLen = 0;
+ if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH,
+ &NameLen) != PROF_HSA_STATUS_SUCCESS ||
+ NameLen != sizeof(ProfileSectionsSymbol) - 1)
+ return PROF_HSA_STATUS_SUCCESS;
+
+ char NameBuf[64];
+ if (NameLen + 1 > sizeof(NameBuf))
+ return PROF_HSA_STATUS_SUCCESS;
+ if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME, NameBuf) !=
+ PROF_HSA_STATUS_SUCCESS)
+ return PROF_HSA_STATUS_SUCCESS;
+ NameBuf[NameLen] = '\0';
+
+ if (strcmp(NameBuf, ProfileSectionsSymbol) != 0)
+ return PROF_HSA_STATUS_SUCCESS;
+
+ uint64_t Addr = 0;
+ if (pHsaSymGetInfo(Sym, PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS,
+ &Addr) != PROF_HSA_STATUS_SUCCESS ||
+ Addr == 0) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "failed to read __llvm_profile_sections address\n");
+ return PROF_HSA_STATUS_SUCCESS;
+ }
+
+ S->found++;
+
+ // Read the bounds table first to dedup (and detect empty sections) before
+ // the full copy/relocate done by processDeviceOffloadPrf.
+ __llvm_profile_gpu_sections Sec;
+ if (memcpyDeviceToHost(&Sec, (void *)(uintptr_t)Addr, sizeof(Sec)) != 0) {
+ PROF_WARN("%s", "failed to copy device bounds table\n");
+ return PROF_HSA_STATUS_SUCCESS;
+ }
+ if (profBoundsAlreadyDrained(Sec.DataStart, Sec.CountersStart,
+ Sec.NamesStart)) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "device bounds already drained, skipping\n");
+ return PROF_HSA_STATUS_SUCCESS;
+ }
+
+ size_t DataBytes = (const char *)Sec.DataStop - (const char *)Sec.DataStart;
+ size_t CntsBytes =
+ (const char *)Sec.CountersStop - (const char *)Sec.CountersStart;
+ if (DataBytes == 0 || CntsBytes == 0) {
+ // Empty code object: nothing to write. Mark seen so we don't revisit it.
+ profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+ return PROF_HSA_STATUS_SUCCESS;
+ }
+
+ // Generate a collision-free target. Multiple distinct device code objects on
+ // the same arch (e.g. non-RDC multi-TU) must not clobber each other's file.
+ static int DrainIndex = 0;
+ char Target[96];
+ if (DrainIndex == 0)
+ snprintf(Target, sizeof(Target), "%s", S->arch);
+ else
+ snprintf(Target, sizeof(Target), "%s.%d", S->arch, DrainIndex);
+
+ // processDeviceOffloadPrf returns 0 on a successful write, -1 on error.
+ // Record the bounds (and advance the target index) only on success so a
+ // transient error stays retryable on a later agent or collect call.
+ int Rc = processDeviceOffloadPrf((void *)(uintptr_t)Addr, Target, nullptr);
+ if (Rc == 0) {
+ S->drained++;
+ DrainIndex++;
+ profRecordDrainedBounds(Sec.DataStart, Sec.CountersStart, Sec.NamesStart);
+ }
+
+ return PROF_HSA_STATUS_SUCCESS;
+}
+
+static prof_hsa_status_t collectAgent(prof_hsa_agent_t Agent, void *Data) {
+ prof_hsa_device_type_t DevType;
+ if (pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_DEVICE, &DevType) !=
+ PROF_HSA_STATUS_SUCCESS ||
+ DevType != PROF_HSA_DEVICE_TYPE_GPU)
+ return PROF_HSA_STATUS_SUCCESS;
+
+ WalkState *W = (WalkState *)Data;
+ if (W->num_agents >= PROF_MAX_GPU_AGENTS)
+ return PROF_HSA_STATUS_SUCCESS;
+
+ GpuAgent &GA = W->agents[W->num_agents++];
+ GA.agent = Agent;
+ char Name[64];
+ __builtin_memset(Name, 0, sizeof(Name));
+ pHsaAgentGetInfo(Agent, PROF_HSA_AGENT_INFO_NAME, Name);
+ size_t N = strnlen(Name, sizeof(GA.arch) - 1);
+ __builtin_memcpy(GA.arch, Name, N);
+ GA.arch[N] = '\0';
+ if (!GA.arch[0])
+ strncpy(GA.arch, "amdgpu", sizeof(GA.arch) - 1);
+
+ if (isVerboseMode())
+ PROF_NOTE("GPU agent %d: %s\n", W->num_agents - 1, GA.arch);
+ return PROF_HSA_STATUS_SUCCESS;
+}
+
+/* Reentrancy guard and "drained data at least once" latch. The collect hook
+ * may run more than once (an explicit early __llvm_profile_write_file plus the
+ * exit write); a successful walk latches HsaDrainCompleted so we never re-emit
+ * duplicate .profraw files, while transient no-op outcomes ("runtime not yet
+ * loadable", "no GPU agents", "no loaded segments", "nothing instrumented")
+ * stay retryable so a later call can still pick up code objects loaded later.
+ * HsaDrainInProgress prevents a concurrent or reentrant call (e.g. a library
+ * destructor) from corrupting the global SeenBounds table. Both flags use
+ * acquire/release atomics. */
+static int HsaDrainInProgress = 0;
+static int HsaDrainCompleted = 0;
+
+static int drainDevicesViaHsa(void) {
+ if (__atomic_load_n(&HsaDrainCompleted, __ATOMIC_ACQUIRE))
return 0;
- if (!isHipAvailable())
+ int Expected = 0;
+ if (!__atomic_compare_exchange_n(&HsaDrainInProgress, &Expected, 1,
+ /*weak=*/0, __ATOMIC_ACQ_REL,
+ __ATOMIC_ACQUIRE))
return 0;
- int Ret = 0;
+ struct InProgressGuard {
+ ~InProgressGuard() {
+ __atomic_store_n(&HsaDrainInProgress, 0, __ATOMIC_RELEASE);
+ }
+ } _Guard;
- /* Shadow variables (static-linked kernels): drain from every device. */
- if (NumShadowVariables > 0) {
- int OrigDevice = -1;
- hipGetDevice(&OrigDevice);
+ if (loadHsaRuntimePointers() != 0)
+ return 0; /* Runtime unavailable: stay retryable. */
- for (int Dev = 0; Dev < NumDevices; ++Dev) {
- if (!shouldCollectDevice(Dev)) {
- if (isVerboseMode())
- PROF_NOTE("Skipping unused device %d\n", Dev);
- continue;
+ WalkState W;
+ __builtin_memset(&W, 0, sizeof(W));
+ prof_hsa_status_t St = pHsaIterateAgents(collectAgent, &W);
+ if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+ PROF_WARN("hsa_iterate_agents failed (0x%x)\n", St);
+ return -1;
+ }
+ if (W.num_agents == 0) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "no GPU agents present; nothing to drain (will retry)\n");
+ return 0;
+ }
+
+ /* query_segment_descriptors ships in every loader-extension version and is
+ * more permissive than iterate_executables on ROCm. It yields the loaded
+ * (agent, executable) pairs directly. */
+ size_t NumSegs = 0;
+ St = pQuerySegDescs(nullptr, &NumSegs);
+ if (St != PROF_HSA_STATUS_SUCCESS) {
+ PROF_WARN("query_segment_descriptors(count) failed (0x%x)\n", St);
+ return -1;
+ }
+ if (NumSegs == 0) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "no loaded segments; nothing to drain (will retry)\n");
+ return 0;
+ }
+
+ prof_hsa_loader_segment_descriptor_t *Segs =
+ (prof_hsa_loader_segment_descriptor_t *)calloc(NumSegs, sizeof(*Segs));
+ if (!Segs) {
+ PROF_ERR("%s\n", "failed to allocate segment descriptor array");
+ return -1;
+ }
+ UniqueFree SegsOwner(Segs);
+
+ St = pQuerySegDescs(Segs, &NumSegs);
+ if (St != PROF_HSA_STATUS_SUCCESS) {
+ PROF_WARN("query_segment_descriptors(fetch) failed (0x%x)\n", St);
+ return -1;
+ }
+
+ if (isVerboseMode())
+ PROF_NOTE("query_segment_descriptors: %zu segments\n", NumSegs);
+
+ /* Walk unique (agent, executable) pairs. */
+ enum { kMaxPairs = 512 };
+ uint64_t SeenAgents[kMaxPairs];
+ uint64_t SeenExecs[kMaxPairs];
+ int NumPairs = 0;
+ int IterFailures = 0;
+
+ for (size_t i = 0; i < NumSegs; ++i) {
+ if (Segs[i].executable.handle == 0 || Segs[i].agent.handle == 0)
+ continue;
+
+ int Seen = 0;
+ for (int j = 0; j < NumPairs; ++j)
+ if (SeenAgents[j] == Segs[i].agent.handle &&
+ SeenExecs[j] == Segs[i].executable.handle) {
+ Seen = 1;
+ break;
}
- if (hipSetDevice(Dev) != 0) {
- if (isVerboseMode())
- PROF_NOTE("Failed to set device %d, skipping\n", Dev);
- continue;
+ if (Seen)
+ continue;
+ if (NumPairs < kMaxPairs) {
+ SeenAgents[NumPairs] = Segs[i].agent.handle;
+ SeenExecs[NumPairs] = Segs[i].executable.handle;
+ NumPairs++;
+ }
+
+ const char *Arch = nullptr;
+ for (int k = 0; k < W.num_agents; ++k)
+ if (W.agents[k].agent.handle == Segs[i].agent.handle) {
+ Arch = W.agents[k].arch;
+ break;
}
- const char *ArchName = getDeviceArchName(Dev);
- if (isVerboseMode())
- PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
- ArchName);
- for (int i = 0; i < NumShadowVariables; ++i) {
- /* RDC-mode multi-shadow drains need a distinct profraw per TU;
- * single-TU programs keep the bare arch target. */
- const char *Target = ArchName;
- char TargetWithIdx[64];
- if (NumShadowVariables > 1) {
- snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName, i);
- Target = TargetWithIdx;
+ if (!Arch)
+ continue; /* not a GPU agent we collected */
+
+ SymbolState S;
+ __builtin_memset(&S, 0, sizeof(S));
+ S.arch = Arch;
+ if (isVerboseMode())
+ PROF_NOTE("walking executable 0x%llx on %s\n",
+ (unsigned long long)Segs[i].executable.handle, Arch);
+ prof_hsa_status_t IterSt =
+ pHsaExecIterAgentSyms(Segs[i].executable, Segs[i].agent, onSymbol, &S);
+ if (IterSt != PROF_HSA_STATUS_SUCCESS &&
+ IterSt != PROF_HSA_STATUS_INFO_BREAK) {
+ PROF_WARN("hsa_executable_iterate_agent_symbols on executable 0x%llx "
+ "failed (0x%x)\n",
+ (unsigned long long)Segs[i].executable.handle, IterSt);
+ IterFailures++;
+ }
+ W.total_found += S.found;
+ W.total_drained += S.drained;
+ }
+
+ if (isVerboseMode())
+ PROF_NOTE("HSA walk complete: agents=%d pairs=%d found=%d drained=%d "
+ "iter-failures=%d\n",
+ W.num_agents, NumPairs, W.total_found, W.total_drained,
+ IterFailures);
+
+ /* Latch only when we actually drained data. Deliberately do NOT latch the
+ * "walked everything but found nothing new" case: an early collect call can
+ * run before any kernel launch, and latching it would suppress the real
+ * exit-time drain once kernels do run. Repeating a no-op walk is cheap. */
+ if (W.total_drained > 0)
+ __atomic_store_n(&HsaDrainCompleted, 1, __ATOMIC_RELEASE);
+ return (IterFailures > 0) ? -1 : 0;
+}
+
+/* NOTE: deliberately no library constructor that calls hsa_init() here.
+ * Bringing HSA up merely because the instrumented library was loaded poisons
+ * fork-based callers: frameworks and tests (e.g. RCCL's unit tests) keep
+ * HIP/HSA uninitialized in the parent and only touch HIP inside forked
+ * children. A parent that has already hsa_init()'d makes those children crash
+ * inside HSA (HSA state is not valid across fork()). HSA is instead brought up
+ * lazily from drainDevicesViaHsa() -> loadHsaRuntimePointers(); see the init
+ * rationale there. */
+
+#endif /* defined(__linux__) && !defined(_WIN32) -- HSA drain */
+
+/* -------------------------------------------------------------------------- */
+/* Collect device-side profile data */
+/* -------------------------------------------------------------------------- */
+
+extern "C" int __llvm_profile_hip_collect_device_data(void) {
+ int Ret = 0;
+
+ /* Host-shadow drain: static-linked kernels (host __hipRegisterVar shadows)
+ * and intercepted dynamic modules. Only meaningful when something registered
+ * host-side; skipped entirely for pure device-linked programs (RCCL), which
+ * the supplemental HSA pass below handles. */
+ if ((NumShadowVariables != 0 || NumDynamicModules != 0) && isHipAvailable()) {
+ /* Shadow variables (static-linked kernels): drain from every device. */
+ if (NumShadowVariables > 0) {
+ int OrigDevice = -1;
+ hipGetDevice(&OrigDevice);
+
+ for (int Dev = 0; Dev < NumDevices; ++Dev) {
+ if (!shouldCollectDevice(Dev)) {
+ if (isVerboseMode())
+ PROF_NOTE("Skipping unused device %d\n", Dev);
+ continue;
+ }
+#if defined(__linux__) && !defined(_WIN32)
+ /* When no kernel launch was tracked at all, shouldCollectDevice()
+ * falls back to collect-all, which can fault/hang reading a
+ * non-resident device's sections on a multi-GPU host (e.g. a program
+ * that never launches, collects before its first launch, or launches
+ * only via an untracked API). On Linux the supplemental HSA drain
+ * below covers those cases safely -- it walks only code objects
+ * actually resident on each agent -- so skip the host-shadow pass
+ * entirely rather than take the unsafe fallback. */
+ if (!__atomic_load_n(&AnyDeviceUsed, __ATOMIC_ACQUIRE)) {
+ if (isVerboseMode())
+ PROF_NOTE("No tracked launch; deferring device %d to HSA drain\n",
+ Dev);
+ continue;
+ }
+#endif
+ if (hipSetDevice(Dev) != 0) {
+ if (isVerboseMode())
+ PROF_NOTE("Failed to set device %d, skipping\n", Dev);
+ continue;
+ }
+ const char *ArchName = getDeviceArchName(Dev);
+ if (isVerboseMode())
+ PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
+ ArchName);
+ for (int i = 0; i < NumShadowVariables; ++i) {
+ /* RDC-mode multi-shadow drains need a distinct profraw per TU;
+ * single-TU programs keep the bare arch target. */
+ const char *Target = ArchName;
+ char TargetWithIdx[64];
+ if (NumShadowVariables > 1) {
+ snprintf(TargetWithIdx, sizeof(TargetWithIdx), "%s.%d", ArchName,
+ i);
+ Target = TargetWithIdx;
+ }
+ if (processShadowVariable(i, Target) != 0)
+ Ret = -1;
}
- if (processShadowVariable(i, Target) != 0)
- Ret = -1;
}
- }
- if (OrigDevice >= 0)
- hipSetDevice(OrigDevice);
- }
+ if (OrigDevice >= 0)
+ hipSetDevice(OrigDevice);
+ }
- /* Warn about unprocessed TUs; skip cleared slots (already drained). */
- lockDynamicModules();
- for (int i = 0; i < NumDynamicModules; ++i) {
- OffloadDynamicModuleInfo *MI = &DynamicModules[i];
- if (!MI->ModulePtr)
- continue;
- for (int t = 0; t < MI->NumTUs; ++t) {
- if (!MI->TUs[t].Processed) {
- PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
- MI->ModulePtr, t);
- Ret = -1;
+ /* Warn about unprocessed TUs; skip cleared slots (already drained). */
+ lockDynamicModules();
+ for (int i = 0; i < NumDynamicModules; ++i) {
+ OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+ if (!MI->ModulePtr)
+ continue;
+ for (int t = 0; t < MI->NumTUs; ++t) {
+ if (!MI->TUs[t].Processed) {
+ PROF_WARN("dynamic module %p TU %d was not processed before exit\n",
+ MI->ModulePtr, t);
+ Ret = -1;
+ }
}
}
+ unlockDynamicModules();
}
- unlockDynamicModules();
+
+#if defined(__linux__) && !defined(_WIN32)
+ /* Supplemental HSA-introspection drain: catches device code objects with no
+ * host-side shadow (e.g. RCCL device-linked kernels). Runs after the
+ * host-shadow drain so already-drained sections are deduped out, and runs
+ * even when there are no host shadows at all (the common RCCL case). */
+ if (drainDevicesViaHsa() != 0)
+ Ret = -1;
+#endif
if (Ret != 0)
PROF_WARN("%s\n", "failed to collect device profile data");
diff --git a/compiler-rt/test/profile/AMDGPU/device-basic.hip b/compiler-rt/test/profile/AMDGPU/device-basic.hip
new file mode 100644
index 0000000000000..4fcf044802240
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-basic.hip
@@ -0,0 +1,67 @@
+// Basic HIP device PGO drain end-to-end: a host + device .profraw are written
+// at exit (the device one arch-prefixed), they merge, the merged profile
+// contains the device kernel's counters, and llvm-cov reports device-side
+// coverage. Covers both non-RDC and RDC device compiles.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: %t.dir/a.out
+// A device profraw (arch-prefixed) must have been drained alongside the host one.
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN: | FileCheck --check-prefix=FUNCS %s
+// Confirm the embedded device image is extractable (failure here is the real
+// cause of any downstream llvm-cov failure, so let it propagate).
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-cov report %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN: -instr-profile=%t.dir/a.profdata 2>&1 | FileCheck --check-prefix=COV %s
+
+// --- RDC ---
+// RUN: rm -f %t.dir/*.profraw
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: %t.dir/b.out
+// RUN: ls %t.dir/gfx*.profraw
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/b.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/b.profdata \
+// RUN: | FileCheck --check-prefix=FUNCS %s
+
+#include <hip/hip_runtime.h>
+
+__global__ void addk(int *p) {
+ if (*p > 0)
+ *p += 1;
+ else
+ *p -= 1;
+}
+
+int main() {
+ int *d = nullptr;
+ if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+ return 2;
+ int h = 5;
+ (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+ addk<<<1, 1>>>(d);
+ (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+ return h > 0 ? 0 : 1;
+}
+
+// The merged profile contains both the host main and the device kernel,
+// proving the device counters were drained and merged.
+// FUNCS-DAG: addk
+// FUNCS-DAG: main
+
+// COV: TOTAL
diff --git a/compiler-rt/test/profile/AMDGPU/device-early-collect.hip b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
new file mode 100644
index 0000000000000..3e2c6e84e26c2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-early-collect.hip
@@ -0,0 +1,68 @@
+// M1 regression: calling __llvm_profile_hip_collect_device_data() before any
+// kernel has been launched must not poison the later atexit drain. The early
+// call sees "no instrumented code object loaded yet" (a transient no-op) and
+// must not latch the drain as completed; otherwise the post-launch atexit
+// pass produces no device .profraw and we silently lose device counters.
+//
+// REQUIRES: hip, amdgpu
+// Guards the Linux introspection drain's DrainCompleted latch; the Windows
+// host-shadow drain has no such latch (it tracks per-TU Processed flags).
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: %t.dir/a.out
+// Both the host profraw and at least one device profraw (gfx-prefixed) must
+// have been produced, despite the early collection attempt.
+// RUN: ls %t.dir/host.*.profraw
+// RUN: ls %t.dir/gfx*.profraw
+// And the merged profile must contain the device kernel that was launched
+// *after* the early collect.
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions %t.dir/a.profdata \
+// RUN: | FileCheck %s
+
+#include <hip/hip_runtime.h>
+
+// Declared by libclang_rt.profile-<host arch>.a; we call it directly to
+// simulate any caller that drains device counters at an arbitrary point in
+// the program lifetime (e.g. a per-iteration profile dump).
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void post_collect_kernel(int *p) {
+ if (*p > 0)
+ *p += 1;
+ else
+ *p -= 1;
+}
+
+int main() {
+ // (1) Early collection -- runs before any kernel launch. The drainer
+ // finds either no GPU agents, no loaded segments, or no instrumented
+ // bounds table, and returns 0 without latching DrainCompleted.
+ (void)__llvm_profile_hip_collect_device_data();
+
+ // (2) Now launch a kernel. HIP loads the device code object that carries
+ // the __llvm_profile_sections bounds table, executes our kernel, and
+ // populates the device-side counters.
+ int *d = nullptr;
+ if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+ return 2;
+ int h = 5;
+ (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+ post_collect_kernel<<<1, 1>>>(d);
+ (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ // (3) Exit normally. The atexit drain runs and -- because step (1) did
+ // not latch DrainCompleted -- it walks the (now loaded) code object,
+ // finds __llvm_profile_sections, and emits the device .profraw.
+ return h > 0 ? 0 : 1;
+}
+
+// CHECK-DAG: post_collect_kernel
+// CHECK-DAG: main
diff --git a/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
new file mode 100644
index 0000000000000..a154308d725d8
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-no-kernel.hip
@@ -0,0 +1,44 @@
+// Independence / robustness: an instrumented HIP program that never launches a
+// kernel still writes its host .profraw, and the device drain is a clean no-op
+// (no crash, no spurious device .profraw). We assert the no-op condition
+// directly via the runtime's verbose log rather than rely on HIP lazy-loading
+// to leave the device code object unloaded -- the loader may load it for
+// other reasons (e.g. eager registration), and in that case the drain
+// legitimately walks it and reports zero instrumented sections / zero
+// drained. Either outcome is correct.
+//
+// REQUIRES: hip, amdgpu
+// The terminal conditions checked below ("no GPU agents", "no loaded
+// segments", "drained=0") are Linux HSA-drain strings with no Windows analog.
+// UNSUPPORTED: windows
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: LLVM_PROFILE_VERBOSE=1 \
+// RUN: %t.dir/a.out 2> %t.dir/verbose.log
+// RUN: ls %t.dir/host.*.profraw
+// No arch-prefixed device profraw should have been produced.
+// RUN: not ls %t.dir/gfx*.profraw
+// The drain must have run; one of these three terminal conditions must hold:
+// - no GPU agents enumerated (test host has /dev/kfd but no usable agent)
+// - no loaded code object segments at exit
+// - the walk completed and drained=0 (no instrumented kernel was launched
+// so the device code object either wasn't loaded or its bounds were
+// empty/already drained)
+// RUN: FileCheck --input-file=%t.dir/verbose.log %s
+// CHECK: {{no GPU agents present|no loaded segments|drained=0}}
+
+#include <hip/hip_runtime.h>
+
+// Defined but never launched.
+__global__ void unused(int *p) { *p += 1; }
+
+int main() {
+ int n = 0;
+ (void)hipGetDeviceCount(&n);
+ return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/device-symbols.hip b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
new file mode 100644
index 0000000000000..f12283b7da636
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/device-symbols.hip
@@ -0,0 +1,42 @@
+// The decoupled drain reads only the canonical __llvm_profile_sections bounds
+// table provided by the device profile runtime (InstrProfilingPlatformGPU.c),
+// since clang no longer emits a per-TU struct. Assert that symbol is present
+// in the device ELF's dynamic symbol table (protected visibility) for both
+// non-RDC and RDC device compiles. This is the contract the drainer depends on.
+//
+// REQUIRES: hip, amdgpu
+
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+
+// --- non-RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// Extraction failure here would make the readelf invocation succeed against
+// an empty/missing file; surface it instead of hiding it behind `|| true`.
+// RUN: llvm-objdump --offloading %t.dir/a.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/a.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN: | FileCheck %s
+
+// --- RDC ---
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fgpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/b.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: llvm-objdump --offloading %t.dir/b.out > /dev/null
+// RUN: llvm-readelf --dyn-syms %t.dir/b.out.0.hip-amdgcn-amd-amdhsa--*gfx* \
+// RUN: | FileCheck %s
+
+// CHECK: PROTECTED {{.*}} __llvm_profile_sections
+
+#include <hip/hip_runtime.h>
+
+__global__ void k(int *p) { *p += 1; }
+
+int main() {
+ int *d = nullptr;
+ if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+ return 2;
+ k<<<1, 1>>>(d);
+ (void)hipFree(d);
+ return 0;
+}
diff --git a/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
new file mode 100644
index 0000000000000..5148dd6b9e2f2
--- /dev/null
+++ b/compiler-rt/test/profile/AMDGPU/lit.local.cfg.py
@@ -0,0 +1,4 @@
+# Device-profile drain tests: require an AMD GPU (and, implicitly, the amdgcn
+# device profile runtime in the resource directory and a ROCm/HIP install).
+if "amdgpu" not in config.available_features:
+ config.unsupported = True
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
new file mode 100644
index 0000000000000..8cbe7c970052c
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-basic.hip
@@ -0,0 +1,51 @@
+// Test basic HIP PGO instrumentation and profile collection.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: ls %t.dir/prof.profraw
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN: | FileCheck %s --check-prefix=PROF
+//
+// PROF: _Z6squarePiPKii
+// PROF: main
+// PROF: Functions shown: 2
+// PROF: Total functions: 2
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void square(int *out, const int *in, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n)
+ out[idx] = in[idx] * in[idx];
+}
+
+int main() {
+ constexpr int N = 64;
+ int h_in[N], h_out[N];
+ for (int i = 0; i < N; ++i) h_in[i] = i;
+
+ int *d_in, *d_out;
+ (void)hipMalloc(&d_in, N * sizeof(int));
+ (void)hipMalloc(&d_out, N * sizeof(int));
+ (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+
+ square<<<1, N>>>(d_out, d_in, N);
+
+ (void)hipMemcpy(h_out, d_out, N * sizeof(int), hipMemcpyDeviceToHost);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i)
+ if (h_out[i] != i * i) ok = 0;
+
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ (void)hipFree(d_in);
+ (void)hipFree(d_out);
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
new file mode 100644
index 0000000000000..5a2393f8dcc47
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip
@@ -0,0 +1,63 @@
+// Explicit-collect idempotency: a program that calls
+// __llvm_profile_hip_collect_device_data() itself *after* a launch (e.g. a
+// periodic profile dump) and then also exits normally must not double-count the
+// device counters. The explicit drain and the atexit drain write the same
+// arch-named device profraw, so the merged profile must reflect a single launch
+// (function count 64, even-branch 32), not two. Complements device-early-collect
+// (which covers a collect *before* the first launch).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN: | FileCheck %s
+//
+// A single launch of 64 threads, drained twice (explicit + atexit), must still
+// merge to exactly one launch's worth of counts.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+extern "C" int __llvm_profile_hip_collect_device_data(void);
+
+__global__ void classify(int *out, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx >= n)
+ return;
+ if (idx % 2 == 0)
+ out[idx] = 1;
+ else
+ out[idx] = 0;
+}
+
+int main() {
+ constexpr int N = 64;
+ int *d = nullptr;
+ if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+ return 2;
+ classify<<<1, N>>>(d, N);
+ (void)hipDeviceSynchronize();
+
+ // Explicit mid-program collect after the launch. The atexit drain runs again
+ // at exit; together they must not double the device counters.
+ (void)__llvm_profile_hip_collect_device_data();
+
+ int h[N];
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int evens = 0;
+ for (int i = 0; i < N; ++i)
+ evens += h[i];
+ printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+ return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
new file mode 100644
index 0000000000000..c2bfc9ac9dc66
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip
@@ -0,0 +1,56 @@
+// Quantitative device-counter correctness: the drained device profile must carry
+// the *exact* per-region execution counts produced by the kernel, not merely
+// "some counts are present". A single launch of 64 threads over classify() must
+// record a function entry count of 64 and an even-branch block count of 32
+// (idx % 2 == 0 holds for exactly half of idx in [0, 64)). This pins the drain +
+// dedup path against silent under/over-counting (e.g. a dedup bug that dropped
+// or doubled a section would change these numbers).
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/host.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN: | FileCheck %s
+//
+// The device kernel ran with exactly 64 threads, all of which entered the
+// function; the even branch was taken 32 times and the early-return path 0.
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 64
+// CHECK: Block counts: [0, 32]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx >= n)
+ return;
+ if (idx % 2 == 0)
+ out[idx] = 1;
+ else
+ out[idx] = 0;
+}
+
+int main() {
+ constexpr int N = 64;
+ int *d = nullptr;
+ if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+ return 2;
+ classify<<<1, N>>>(d, N);
+ (void)hipDeviceSynchronize();
+ int h[N];
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int evens = 0;
+ for (int i = 0; i < N; ++i)
+ evens += h[i];
+ printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+ return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
new file mode 100644
index 0000000000000..a867c30f0edfb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip
@@ -0,0 +1,51 @@
+// Test HIP coverage mapping produces source-level coverage for host code.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-cov report %t -instr-profile=%t.profdata 2>&1 \
+// RUN: | FileCheck %s --check-prefix=REPORT
+//
+// REPORT: instrprof-hip-coverage.hip
+// No coverage column should be fully uncovered. Anchor on a non-digit before
+// the "0.00%" so this does not spuriously match e.g. "80.00%".
+// REPORT-NOT: {{[^.0-9]0[.]00%}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int gpu_abs(int x) {
+ return x < 0 ? -x : x;
+}
+
+__global__ void abs_kernel(int *data, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n)
+ data[idx] = gpu_abs(data[idx]);
+}
+
+int main() {
+ constexpr int N = 16;
+ int h[N];
+ for (int i = 0; i < N; ++i)
+ h[i] = (i % 2 == 0) ? i : -i;
+
+ int *d;
+ (void)hipMalloc(&d, N * sizeof(int));
+ (void)hipMemcpy(d, h, N * sizeof(int), hipMemcpyHostToDevice);
+ abs_kernel<<<1, N>>>(d, N);
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i)
+ if (h[i] != i) ok = 0;
+
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
new file mode 100644
index 0000000000000..a24b28ec9af0a
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip
@@ -0,0 +1,67 @@
+// Test that device-side branching is captured in profile counters.
+// Exercises the classify-style pattern where different branches are taken
+// by different threads, verifying that counter values reflect actual execution.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN: | FileCheck %s --check-prefix=PROF
+//
+// Device functions should appear with non-zero counters. The __device__
+// classify() helper is inlined into the histogram kernel, so it does not get a
+// separate profile record; its branching is captured within the kernel's
+// counters instead.
+// PROF-DAG: _Z9histogramPKiPii
+// PROF-DAG: main
+// PROF: Total functions: 2
+// PROF: Maximum function count: {{[1-9][0-9]*}}
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__device__ int classify(int x) {
+ if (x > 100) return 2;
+ else if (x > 0) return 1;
+ else return 0;
+}
+
+__global__ void histogram(const int *input, int *bins, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n) {
+ int cls = classify(input[idx]);
+ atomicAdd(&bins[cls], 1);
+ }
+}
+
+int main() {
+ constexpr int N = 256;
+ constexpr int NBINS = 3;
+
+ int h_in[N], h_bins[NBINS] = {};
+ for (int i = 0; i < N; ++i)
+ h_in[i] = (i % 3 == 0) ? -1 : (i % 3 == 1) ? 50 : 200;
+
+ int *d_in, *d_bins;
+ (void)hipMalloc(&d_in, N * sizeof(int));
+ (void)hipMalloc(&d_bins, NBINS * sizeof(int));
+ (void)hipMemcpy(d_in, h_in, N * sizeof(int), hipMemcpyHostToDevice);
+ (void)hipMemset(d_bins, 0, NBINS * sizeof(int));
+
+ histogram<<<1, N>>>(d_in, d_bins, N);
+
+ (void)hipMemcpy(h_bins, d_bins, NBINS * sizeof(int), hipMemcpyDeviceToHost);
+ printf("bins: [%d, %d, %d]\n", h_bins[0], h_bins[1], h_bins[2]);
+
+ int ok = (h_bins[0] > 0 && h_bins[1] > 0 && h_bins[2] > 0);
+ printf("%s\n", ok ? "PASS" : "FAIL");
+
+ (void)hipFree(d_in);
+ (void)hipFree(d_bins);
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
new file mode 100644
index 0000000000000..c79cf568f88bc
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip
@@ -0,0 +1,61 @@
+// Fork safety: loading the profile-instrumented library must NOT initialize
+// HSA in a process that itself never touches HIP and only runs device work in
+// forked children. RCCL's unit tests follow exactly this pattern -- the parent
+// deliberately keeps HIP/HSA uninitialized and launches kernels only inside
+// forked children. If the profile runtime's library constructor eagerly called
+// hsa_init(), the child would inherit invalid HSA state across fork() and crash
+// inside HSA (e.g. hsa_amd_signal_create -> SharedSignalPool::alloc). The HSA
+// drain therefore brings HSA up lazily, never from a constructor.
+//
+// REQUIRES: hip, amdgpu
+// The eager-hsa_init fork hazard and the lazy HSA drain are Linux-only.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: %run %t 2>&1 | FileCheck %s
+//
+// The forked child must complete its kernel without crashing in HSA.
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+#include <sys/wait.h>
+#include <unistd.h>
+
+__global__ void increment(int *p) { *p += 1; }
+
+static int doChildWork() {
+ int *d = nullptr;
+ if (hipMalloc(&d, sizeof(int)) != hipSuccess)
+ return 1;
+ int h = 41;
+ (void)hipMemcpy(d, &h, sizeof(int), hipMemcpyHostToDevice);
+ increment<<<1, 1>>>(d);
+ if (hipDeviceSynchronize() != hipSuccess)
+ return 1;
+ (void)hipMemcpy(&h, d, sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+ return h == 42 ? 0 : 1;
+}
+
+int main() {
+ // The parent intentionally performs no HIP/HSA work before forking.
+ pid_t pid = fork();
+ if (pid < 0) {
+ printf("FAIL (fork failed)\n");
+ return 1;
+ }
+ if (pid == 0) {
+ // Child runs the device work; _exit avoids flushing the parent's profile
+ // handlers from the child (the RCCL test pattern).
+ _exit(doChildWork());
+ }
+
+ int status = 0;
+ (void)waitpid(pid, &status, 0);
+ int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ return ok ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
new file mode 100644
index 0000000000000..6a99546d34bdb
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip
@@ -0,0 +1,57 @@
+// Test that HIP PGO works on multi-GPU systems. The kernel runs on the default
+// device, so the host-shadow drain (guarded by upstream's launch tracking)
+// collects only that device and the supplemental HSA agent-walk then finds the
+// same code object and dedups it out. The point of the test is that neither
+// pass crashes or hangs reading a non-resident device on a host with several
+// GPUs (the failure mode that the launch tracking + HSA residency walk fix).
+//
+// REQUIRES: hip, amdgpu
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain collects only the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The host-shadow pass drains the launched device, the HSA walk finds that same
+// code object and dedups it (drained=0), and the program does not crash.
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void add_one(int *data, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n)
+ data[idx] += 1;
+}
+
+int main() {
+ int ndev = 0;
+ (void)hipGetDeviceCount(&ndev);
+
+ constexpr int N = 32;
+ int h_data[N];
+ for (int i = 0; i < N; ++i) h_data[i] = i;
+
+ int *d_data;
+ (void)hipMalloc(&d_data, N * sizeof(int));
+ (void)hipMemcpy(d_data, h_data, N * sizeof(int), hipMemcpyHostToDevice);
+
+ add_one<<<1, N>>>(d_data, N);
+
+ (void)hipMemcpy(h_data, d_data, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d_data);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i)
+ if (h_data[i] != i + 1) ok = 0;
+
+ printf("%s (devices=%d)\n", ok ? "PASS" : "FAIL", ndev);
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
new file mode 100644
index 0000000000000..8cf1258a63535
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip
@@ -0,0 +1,63 @@
+// Multi-process accumulation: device counters from several independent runs of an
+// instrumented HIP program must accumulate when their profraws are offline-merged
+// (llvm-profdata merge), the common real-world feedback-collection path. Three
+// runs of a 64-thread launch must sum to a function count of 192 and an
+// even-branch block count of 96 (3 x 64 / 3 x 32).
+//
+// Note: on-the-fly merge-pooling via LLVM_PROFILE_FILE=...%m does NOT currently
+// accumulate *device* counters -- the device profraw is rewritten per process
+// rather than merged in place -- so each process must write a distinct file
+// (here via %p) and the accumulation is done by llvm-profdata merge.
+//
+// REQUIRES: hip, amdgpu
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: %clang -x hip --offload-arch=%amdgpu_arch -fno-gpu-rdc \
+// RUN: -fprofile-instr-generate -fcoverage-mapping %s -o %t.dir/a.out \
+// RUN: -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run1.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run2.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: env LLVM_PROFILE_FILE=%t.dir/run3.%%p.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %t.dir/a.out
+// RUN: llvm-profdata merge %t.dir/*.profraw -o %t.dir/a.profdata
+// RUN: llvm-profdata show --all-functions --counts %t.dir/a.profdata \
+// RUN: | FileCheck %s
+//
+// CHECK: _Z8classifyPii:
+// CHECK: Function count: 192
+// CHECK: Block counts: [0, 96]
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void classify(int *out, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx >= n)
+ return;
+ if (idx % 2 == 0)
+ out[idx] = 1;
+ else
+ out[idx] = 0;
+}
+
+int main() {
+ constexpr int N = 64;
+ int *d = nullptr;
+ if (hipMalloc(&d, N * sizeof(int)) != hipSuccess)
+ return 2;
+ classify<<<1, N>>>(d, N);
+ (void)hipDeviceSynchronize();
+ int h[N];
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int evens = 0;
+ for (int i = 0; i < N; ++i)
+ evens += h[i];
+ printf("%s\n", evens == 32 ? "PASS" : "FAIL");
+ return evens == 32 ? 0 : 1;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
new file mode 100644
index 0000000000000..0fd6185b82441
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip
@@ -0,0 +1,58 @@
+// Test PGO with multiple kernel launches from a single TU.
+// Verifies that counters from all device functions are collected correctly.
+//
+// REQUIRES: hip, amdgpu
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+// RUN: llvm-profdata show --all-functions %t.profdata \
+// RUN: | FileCheck %s --check-prefix=PROF
+//
+// All three kernels plus main should be profiled.
+// PROF-DAG: _Z4fillPii
+// PROF-DAG: _Z5scalePii
+// PROF-DAG: _Z6negatePii
+// PROF-DAG: main
+// PROF: Total functions: 4
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val) {
+ data[threadIdx.x] = val;
+}
+
+__global__ void scale(int *data, int factor) {
+ data[threadIdx.x] *= factor;
+}
+
+__global__ void negate(int *data, int n) {
+ int idx = threadIdx.x;
+ if (idx < n)
+ data[idx] = -data[idx];
+}
+
+int main() {
+ constexpr int N = 16;
+ int h[N];
+ int *d;
+ (void)hipMalloc(&d, N * sizeof(int));
+
+ fill<<<1, N>>>(d, 5);
+ scale<<<1, N>>>(d, 3);
+ negate<<<1, N>>>(d, N);
+
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i)
+ if (h[i] != -15) ok = 0;
+
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
new file mode 100644
index 0000000000000..5d3dea671047b
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip
@@ -0,0 +1,60 @@
+// Test PGO when the kernel runs on a non-default device (here the program
+// selects device 1). Upstream's launch tracking records that device 1 was used,
+// so the host-shadow drain skips the other devices and collects device 1, and
+// the supplemental HSA agent-walk then finds that same code object and dedups it
+// out. This exercises both that the correct device is drained and that an
+// unused device is never read (which would fault/hang on a multi-GPU host).
+//
+// REQUIRES: hip, amdgpu, multi-device
+// The "walk complete" / dedup notes are Linux-only HSA-drain strings; the
+// Windows host-shadow drain only collects the current device.
+// UNSUPPORTED: windows
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t -L%hip_lib_path -lamdhip64
+// RUN: env LLVM_PROFILE_FILE=%t.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: LLVM_PROFILE_VERBOSE=1 %run %t 2>&1 | FileCheck %s
+//
+// The launched device (1) is drained, the unused default device is skipped, and
+// the HSA walk finds the same code object and dedups it (drained=0).
+// CHECK: Skipping unused device 0
+// CHECK: Collecting static profile data from device 1
+// CHECK: Copied device sections:
+// CHECK: device bounds already drained, skipping
+// CHECK: walk complete: agents={{[0-9]+}} pairs={{[0-9]+}} found={{[1-9][0-9]*}} drained={{[0-9]+}}
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void fill(int *data, int val, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n)
+ data[idx] = val;
+}
+
+int main() {
+ int ndev = 0;
+ (void)hipGetDeviceCount(&ndev);
+ if (ndev < 2) {
+ printf("PASS (skipped: only %d device)\n", ndev);
+ return 0;
+ }
+
+ (void)hipSetDevice(1);
+
+ constexpr int N = 32;
+ int h[N] = {};
+ int *d;
+ (void)hipMalloc(&d, N * sizeof(int));
+ fill<<<1, N>>>(d, 99, N);
+ (void)hipMemcpy(h, d, N * sizeof(int), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i)
+ if (h[i] != 99) ok = 0;
+
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
new file mode 100644
index 0000000000000..9a8a8187f8e77
--- /dev/null
+++ b/compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip
@@ -0,0 +1,63 @@
+// Test the full PGO cycle: instrument, collect, merge, optimize.
+// Verifies that the optimized binary produces correct output and that
+// profile data is consumed without errors.
+//
+// REQUIRES: hip, amdgpu
+//
+// Step 1: Build instrumented binary.
+// RUN: %clang -x hip -fprofile-instr-generate -fcoverage-mapping \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t.instr \
+// RUN: -L%hip_lib_path -lamdhip64
+//
+// Step 2: Run to collect profile data.
+// RUN: rm -rf %t.dir && mkdir -p %t.dir
+// RUN: env LLVM_PROFILE_FILE=%t.dir/prof.profraw \
+// RUN: LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t.instr 2>&1 | FileCheck %s
+//
+// Step 3: Merge profile data.
+// RUN: llvm-profdata merge -o %t.profdata %t.dir/
+//
+// Step 4: Build optimized binary with profile data.
+// RUN: %clang -x hip -fprofile-instr-use=%t.profdata \
+// RUN: --offload-arch=%amdgpu_arch %s -o %t.opt \
+// RUN: -L%hip_lib_path -lamdhip64 -O2
+//
+// Step 5: Run optimized binary.
+// RUN: env LD_LIBRARY_PATH=%hip_lib_path:$LD_LIBRARY_PATH \
+// RUN: HIP_VISIBLE_DEVICES=0 %run %t.opt 2>&1 | FileCheck %s
+//
+// CHECK: PASS
+
+#include <hip/hip_runtime.h>
+#include <cstdio>
+
+__global__ void scale(float *data, float factor, int n) {
+ int idx = blockIdx.x * blockDim.x + threadIdx.x;
+ if (idx < n)
+ data[idx] *= factor;
+}
+
+int main() {
+ constexpr int N = 128;
+ float h[N];
+ for (int i = 0; i < N; ++i) h[i] = (float)i;
+
+ float *d;
+ (void)hipMalloc(&d, N * sizeof(float));
+ (void)hipMemcpy(d, h, N * sizeof(float), hipMemcpyHostToDevice);
+
+ scale<<<1, N>>>(d, 2.0f, N);
+
+ (void)hipMemcpy(h, d, N * sizeof(float), hipMemcpyDeviceToHost);
+ (void)hipFree(d);
+
+ int ok = 1;
+ for (int i = 0; i < N; ++i) {
+ float expected = (float)(i * 2);
+ if (h[i] != expected) ok = 0;
+ }
+
+ printf("%s\n", ok ? "PASS" : "FAIL");
+ return !ok;
+}
diff --git a/compiler-rt/test/profile/device-pgo/README.md b/compiler-rt/test/profile/device-pgo/README.md
new file mode 100644
index 0000000000000..4338c637abe49
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/README.md
@@ -0,0 +1,125 @@
+# HIP device PGO / code coverage: standalone build & test recipe
+
+This directory provides a CMake-based recipe to build and exercise HIP device
+profile-guided optimization (PGO) and source-based code coverage **outside
+TheRock**, using only an `llvm-project` checkout plus a ROCm runtime.
+
+It builds, in one configure:
+
+- the host toolchain (`clang`, `clang++`, `lld`, `llvm-profdata`, `llvm-cov`)
+ and the lit-lite test utilities (`FileCheck`, `not`);
+- the host ROCm drain runtime `clang_rt.profile_rocm` (opt-in,
+ `COMPILER_RT_BUILD_PROFILE_ROCM=ON`). It runs the upstream host-shadow drain on
+ all platforms; on **Linux** `InstrProfilingPlatformROCm.cpp` additionally runs a
+ supplemental HSA-introspection pass (with content-dedup) to collect device code
+ objects that have no host shadow (e.g. device-linked/RCCL kernels);
+- the **amdgcn device** profile runtime `libclang_rt.profile.a` (the baremetal
+ profile subset that provides `__llvm_profile_instrument_gpu` and the
+ `__llvm_profile_sections` bounds table), built for the `amdgcn-amd-amdhsa`
+ runtime target with LLVM libc for amdgcn.
+
+## Why a separate library
+
+Upstream relands HIP offload PGO runtime support as the **opt-in**
+`clang_rt.profile_rocm` (llvm#201606), a `/MD` superset of `clang_rt.profile`;
+the base library stays unchanged. The driver links `clang_rt.profile_rocm`
+ahead of `clang_rt.profile` for HIP host links when profiling is requested
+(see `clang/lib/Driver/ToolChains/{Linux,MSVC}.cpp`). This recipe just turns
+the option on and builds the matching amdgcn device runtime.
+
+## Prerequisites
+
+- A ROCm installation (for `libamdhip64` and, on Linux, `libhsa-runtime64`),
+ e.g. `/opt/rocm`. Export `ROCM_PATH`.
+- An AMD GPU visible to the runtime for the *run* step (the build step does
+ not need a GPU). `amdgpu-arch` should list your device(s).
+- Ninja, a host C/C++ compiler, and Python 3.
+
+## Build
+
+```bash
+export ROCM_PATH=/opt/rocm
+./build.sh # builds into <repo>/build/device-pgo
+# or: ./build.sh /path/to/builddir
+```
+
+Key outputs under the build dir:
+
+```
+bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+```
+
+See `toolchain-cache.cmake` for the exact CMake variables, including the
+`LLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa"` split.
+
+## Run the tests
+
+The lit-lite runner (`../run_gpu_tests.py`) compiles each `.hip` test with the
+just-built toolchain, runs it on the GPU, and pipes output through `FileCheck`.
+It auto-detects features (`multi-device` via `amdgpu-arch`) so tests that need
+two visible GPUs are skipped on single-GPU hosts.
+
+```bash
+python3 ../run_gpu_tests.py \
+ --toolchain-bin "$PWD/<builddir>/bin" \
+ --hip-lib-path "$ROCM_PATH/lib" \
+ ../GPU ../AMDGPU
+```
+
+`--toolchain-bin` must be an **absolute** path (the runner executes each RUN
+line from a temp directory). With the toolchain's `amdgpu-arch`/`offload-arch`
+on hand, `--offload-arch=native` resolves automatically and the `multi-device`
+feature is enabled when 2+ GPUs are visible (so multi-GPU tests run on a
+multi-GPU host and are skipped otherwise). On a multi-gfx90a host this suite is
+15 passed, 0 failed.
+
+### Coverage notes / known gaps
+
+- Quantitative device-counter correctness (`instrprof-hip-counter-correctness`),
+ multi-process offline accumulation (`instrprof-hip-multi-process-merge`) and
+ explicit-collect idempotency (`instrprof-hip-collect-after`) pin exact device
+ counter values, so a dedup or drain regression that drops/doubles a section is
+ caught.
+- `LLVM_PROFILE_FILE=...%m` on-the-fly merge-pooling does **not** accumulate
+ *device* counters today (the device profraw is rewritten per process rather
+ than merged in place); multi-process accumulation must go through
+ `llvm-profdata merge` of distinct per-process files.
+- There is no in-tree test that drains a code object with **no** host shadow in
+ isolation (the pure device-linked/RCCL case the HSA pass uniquely handles): it
+ requires a real device-side library build (the profile runtime linked into the
+ device image), which is not expressible in the lit-lite harness via the clang
+ driver. The dedup tests do prove the HSA pass finds and dedups the same code
+ objects the host-shadow pass drains; validating the no-host-shadow drain needs
+ an actual RCCL-style binary in downstream CI.
+
+## Manual workflow (for reference)
+
+```bash
+CLANG=<builddir>/bin/clang++
+# 1. Instrumented build (host + device).
+$CLANG -O2 -fprofile-instr-generate -fcoverage-mapping \
+ --offload-arch=gfx1100 -xhip app.hip -o app
+
+# 2. Run. Produces a host .profraw and a device
+# <name>.amdgcn-amd-amdhsa.<arch>.profraw drained by clang_rt.profile_rocm.
+LLVM_PROFILE_FILE='app-%p.profraw' ./app
+
+# 3. Merge (device profiles are merged per GPU arch).
+<builddir>/bin/llvm-profdata merge -o app.profdata app-*.profraw
+
+# 4. Coverage report (device).
+<builddir>/bin/llvm-cov show ./app -instr-profile=app.profdata
+```
+
+## Notes / environment-specific knobs
+
+- `--offload-arch` must match your GPU; the amdgcn device runtime is target
+ generic but the app's device code is per-arch. The build installs
+ `offload-arch` (and the `amdgpu-arch` alias) into `<builddir>/bin`, so
+ `--offload-arch=native` works without a system ROCm `amdgpu-arch`.
+- The amdgcn runtime target requires LLVM libc for amdgcn; if your environment
+ cannot build it, drop `libc` from
+ `RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES` only if your headers are
+ otherwise provided.
diff --git a/compiler-rt/test/profile/device-pgo/build.sh b/compiler-rt/test/profile/device-pgo/build.sh
new file mode 100755
index 0000000000000..edf90f42fb8c1
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/build.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# Standalone (non-TheRock) build of the toolchain + host/device runtimes used by
+# the HIP device-PGO / code-coverage tests. See toolchain-cache.cmake and
+# README.md for details.
+#
+# ./build.sh [BUILD_DIR]
+#
+# Env knobs:
+# LLVM_SRC path to the llvm-project checkout (default: repo root inferred
+# from this script's location)
+# JOBS parallelism for ninja (default: nproc)
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# .../compiler-rt/test/profile/device-pgo -> repo root is four levels up.
+LLVM_SRC="${LLVM_SRC:-$(cd "${SCRIPT_DIR}/../../../.." && pwd)}"
+BUILD_DIR="${1:-${LLVM_SRC}/build/device-pgo}"
+JOBS="${JOBS:-$(nproc)}"
+
+echo "llvm-project source : ${LLVM_SRC}"
+echo "build directory : ${BUILD_DIR}"
+echo "parallel jobs : ${JOBS}"
+
+cmake -G Ninja \
+ -S "${LLVM_SRC}/llvm" \
+ -B "${BUILD_DIR}" \
+ -C "${SCRIPT_DIR}/toolchain-cache.cmake"
+
+# The 'clang' target also produces the clang++ symlink. The offload toolchain
+# tools (clang-offload-bundler, clang-linker-wrapper, llvm-link,
+# llvm-offload-binary) and offload-arch (also installed as amdgpu-arch) are
+# needed to compile/link a HIP program and to resolve --offload-arch=native /
+# the multi-device test feature. 'runtimes' builds both the host (default) and
+# amdgcn device runtime targets.
+ninja -C "${BUILD_DIR}" -j "${JOBS}" \
+ clang lld \
+ clang-offload-bundler clang-linker-wrapper llvm-link llvm-offload-binary \
+ offload-arch \
+ llvm-profdata llvm-cov FileCheck not \
+ runtimes
+
+cat <<EOF
+
+Build complete.
+
+Toolchain bin : ${BUILD_DIR}/bin
+Run the GPU tests with, e.g.:
+
+ python3 ${SCRIPT_DIR}/../run_gpu_tests.py \\
+ --toolchain-bin ${BUILD_DIR}/bin \\
+ --hip-lib-path \${ROCM_PATH:-/opt/rocm}/lib \\
+ ${SCRIPT_DIR}/../GPU ${SCRIPT_DIR}/../AMDGPU
+
+(--toolchain-bin must be an absolute path; the runner executes RUN lines from a
+temp dir. See README.md for more.)
+EOF
diff --git a/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
new file mode 100644
index 0000000000000..f48656c66d82d
--- /dev/null
+++ b/compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
@@ -0,0 +1,55 @@
+# CMake cache for a standalone (non-TheRock) build of everything needed to
+# compile, run, and FileCheck the HIP device-PGO / code-coverage tests under
+# compiler-rt/test/profile/{GPU,AMDGPU}.
+#
+# It produces, in a single configure:
+# * the host toolchain: clang, clang++, lld, llvm-profdata, llvm-cov, plus the
+# test utilities FileCheck and not (LLVM_INSTALL_UTILS);
+# * the host ROCm drain runtime clang_rt.profile_rocm (opt-in, links the
+# sanitizer interception object libs -- hence COMPILER_RT_BUILD_SANITIZERS);
+# * the amdgcn device profile runtime libclang_rt.profile.a (the baremetal
+# profile subset providing __llvm_profile_instrument_gpu and the
+# __llvm_profile_sections bounds table), built for the amdgcn-amd-amdhsa
+# runtime target via compiler-rt/cmake/caches/AMDGPU.cmake. Building the
+# device runtime requires LLVM libc for amdgcn, so libc is enabled for that
+# runtime target.
+#
+# Usage (see ./build.sh for a wrapper):
+# cmake -G Ninja -S llvm -B build/device-pgo \
+# -C compiler-rt/test/profile/device-pgo/toolchain-cache.cmake
+# ninja -C build/device-pgo clang lld clang-offload-bundler \
+# clang-linker-wrapper llvm-link llvm-offload-binary offload-arch \
+# llvm-profdata llvm-cov FileCheck not runtimes
+#
+# Outputs (under build/device-pgo):
+# bin/{clang,clang++,lld,llvm-profdata,llvm-cov,FileCheck,not}
+# lib/clang/<ver>/lib/<host-triple>/libclang_rt.profile_rocm.a
+# lib/clang/<ver>/lib/amdgcn-amd-amdhsa/libclang_rt.profile.a
+
+set(CMAKE_BUILD_TYPE Release CACHE STRING "")
+
+set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "")
+set(LLVM_ENABLE_RUNTIMES "compiler-rt" CACHE STRING "")
+set(LLVM_TARGETS_TO_BUILD "host;AMDGPU" CACHE STRING "")
+set(LLVM_ENABLE_PER_TARGET_RUNTIME_DIR ON CACHE BOOL "")
+set(LLVM_ENABLE_ASSERTIONS ON CACHE BOOL "")
+
+set(CLANG_DEFAULT_LINKER "lld" CACHE STRING "")
+set(CLANG_DEFAULT_RTLIB "compiler-rt" CACHE STRING "")
+
+# Make FileCheck / not available in the install/bin tree for the lit-lite runner.
+set(LLVM_INSTALL_UTILS ON CACHE BOOL "")
+
+# Build host (default) and device (amdgcn) runtimes in one tree.
+set(LLVM_RUNTIME_TARGETS "default;amdgcn-amd-amdhsa" CACHE STRING "")
+
+# Host runtimes: turn on the opt-in ROCm host drain library. It pulls in the
+# sanitizer interception object libs, so sanitizers must be built too.
+set(RUNTIMES_default_COMPILER_RT_BUILD_PROFILE_ROCM ON CACHE BOOL "")
+set(RUNTIMES_default_COMPILER_RT_BUILD_SANITIZERS ON CACHE BOOL "")
+
+# Device runtime: the amdgcn baremetal profile subset, built with LLVM libc for
+# amdgcn (freestanding C headers).
+set(RUNTIMES_amdgcn-amd-amdhsa_CACHE_FILES
+ "${CMAKE_SOURCE_DIR}/../compiler-rt/cmake/caches/AMDGPU.cmake" CACHE STRING "")
+set(RUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES "compiler-rt;libc" CACHE STRING "")
diff --git a/compiler-rt/test/profile/run_gpu_tests.py b/compiler-rt/test/profile/run_gpu_tests.py
new file mode 100644
index 0000000000000..d339969ba245e
--- /dev/null
+++ b/compiler-rt/test/profile/run_gpu_tests.py
@@ -0,0 +1,380 @@
+#!/usr/bin/env python3
+"""Minimal lit-style runner for the HIP device-PGO tests.
+
+The compiler-rt profile lit suite (and llvm-lit / FileCheck) is not part of the
+installed ROCm artifact, but the toolchain, the amdgcn device profile runtime,
+and the HIP runtime are. This runner executes the
+``compiler-rt/test/profile/{GPU,AMDGPU}/*.hip`` tests directly against an
+installed toolchain on a real GPU runner, interpreting just the slice of lit
+markup those tests use:
+
+ - ``// REQUIRES:`` / ``// UNSUPPORTED:`` boolean feature gating,
+ - ``// RUN:`` lines (with ``\\`` continuations) and the fixed substitution set
+ (%clang, %s, %t[.*], %amdgpu_arch, %hip_lib_path, %run, %%),
+ - delegation to ``FileCheck`` / ``not`` (real binaries if present on PATH,
+ otherwise shims backed by the ``filecheck`` PyPI package and a tiny
+ exit-code inverter).
+
+Each RUN line is executed via ``bash -e -o pipefail -c`` so pipes, redirection
+and globbing behave as under lit. A test passes iff all its RUN lines exit 0.
+"""
+
+import argparse
+import os
+import re
+import shutil
+import stat
+import subprocess
+import sys
+import tempfile
+from pathlib import Path
+
+# --- feature detection ------------------------------------------------------
+
+
+def _count_visible_gpus(toolchain_bin):
+ """Number of GPUs actually visible to the runtime, or 0 if unknown.
+
+ Uses the toolchain's ``amdgpu-arch`` (one line per visible device). Unlike
+ the KFD topology under ``/sys/class/kfd`` this reflects what HIP/ROCr really
+ exposes -- it honours ``ROCR_VISIBLE_DEVICES`` / ``HIP_VISIBLE_DEVICES`` and
+ container device limits, so it matches what a test's ``hipGetDeviceCount``
+ will see. It is also portable: Windows has no ``/dev/kfd``, but does ship
+ ``amdgpu-arch``.
+ """
+ if not toolchain_bin:
+ return 0
+ tb = Path(toolchain_bin)
+ exe = next(
+ (str(tb / c) for c in ("amdgpu-arch", "amdgpu-arch.exe") if (tb / c).exists()),
+ None,
+ )
+ if exe is None:
+ return 0
+ try:
+ proc = subprocess.run(exe, capture_output=True, text=True, timeout=60)
+ except (OSError, subprocess.SubprocessError):
+ return 0
+ if proc.returncode != 0:
+ return 0
+ return sum(1 for line in proc.stdout.splitlines() if line.strip())
+
+
+def detect_features(toolchain_bin=None, force=None):
+ """Return the set of lit features available on this runner.
+
+ hip/amdgpu are assumed present (this runner only ever drives GPU tests on a
+ runner that has the toolchain + HIP). ``multi-device`` is derived from the
+ number of GPUs the runtime actually exposes (>= 2), via ``amdgpu-arch``.
+ """
+ features = {"hip", "amdgpu"}
+ if sys.platform.startswith("linux"):
+ features.add("linux")
+ elif sys.platform.startswith("win"):
+ features.add("windows")
+
+ if _count_visible_gpus(toolchain_bin) >= 2:
+ features.add("multi-device")
+
+ if force:
+ for f in force:
+ features.add(f)
+ return features
+
+
+# --- boolean expression evaluation (REQUIRES / UNSUPPORTED) ------------------
+
+_TOKEN_RE = re.compile(r"\s*(\(|\)|\|\||&&|!|[\w.+-]+)\s*")
+
+
+def _clause_to_py(clause):
+ out = []
+ for tok in _TOKEN_RE.findall(clause):
+ if tok == "||":
+ out.append(" or ")
+ elif tok == "&&":
+ out.append(" and ")
+ elif tok == "!":
+ out.append(" not ")
+ elif tok in ("(", ")"):
+ out.append(tok)
+ elif tok == "true":
+ out.append("True")
+ elif tok == "false":
+ out.append("False")
+ else:
+ out.append("(%r in FEATURES)" % tok)
+ return "".join(out) or "True"
+
+
+def eval_requires(expr, features):
+ """All comma-separated clauses must be true."""
+ return all(
+ eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+ for c in expr.split(",")
+ if c.strip()
+ )
+
+
+def eval_unsupported(expr, features):
+ """Unsupported if any comma-separated clause is true."""
+ return any(
+ eval(_clause_to_py(c), {"__builtins__": {}}, {"FEATURES": features})
+ for c in expr.split(",")
+ if c.strip()
+ )
+
+
+# --- test parsing -----------------------------------------------------------
+
+_DIRECTIVE_RE = re.compile(r"(?://|#)\s*(RUN|REQUIRES|UNSUPPORTED):\s?(.*)")
+
+
+def parse_test(path):
+ """Return (run_lines, requires, unsupported) for a test file."""
+ runs, requires, unsupported = [], [], []
+ cont = None
+ for raw in Path(path).read_text(errors="replace").splitlines():
+ m = _DIRECTIVE_RE.search(raw)
+ if cont is not None:
+ # Continuation of a previous RUN line.
+ text = raw
+ cm = re.search(r"(?://|#)\s*RUN:\s?(.*)", raw)
+ if cm:
+ text = cm.group(1)
+ cont += " " + text.strip()
+ if cont.rstrip().endswith("\\"):
+ cont = cont.rstrip()[:-1]
+ else:
+ runs.append(cont)
+ cont = None
+ continue
+ if not m:
+ continue
+ kind, body = m.group(1), m.group(2)
+ if kind == "REQUIRES":
+ requires.append(body.strip())
+ elif kind == "UNSUPPORTED":
+ unsupported.append(body.strip())
+ elif kind == "RUN":
+ if body.rstrip().endswith("\\"):
+ cont = body.rstrip()[:-1]
+ else:
+ runs.append(body)
+ return runs, requires, unsupported
+
+
+# --- substitutions ----------------------------------------------------------
+
+
+def make_substitutions(clang, clangxx, src, tprefix, arch, hip_lib_path):
+ # Order matters: longer / more specific tokens first; %% resolved last.
+ return [
+ ("%clangxx", clangxx),
+ ("%clang", clang),
+ ("%amdgpu_arch", arch),
+ ("%hip_lib_path", hip_lib_path),
+ ("%run ", ""),
+ ("%s", str(src)),
+ ("%t", tprefix),
+ ("%%", "%"),
+ ]
+
+
+def apply_substitutions(line, subs):
+ for token, value in subs:
+ line = line.replace(token, value)
+ return line
+
+
+# --- tool shims (FileCheck / not) -------------------------------------------
+
+
+def ensure_tools(toolchain_bin, workdir):
+ """Build a PATH that resolves clang/llvm-*, FileCheck and not.
+
+ Prefers real binaries under toolchain_bin; falls back to shims for FileCheck
+ (PyPI ``filecheck``) and ``not`` (exit-code inverter).
+ """
+ shim_dir = workdir / "shims"
+ shim_dir.mkdir(parents=True, exist_ok=True)
+ path = os.pathsep.join([str(toolchain_bin), str(shim_dir), os.environ.get("PATH", "")])
+
+ def have(tool):
+ # File-based check (shutil.which is quirky across OSes / Git Bash). The
+ # shims are extensionless bash scripts, which Git Bash resolves via the
+ # shebang, so a real binary is anything matching tool or tool.exe.
+ tb = Path(toolchain_bin)
+ return (tb / tool).exists() or (tb / (tool + ".exe")).exists()
+
+ def write_shim(name, body):
+ p = shim_dir / name
+ p.write_text(body)
+ p.chmod(p.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
+
+ if not have("FileCheck"):
+ write_shim(
+ "FileCheck",
+ '#!/usr/bin/env bash\n'
+ 'if command -v filecheck >/dev/null 2>&1; then exec filecheck "$@"; fi\n'
+ 'exec python3 -m filecheck "$@"\n',
+ )
+ if not have("not"):
+ write_shim(
+ "not",
+ "#!/usr/bin/env bash\n"
+ 'if [ "$1" = "--crash" ]; then shift; "$@"; ec=$?; '
+ "[ $ec -ge 128 ] && exit 0 || exit 1; fi\n"
+ '"$@"; ec=$?; [ $ec -eq 0 ] && exit 1 || exit 0\n',
+ )
+ return path
+
+
+# --- execution --------------------------------------------------------------
+
+
+def run_one(path, args, features, base_env):
+ runs, requires, unsupported = parse_test(path)
+
+ for expr in requires:
+ if not eval_requires(expr, features):
+ return "UNSUPPORTED", "missing requirement: %s" % expr
+ for expr in unsupported:
+ if eval_unsupported(expr, features):
+ return "UNSUPPORTED", "unsupported: %s" % expr
+ if not runs:
+ return "UNSUPPORTED", "no RUN lines"
+
+ workdir = Path(tempfile.mkdtemp(prefix="profgpu-"))
+ tprefix = str(workdir / "t")
+ subs = make_substitutions(
+ args.clang, args.clangxx, Path(path).resolve(), tprefix, args.amdgpu_arch,
+ args.hip_lib_path,
+ )
+
+ if args.dry_run:
+ print("# %s" % path)
+ for line in runs:
+ print(" " + apply_substitutions(line, subs).strip())
+ return "DRYRUN", ""
+
+ env = dict(base_env)
+ env["PATH"] = ensure_tools(Path(args.toolchain_bin), workdir)
+ timeout = args.timeout if args.timeout and args.timeout > 0 else None
+ for line in runs:
+ cmd = apply_substitutions(line, subs).strip()
+ try:
+ proc = subprocess.run(
+ ["bash", "-e", "-o", "pipefail", "-c", cmd],
+ cwd=str(workdir),
+ env=env,
+ capture_output=True,
+ text=True,
+ timeout=timeout,
+ )
+ except subprocess.TimeoutExpired as e:
+ out = e.stdout or ""
+ err = e.stderr or ""
+ if isinstance(out, bytes):
+ out = out.decode("utf-8", "replace")
+ if isinstance(err, bytes):
+ err = err.decode("utf-8", "replace")
+ detail = "RUN timed out after %gs: %s\n%s%s" % (
+ timeout, cmd, out, err,
+ )
+ if not args.keep:
+ shutil.rmtree(workdir, ignore_errors=True)
+ return "FAIL", detail
+ if proc.returncode != 0:
+ detail = "RUN failed (rc=%d): %s\n%s%s" % (
+ proc.returncode, cmd, proc.stdout, proc.stderr,
+ )
+ if not args.keep:
+ shutil.rmtree(workdir, ignore_errors=True)
+ return "FAIL", detail
+ if not args.keep:
+ shutil.rmtree(workdir, ignore_errors=True)
+ return "PASS", ""
+
+
+def discover(paths):
+ tests = []
+ for p in paths:
+ p = Path(p)
+ if p.is_dir():
+ tests.extend(sorted(str(x) for x in p.rglob("*.hip")))
+ elif p.is_file():
+ tests.append(str(p))
+ return tests
+
+
+def main():
+ ap = argparse.ArgumentParser(description=__doc__)
+ ap.add_argument("tests", nargs="+", help="Test files or directories")
+ ap.add_argument("--toolchain-bin", required=False,
+ help="Directory with clang and llvm-* tools")
+ ap.add_argument("--hip-lib-path", default="", help="Directory with libamdhip64")
+ ap.add_argument("--amdgpu-arch", default="native")
+ ap.add_argument("--clang", help="Override clang path")
+ ap.add_argument("--clangxx", help="Override clang++ path")
+ ap.add_argument("--feature", action="append", default=[],
+ help="Force-enable an extra lit feature")
+ ap.add_argument("--dry-run", action="store_true",
+ help="Print resolved RUN lines without executing")
+ ap.add_argument("--keep", action="store_true", help="Keep per-test temp dirs")
+ ap.add_argument("--timeout", type=float, default=600,
+ help="Per-RUN-line timeout in seconds (<=0 disables); "
+ "guards against a hung GPU/compiler wedging the run")
+ args = ap.parse_args()
+
+ if not args.dry_run and not args.toolchain_bin:
+ ap.error("--toolchain-bin is required unless --dry-run is given")
+
+ if args.toolchain_bin:
+ binp = Path(args.toolchain_bin)
+ args.clang = args.clang or str(binp / "clang")
+ args.clangxx = args.clangxx or str(binp / "clang++")
+ else:
+ args.clang = args.clang or "clang"
+ args.clangxx = args.clangxx or "clang++"
+
+ features = detect_features(args.toolchain_bin, args.feature)
+ print("# features: %s" % ", ".join(sorted(features)))
+
+ base_env = dict(os.environ)
+ if args.toolchain_bin:
+ lib_dirs = [
+ str(Path(args.toolchain_bin).parent / "lib"), # toolchain libs
+ ]
+ if args.hip_lib_path:
+ lib_dirs.append(args.hip_lib_path)
+ existing = base_env.get("LD_LIBRARY_PATH", "")
+ base_env["LD_LIBRARY_PATH"] = os.pathsep.join(
+ [d for d in lib_dirs if d] + ([existing] if existing else [])
+ )
+
+ tests = discover(args.tests)
+ if not tests:
+ print("error: no tests found", file=sys.stderr)
+ return 2
+
+ results = {"PASS": [], "FAIL": [], "UNSUPPORTED": [], "DRYRUN": []}
+ for t in tests:
+ status, detail = run_one(t, args, features, base_env)
+ results[status].append(t)
+ if status == "FAIL":
+ print("FAIL: %s" % t)
+ print(detail)
+ elif status in ("PASS", "UNSUPPORTED"):
+ print("%s: %s" % (status, t))
+
+ print(
+ "\n# summary: %d passed, %d failed, %d unsupported (of %d)"
+ % (len(results["PASS"]), len(results["FAIL"]),
+ len(results["UNSUPPORTED"]), len(tests))
+ )
+ return 1 if results["FAIL"] else 0
+
+
+if __name__ == "__main__":
+ sys.exit(main())
More information about the cfe-commits
mailing list