[clang] [compiler-rt] [PGO][HIP] HSA-introspection device profile drain + GPU PGO tests (PR #203056)
via cfe-commits
cfe-commits at lists.llvm.org
Sun Jun 14 12:28:12 PDT 2026
llvmorg-github-actions[bot] wrote:
@llvm/pr-subscribers-pgo
Author: Larry Meadows (lfmeadow)
<details>
<summary>Changes</summary>
## Summary
Follow-up to #202095 (now landed). The host-shadow device-profile drain added in
that PR can only collect device counters for code objects that registered a
host-side shadow via `__hipRegisterVar` or were loaded through an intercepted
`hipModuleLoad*` call. Device-linked programs (e.g. RCCL), whose instrumented
code objects are linked directly into the device image with no host shadow, are
never drained.
This adds a **supplemental, Linux-only HSA-introspection drain** that runs after
the host-shadow drain: it walks each GPU agent, enumerates only the code objects
actually resident there, reads each one's `__llvm_profile_sections` table on the
device, and routes each table through the existing `processDeviceOffloadPrf()`
path so the emitted `.profraw` layout is identical. A content-dedup set keyed on
the `(data, counters, names)` device-pointer triple ensures that a section
already drained by the host-shadow pass is not drained again, so the two passes
compose without double-counting.
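The dedup step can be pictured with a minimal sketch. It is not the code from `InstrProfilingPlatformROCm.cpp`: the `DrainedSet` class and its method names are hypothetical, and `std::set` is used only for brevity (the profile runtime may well avoid the C++ standard library and use a different container). The only idea taken from the description above is the key: the `(data, counters, names)` device-pointer triple.

```cpp
#include <cstdint>
#include <set>
#include <tuple>

// Hypothetical illustration of a content-dedup set keyed on the
// (data, counters, names) device-pointer triple. Names are assumptions,
// not identifiers from the patch.
class DrainedSet {
public:
  // Returns true if the triple was newly recorded (the caller should drain
  // the section), false if an earlier pass already recorded it.
  bool recordIfNew(const void *Data, const void *Counters, const void *Names) {
    return Seen.insert(makeKey(Data, Counters, Names)).second;
  }

  // Returns true if the triple has already been recorded.
  bool contains(const void *Data, const void *Counters,
                const void *Names) const {
    return Seen.count(makeKey(Data, Counters, Names)) != 0;
  }

private:
  using Key = std::tuple<std::uintptr_t, std::uintptr_t, std::uintptr_t>;

  // Device pointers are compared by value only; they are never dereferenced.
  static Key makeKey(const void *Data, const void *Counters,
                     const void *Names) {
    return Key(reinterpret_cast<std::uintptr_t>(Data),
               reinterpret_cast<std::uintptr_t>(Counters),
               reinterpret_cast<std::uintptr_t>(Names));
  }

  std::set<Key> Seen;
};
```

In this sketch the host-shadow pass would call `recordIfNew` after each successful drain, and the HSA pass would skip any section for which `recordIfNew` returns false.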
The change is purely additive: it does not modify the #202095 host-shadow drain
or its launch tracking. Highlights:
- `compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp`: HSA agent/segment/
symbol walk + dedup; record drained bounds after each host-shadow drain; lazy
HSA init (no library constructor, for fork-safety).
- Because the HSA walk only touches resident code objects, it lets us avoid the
host-shadow drain's collect-all fallback on Linux. When **no** kernel launch was
tracked (the program never launches, collects before its first launch, or
launches only via an untracked API), the host-shadow pass is skipped and the HSA
drain covers it safely, instead of faulting or hanging while reading from a
non-resident device on a multi-GPU host. This also closes the silent-data-loss
gap for untracked launch APIs (`hipExtLaunchKernel`, cooperative/graph launches).
- `clang/lib/Driver/ToolChains/Clang.cpp` / `HIPAMD.cpp`: link the device profile
runtime on both the new-offload-driver (`LinkerWrapper::ConstructJob`) and
traditional (`lld`) link paths, guarded by `needsProfileRT` and a VFS existence
check on the runtime archive.
- New GPU/AMDGPU HIP device-PGO tests, a dependency-free `run_gpu_tests.py`
"lit-lite" runner (no `llvm-lit`/in-tree `FileCheck` required), and a
`device-pgo/` standalone build helper.
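The lazy HSA initialization mentioned in the first highlight follows a tri-state pattern (0 = not yet attempted, 1 = ready, -1 = unavailable) published with acquire/release atomics. The sketch below restates that pattern with `std::atomic` rather than the `__atomic_*` builtins used in the patch. `resolveEntryPoints`, `ensureRuntimeLoaded`, and `InitAttempts` are hypothetical names for illustration; the real loader opens `libhsa-runtime64.so` and resolves the HSA symbols.

```cpp
#include <atomic>

// 0 = not yet attempted, 1 = ready, -1 = unavailable.
static std::atomic<int> RuntimeState{0};

// Illustration-only counter showing how often the loader body ran.
static int InitAttempts = 0;

// Hypothetical stand-in for opening the runtime library and resolving its
// entry points. Returns true on success.
static bool resolveEntryPoints() {
  ++InitAttempts;
  return true;
}

// Returns 0 if the runtime is usable, -1 otherwise. Nothing runs at library
// load time; the first caller performs the resolution.
static int ensureRuntimeLoaded() {
  int State = RuntimeState.load(std::memory_order_acquire);
  if (State != 0)
    return State > 0 ? 0 : -1;

  // Function pointers would be written here, before the release store, so a
  // thread that later observes State == 1 with acquire also sees them.
  int NewState = resolveEntryPoints() ? 1 : -1;
  RuntimeState.store(NewState, std::memory_order_release);
  return NewState > 0 ? 0 : -1;
}
```

The sketch does not serialize two threads that both observe state 0 at the same time. It only shows the publication ordering and the property that merely loading the library initializes nothing.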
## Why a separate test harness
There are no AMD GPUs in upstream CI, so these `.hip` tests don't run in-tree;
`run_gpu_tests.py` lets a downstream GPU CI (e.g. ROCm/TheRock) execute them
against an installed toolchain. It parses the `REQUIRES`/`UNSUPPORTED`/`RUN`
slice of lit markup, applies a fixed substitution set, detects `multi-device`
from the runtime-visible GPU count, and provides `FileCheck`/`not` shims when the
real binaries aren't in the artifact.
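To make the "slice of lit markup" concrete, here is a rough sketch of extracting `REQUIRES`/`UNSUPPORTED`/`RUN` directives from a test file. `run_gpu_tests.py` itself is Python and its implementation is not shown in the truncated patch; this C++ version (kept in the same language as the rest of the patch) is an assumption about the general approach, and every name in it is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct LitDirectives {
  std::vector<std::string> Requires;
  std::vector<std::string> Unsupported;
  std::vector<std::string> Run;
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string &S) {
  const char *WS = " \t\r\n";
  std::size_t B = S.find_first_not_of(WS);
  if (B == std::string::npos)
    return "";
  return S.substr(B, S.find_last_not_of(WS) - B + 1);
}

// Split a comma-separated feature list into trimmed, non-empty items.
static void splitFeatures(const std::string &List,
                          std::vector<std::string> &Out) {
  std::istringstream In(List);
  std::string Item;
  while (std::getline(In, Item, ',')) {
    Item = trim(Item);
    if (!Item.empty())
      Out.push_back(Item);
  }
}

// If Line contains "<Key>:", store the trimmed remainder in Payload.
static bool matchDirective(const std::string &Line, const std::string &Key,
                           std::string &Payload) {
  std::size_t Pos = Line.find(Key + ":");
  if (Pos == std::string::npos)
    return false;
  Payload = trim(Line.substr(Pos + Key.size() + 1));
  return true;
}

// Collect directives; a RUN line ending in '\' continues on the next RUN line.
LitDirectives parseLitDirectives(const std::string &Source) {
  LitDirectives D;
  std::istringstream In(Source);
  std::string Line, Payload;
  bool Continued = false;
  while (std::getline(In, Line)) {
    if (matchDirective(Line, "RUN", Payload)) {
      bool More = !Payload.empty() && Payload.back() == '\\';
      if (More) {
        Payload.pop_back();
        Payload = trim(Payload);
      }
      if (Continued && !D.Run.empty())
        D.Run.back() += " " + Payload;
      else
        D.Run.push_back(Payload);
      Continued = More;
    } else if (matchDirective(Line, "REQUIRES", Payload)) {
      splitFeatures(Payload, D.Requires);
    } else if (matchDirective(Line, "UNSUPPORTED", Payload)) {
      splitFeatures(Payload, D.Unsupported);
    }
  }
  return D;
}
```

A runner built on this would compare `Requires` and `Unsupported` against the features it detects (for example `multi-device`), apply its substitutions to each `Run` entry, and then execute the commands in order.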
## Test plan
- 4x gfx90a (`gfx90a:sramecc+:xnack-`), ROCm 7.1.
- `python3 compiler-rt/test/profile/run_gpu_tests.py --toolchain-bin <abs>/bin --hip-lib-path /opt/rocm/lib compiler-rt/test/profile/GPU compiler-rt/test/profile/AMDGPU`
- **12 passed, 0 failed, 0 unsupported.** Covers: basic/coverage/pgo-use,
multiple-kernels, device-branching, multi-gpu and non-default-device drain,
early-collect / no-kernel edges, RDC vs non-RDC `__llvm_profile_sections`,
dedup (host-shadow drains the used device, HSA finds it and dedups), and
fork-safety (the RCCL parent-no-HIP / kernel-in-forked-child pattern).
- Build is warning-clean and `git clang-format` clean.
---
Patch is 99.22 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/203056.diff
26 Files Affected:
- (modified) clang/lib/Driver/ToolChains/Clang.cpp (+15)
- (modified) clang/lib/Driver/ToolChains/HIPAMD.cpp (+20)
- (modified) clang/lib/Driver/ToolChains/Linux.cpp (+18-3)
- (modified) clang/lib/Driver/ToolChains/MSVC.cpp (+19-12)
- (modified) clang/test/Driver/hip-profile-rocm-runtime.hip (+16-2)
- (modified) compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp (+630-52)
- (added) compiler-rt/test/profile/AMDGPU/device-basic.hip (+67)
- (added) compiler-rt/test/profile/AMDGPU/device-early-collect.hip (+68)
- (added) compiler-rt/test/profile/AMDGPU/device-no-kernel.hip (+44)
- (added) compiler-rt/test/profile/AMDGPU/device-symbols.hip (+42)
- (added) compiler-rt/test/profile/AMDGPU/lit.local.cfg.py (+4)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-basic.hip (+51)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-collect-after.hip (+63)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-counter-correctness.hip (+56)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-coverage.hip (+51)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-device-branching.hip (+67)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-fork-safety.hip (+61)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-gpu.hip (+57)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multi-process-merge.hip (+63)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-multiple-kernels.hip (+58)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-nondefault-device.hip (+60)
- (added) compiler-rt/test/profile/GPU/instrprof-hip-pgo-use.hip (+63)
- (added) compiler-rt/test/profile/device-pgo/README.md (+125)
- (added) compiler-rt/test/profile/device-pgo/build.sh (+56)
- (added) compiler-rt/test/profile/device-pgo/toolchain-cache.cmake (+55)
- (added) compiler-rt/test/profile/run_gpu_tests.py (+408)
``````````diff
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index c2ac478d84929..3b8bc46820af6 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9658,6 +9658,21 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
(TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX()))
LinkerArgs.emplace_back("-lompdevice");
+ // With PGO/coverage instrumentation, GPU device code references the
+ // device profile runtime (__llvm_profile_instrument_gpu and the
+ // __llvm_profile_sections bounds table emitted by
+ // InstrProfilingPlatformGPU). The offload device link does not otherwise
+ // pull it in, so forward the static device profile runtime to the GPU
+ // device linker. The archive is arch-suffixed, so pass its full path
+ // rather than a -l name.
+ if (ToolChain::needsProfileRT(Args) &&
+ (TC->getTriple().isAMDGPU() || TC->getTriple().isNVPTX())) {
+ std::string ProfileRT =
+ TC->getCompilerRT(Args, "profile", ToolChain::FT_Static);
+ if (TC->getVFS().exists(ProfileRT))
+ LinkerArgs.emplace_back(Args.MakeArgString(ProfileRT));
+ }
+
// For SPIR-V, pass some extra flags to `spirv-link`, the out-of-tree
// SPIR-V linker. `spirv-link` isn't called in LTO mode so restrict these
// flags to normal compilation.
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 01cb23d0aa230..1bd4e073b4e27 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -19,6 +19,7 @@
#include "clang/Options/Options.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Path.h"
+#include "llvm/Support/VirtualFileSystem.h"
#include "llvm/TargetParser/TargetParser.h"
using namespace clang::driver;
@@ -142,6 +143,25 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
LldArgs.push_back("--no-whole-archive");
+ // With PGO/coverage instrumentation, instrumented device code references the
+ // device profile runtime (__llvm_profile_instrument_gpu and the
+ // __llvm_profile_sections bounds table emitted by InstrProfilingPlatformGPU).
+ // The new-offload-driver path injects this in LinkerWrapper::ConstructJob,
+ // but HIP using the traditional offload path (e.g. on Windows, which does not
+ // route device linking through clang-linker-wrapper) reaches the device link
+ // here instead. Forward the static device profile runtime to this lld device
+ // link so the runtime is pulled in regardless of offload-driver/host OS. The
+ // archive is arch-suffixed, so pass its full path rather than a -l name.
+ if (ToolChain::needsProfileRT(Args)) {
+ std::string ProfileRT =
+ TC.getCompilerRT(Args, "profile", ToolChain::FT_Static);
+ // Use the ToolChain VFS (matches the new-offload-driver path in
+ // Clang.cpp) so overlay/virtual filesystems used by the driver are
+ // honored; llvm::sys::fs bypasses them and can wrongly skip the runtime.
+ if (TC.getVFS().exists(ProfileRT))
+ LldArgs.push_back(Args.MakeArgString(ProfileRT));
+ }
+
const char *Lld = Args.MakeArgStringRef(getToolChain().GetProgramPath("lld"));
C.addCommand(std::make_unique<Command>(JA, *this, ResponseFileSupport::None(),
Lld, LldArgs, Inputs, Output));
diff --git a/clang/lib/Driver/ToolChains/Linux.cpp b/clang/lib/Driver/ToolChains/Linux.cpp
index 512788d235fec..00ae53af4865f 100644
--- a/clang/lib/Driver/ToolChains/Linux.cpp
+++ b/clang/lib/Driver/ToolChains/Linux.cpp
@@ -906,13 +906,28 @@ void Linux::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
Args.MakeArgString(StringRef("-L") + RocmInstallation->getLibPath()));
// For HIP device PGO, link clang_rt.profile_rocm when available. It is a
- // self-contained superset of clang_rt.profile, emitted first so the base
- // archive stays inert.
- if ((ActiveKinds & Action::OFK_HIP) && needsProfileRT(Args) &&
+ // self-contained superset of clang_rt.profile, emitted first (before the
+ // base archive added by addProfileRTLibs) so the base archive stays inert.
+ //
+ // This is intentionally not gated on Action::OFK_HIP. HIP host objects are
+ // routinely linked into a shared library or executable from pre-compiled
+ // .o files (e.g. RCCL's librccl.so), a link command that carries no HIP
+ // offload action yet still needs the device-counter drain. Gating on
+ // OFK_HIP would silently drop the drain for those object-only links and
+ // the resulting .profraw would contain host counters only. profile_rocm is
+ // self-contained and both its hipModuleLoad interceptor and its
+ // device-collection drain self-skip at runtime when the process has no
+ // resident device code, so linking it into a non-HIP instrumented binary is
+ // harmless. It is only present on ROCm-equipped toolchains in the first
+ // place (the getVFS().exists check below).
+ if (needsProfileRT(Args) &&
getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
// Force-retain the constructor-only hipModuleLoad* interceptor object; its
// constructor self-skips when the program does not use hipModuleLoad.
+ // Pulling this object in also pulls the device-counter drain
+ // (__llvm_profile_hip_collect_device_data) from the same translation unit,
+ // which InstrProfilingFile.c invokes through a weak reference at exit.
CmdArgs.push_back("-u");
CmdArgs.push_back("__llvm_profile_offload_register_dynamic_module");
}
diff --git a/clang/lib/Driver/ToolChains/MSVC.cpp b/clang/lib/Driver/ToolChains/MSVC.cpp
index 0796bdff96d46..9a7df6af7727c 100644
--- a/clang/lib/Driver/ToolChains/MSVC.cpp
+++ b/clang/lib/Driver/ToolChains/MSVC.cpp
@@ -598,19 +598,26 @@ void MSVCToolChain::addOffloadRTLibs(unsigned ActiveKinds, const ArgList &Args,
CmdArgs.append({Args.MakeArgString(StringRef("-libpath:") +
RocmInstallation->getLibPath()),
"amdhip64.lib"});
+ }
- // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
- // self-contained superset of clang_rt.profile, emitted first so the base
- // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
- if (needsProfileRT(Args) &&
- getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
- CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
- // Force the linker to retain the constructor-only hipModuleLoad*
- // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
- // constructor self-skips for programs that do not use hipModuleLoad.
- CmdArgs.push_back(
- "-include:__llvm_profile_offload_register_dynamic_module");
- }
+ // For HIP device PGO, link clang_rt.profile_rocm when available. It is a
+ // self-contained superset of clang_rt.profile, emitted first so the base
+ // archive stays inert (avoiding a /MD-vs-/MT CRT mix in the host image).
+ //
+ // Not gated on Action::OFK_HIP: HIP host objects are routinely linked into a
+ // DLL or executable from pre-compiled .obj files, a link that carries no HIP
+ // offload action yet still needs the device-counter drain (see Linux.cpp for
+ // the full rationale). profile_rocm self-skips at runtime when the process
+ // has no resident device code, and is only present on ROCm-equipped
+ // toolchains (the getVFS().exists check below).
+ if (needsProfileRT(Args) &&
+ getVFS().exists(getCompilerRT(Args, "profile_rocm", FT_Static))) {
+ CmdArgs.push_back(getCompilerRTArgString(Args, "profile_rocm"));
+ // Force the linker to retain the constructor-only hipModuleLoad*
+ // interceptor object from clang_rt.profile_rocm (see Linux.cpp). The
+ // constructor self-skips for programs that do not use hipModuleLoad.
+ CmdArgs.push_back(
+ "-include:__llvm_profile_offload_register_dynamic_module");
}
}
diff --git a/clang/test/Driver/hip-profile-rocm-runtime.hip b/clang/test/Driver/hip-profile-rocm-runtime.hip
index fc82db4fc13c0..9346f05dedf42 100644
--- a/clang/test/Driver/hip-profile-rocm-runtime.hip
+++ b/clang/test/Driver/hip-profile-rocm-runtime.hip
@@ -25,9 +25,23 @@
// RUN: | FileCheck -check-prefix=HIP-NOPGO %s
// HIP-NOPGO-NOT: libclang_rt.profile_rocm.a
-// A non-HIP host link with PGO does not link the ROCm device-profile runtime.
+// An object-only host link with PGO (no HIP offload action) still links the
+// ROCm device-profile runtime when it is available in the toolchain. HIP host
+// code is frequently linked into a library/executable from pre-compiled
+// objects, a link that carries no OFK_HIP yet still needs the device drain.
// RUN: %clang -### --target=x86_64-unknown-linux \
// RUN: -fprofile-instr-generate -resource-dir=%t %t.o 2>&1 \
// RUN: | FileCheck -check-prefix=HOST-PGO %s
+// HOST-PGO: "{{.*}}libclang_rt.profile_rocm.a"
+// HOST-PGO: "-u" "__llvm_profile_offload_register_dynamic_module"
// HOST-PGO: "{{.*}}libclang_rt.profile.a"
-// HOST-PGO-NOT: libclang_rt.profile_rocm.a
+
+// On a lean toolchain that ships only the base profile runtime (no
+// profile_rocm), nothing extra is linked and the link still succeeds.
+// RUN: rm -rf %t2 && mkdir -p %t2/lib/x86_64-unknown-linux
+// RUN: touch %t2/lib/x86_64-unknown-linux/libclang_rt.profile.a
+// RUN: %clang -### --target=x86_64-unknown-linux \
+// RUN: -fprofile-instr-generate -resource-dir=%t2 %t.o 2>&1 \
+// RUN: | FileCheck -check-prefix=NO-ROCM-RT %s
+// NO-ROCM-RT: "{{.*}}libclang_rt.profile.a"
+// NO-ROCM-RT-NOT: libclang_rt.profile_rocm.a
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
index d0d9b1ea8f61d..b1db1d8a74041 100644
--- a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.cpp
@@ -66,6 +66,15 @@ struct OffloadSectionShadowGroup;
static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
const OffloadSectionShadowGroup *Sections);
+#if defined(__linux__) && !defined(_WIN32)
+// Record a drained section-bounds tuple so the supplemental HSA-introspection
+// pass (Linux only) skips any code object the host-shadow path already
+// drained. Defined alongside the HSA drain below; forward-declared here so
+// processDeviceOffloadPrf can register every successful host-shadow drain.
+static void profRecordDrainedBounds(const void *Data, const void *Counters,
+ const void *Names);
+#endif
+
static int isVerboseMode() {
static int IsVerbose = -1;
if (IsVerbose == -1)
@@ -1119,8 +1128,14 @@ static int processDeviceOffloadPrf(void *DeviceOffloadPrf, const char *Target,
if (ret != 0) {
PROF_ERR("%s\n", "failed to write device profile using shared API");
- } else if (isVerboseMode()) {
- PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+ } else {
+#if defined(__linux__) && !defined(_WIN32)
+ // Dedup against the supplemental HSA pass: this section is now drained, so
+ // the HSA walk must not drain the same device code object again.
+ profRecordDrainedBounds(DevDataBegin, DevCntsBegin, DevNamesBegin);
+#endif
+ if (isVerboseMode())
+ PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
}
return ret;
@@ -1148,72 +1163,635 @@ static int isHipAvailable(void) {
return pHipMemcpy != nullptr && pHipGetSymbolAddress != nullptr;
}
-/* -------------------------------------------------------------------------- */
-/* Collect device-side profile data */
-/* -------------------------------------------------------------------------- */
+/* ========================================================================== */
+/* Supplemental HSA-introspection drain (Linux only) */
+/* */
+/* The host-shadow drain above only sees device code objects registered */
+/* host-side (__hipRegisterVar shadows) or loaded through an intercepted */
+/* hipModuleLoad* call. Device code linked by the offload device linker with */
+/* no host-side shadow -- e.g. RCCL, whose many device functions are glued */
+/* into a single kernel with no source module -- is invisible to it. This */
+/* pass walks every GPU agent's loaded executables via HSA, finds each */
+/* __llvm_profile_sections table directly on the device, and drains the ones */
+/* the host-shadow pass did not already handle (deduped by the device */
+/* section-bounds tuple). It reuses processDeviceOffloadPrf() for the */
+/* copy/relocate/write so the on-disk profraw layout is identical. */
+/* ========================================================================== */
+#if defined(__linux__) && !defined(_WIN32)
-extern "C" int __llvm_profile_hip_collect_device_data(void) {
- if (NumShadowVariables == 0 && NumDynamicModules == 0)
+/* Minimal HSA type/enum stubs. compiler-rt cannot depend on ROCm headers at
+ * build time, so mirror just the handful of HSA declarations the drain needs.
+ * Values match hsa/hsa.h and hsa/hsa_ven_amd_loader.h. */
+typedef uint32_t prof_hsa_status_t;
+#define PROF_HSA_STATUS_SUCCESS ((prof_hsa_status_t)0x0)
+#define PROF_HSA_STATUS_INFO_BREAK ((prof_hsa_status_t)0x1)
+
+typedef struct {
+ uint64_t handle;
+} prof_hsa_agent_t;
+typedef struct {
+ uint64_t handle;
+} prof_hsa_executable_t;
+typedef struct {
+ uint64_t handle;
+} prof_hsa_executable_symbol_t;
+
+typedef uint32_t prof_hsa_agent_info_t;
+#define PROF_HSA_AGENT_INFO_NAME ((prof_hsa_agent_info_t)0)
+#define PROF_HSA_AGENT_INFO_DEVICE ((prof_hsa_agent_info_t)17)
+
+typedef uint32_t prof_hsa_device_type_t;
+#define PROF_HSA_DEVICE_TYPE_GPU ((prof_hsa_device_type_t)1)
+
+typedef uint32_t prof_hsa_symbol_kind_t;
+#define PROF_HSA_SYMBOL_KIND_VARIABLE ((prof_hsa_symbol_kind_t)0)
+
+typedef uint32_t prof_hsa_executable_symbol_info_t;
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_TYPE \
+ ((prof_hsa_executable_symbol_info_t)0)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME_LENGTH \
+ ((prof_hsa_executable_symbol_info_t)1)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_NAME \
+ ((prof_hsa_executable_symbol_info_t)2)
+#define PROF_HSA_EXECUTABLE_SYMBOL_INFO_VARIABLE_ADDRESS \
+ ((prof_hsa_executable_symbol_info_t)21)
+
+#define PROF_HSA_EXTENSION_AMD_LOADER ((uint16_t)0x201)
+
+typedef uint32_t prof_hsa_loader_storage_type_t;
+
+typedef struct {
+ prof_hsa_agent_t agent;
+ prof_hsa_executable_t executable;
+ prof_hsa_loader_storage_type_t code_object_storage_type;
+ const void *code_object_storage_base;
+ size_t code_object_storage_size;
+ size_t code_object_storage_offset;
+ const void *segment_base;
+ size_t segment_size;
+} prof_hsa_loader_segment_descriptor_t;
+
+typedef prof_hsa_status_t (*hsa_init_ty)(void);
+typedef prof_hsa_status_t (*hsa_iterate_agents_ty)(
+ prof_hsa_status_t (*)(prof_hsa_agent_t, void *), void *);
+typedef prof_hsa_status_t (*hsa_agent_get_info_ty)(prof_hsa_agent_t,
+ prof_hsa_agent_info_t,
+ void *);
+typedef prof_hsa_status_t (*hsa_executable_iterate_agent_symbols_ty)(
+ prof_hsa_executable_t, prof_hsa_agent_t,
+ prof_hsa_status_t (*)(prof_hsa_executable_t, prof_hsa_agent_t,
+ prof_hsa_executable_symbol_t, void *),
+ void *);
+typedef prof_hsa_status_t (*hsa_executable_symbol_get_info_ty)(
+ prof_hsa_executable_symbol_t, prof_hsa_executable_symbol_info_t, void *);
+typedef prof_hsa_status_t (*hsa_system_get_major_extension_table_ty)(uint16_t,
+ uint16_t,
+ size_t,
+ void *);
+typedef prof_hsa_status_t (*hsa_loader_query_segment_descriptors_ty)(
+ prof_hsa_loader_segment_descriptor_t *, size_t *);
+
+/* First two members of hsa_ven_amd_loader_1_00_pfn_t. Only
+ * query_segment_descriptors is used; query_host_address keeps the offset. */
+typedef struct {
+ void *query_host_address;
+ hsa_loader_query_segment_descriptors_ty query_segment_descriptors;
+} prof_hsa_loader_pfn_t;
+
+static hsa_iterate_agents_ty pHsaIterateAgents = nullptr;
+static hsa_agent_get_info_ty pHsaAgentGetInfo = nullptr;
+static hsa_executable_iterate_agent_symbols_ty pHsaExecIterAgentSyms = nullptr;
+static hsa_executable_symbol_get_info_ty pHsaSymGetInfo = nullptr;
+static hsa_loader_query_segment_descriptors_ty pQuerySegDescs = nullptr;
+
+/* 0 = not yet attempted, 1 = ready, -1 = unavailable. Accessed with acquire/
+ * release atomics: a thread observing HsaRuntimeState==1 (acquire) also sees
+ * the fully-written p* function pointers (published before the release store
+ * of HsaRuntimeState=1 below). */
+static int HsaRuntimeState = 0;
+
+static int setHsaRuntimeState(int S) {
+ __atomic_store_n(&HsaRuntimeState, S, __ATOMIC_RELEASE);
+ return S > 0 ? 0 : -1;
+}
+
+/* Resolve HSA entry points (and the AMD loader extension) once, and confirm
+ * HIP's hipMemcpy is reachable for the device-to-host copies. HIP itself is
+ * resolved by the shared ensureHipLoaded() above. */
+static int loadHsaRuntimePointers(void) {
+ int State = __atomic_load_n(&HsaRuntimeState, __ATOMIC_ACQUIRE);
+ if (State)
+ return State > 0 ? 0 : -1;
+
+ if (!__interception::DynamicLoaderAvailable()) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "Dynamic library loading not available - "
+ "HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ void *Hsa = __interception::OpenLibrary("libhsa-runtime64.so");
+ if (!Hsa)
+ Hsa = __interception::OpenLibrary("libhsa-runtime64.so.1");
+ if (!Hsa) {
+ if (isVerboseMode())
+ PROF_NOTE("%s", "libhsa-runtime64.so not loadable - "
+ "HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ hsa_init_ty pHsaInit =
+ (hsa_init_ty)__interception::LookupSymbol(Hsa, "hsa_init");
+ hsa_system_get_major_extension_table_ty pGetExtTable =
+ (hsa_system_get_major_extension_table_ty)__interception::LookupSymbol(
+ Hsa, "hsa_system_get_major_extension_table");
+ pHsaIterateAgents = (hsa_iterate_agents_ty)__interception::LookupSymbol(
+ Hsa, "hsa_iterate_agents");
+ pHsaAgentGetInfo = (hsa_agent_get_info_ty)__interception::LookupSymbol(
+ Hsa, "hsa_agent_get_info");
+ pHsaExecIterAgentSyms =
+ (hsa_executable_iterate_agent_symbols_ty)__interception::LookupSymbol(
+ Hsa, "hsa_executable_iterate_agent_symbols");
+ pHsaSymGetInfo =
+ (hsa_executable_symbol_get_info_ty)__interception::LookupSymbol(
+ Hsa, "hsa_executable_symbol_get_info");
+
+ if (!pHsaInit || !pGetExtTable || !pHsaIterateAgents || !pHsaAgentGetInfo ||
+ !pHsaExecIterAgentSyms || !pHsaSymGetInfo) {
+ PROF_WARN("%s",
+ "required HSA symbols missing - HSA device profiling disabled\n");
+ return setHsaRuntimeState(-1);
+ }
+
+ /* Bring HSA up (idempotent, refcounted). This runs lazily on the first drain
+ * rather than from the library constructor, so merely loading the
+ * instrumented library does not initialize HSA in the process -- which would
+ * break fork-based callers that deliberately keep HIP/HSA uninitialized in
+ * the parent (see the constructor note at the end of the HSA block). In the
+ * common case the drain runs from the profile write path while HSA is still
+ * alive; if it only runs after HSA's own atexit(hsa_shut_down) has executed,
+ * this simply re-initializes HSA (the process is exiting anyway). */
+ prof_hsa_status_t St = pHsaInit();
+ if (St != PROF_HSA_STATUS_SUCCESS && St != PROF_HSA_STATUS_INFO_BREAK) {
+ if (isVerboseMode())
+ PROF_NOTE("hsa_init failed (0x%x) - HSA device ...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/203056