[clang] [compiler-rt] [llvm] [PGO][AMDGPU] Add offload profiling with uniformity-aware optimization (PR #177665)
Yaxun Liu via cfe-commits
cfe-commits at lists.llvm.org
Wed Mar 18 07:12:09 PDT 2026
https://github.com/yxsamliu updated https://github.com/llvm/llvm-project/pull/177665
From 7bad09dbb3dd3b9cd8dd048473f503adc6a217f9 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun.liu at amd.com>
Date: Tue, 17 Mar 2026 23:59:49 -0400
Subject: [PATCH] [PGO][AMDGPU] Add offload profiling infrastructure for HIP
Core PGO infrastructure for device-side profile-guided optimization on
AMDGPU. Instruments HIP device code with profiling counters, collects
profile data from GPU at program exit, and supports PGO-use for
optimized recompilation.
Key components:
- Contiguous per-TU counter layout with CUID-based symbol naming
- Device profile collection via HIP runtime (atexit handler)
- Multi-GPU and multi-TU support with automatic profile file naming
- Wave-size embedding in raw profile data (no --wave-size CLI needed)
- Uniform counter tracking for divergence-aware spill placement
- GPU profile library (InstrProfilingGPU.c) using gpuintrin.h:
__gpu_pgo_is_sampled (block sampling), __gpu_pgo_increment
(warp-aggregate atomic counter update)
- Driver linking of profile.bc at device link time
- Sampling via -offload-pgo-sampling=N (default 3 = 12.5% of blocks)
- Dynamic module profiling via ELF symbol enumeration
- Windows platform abstraction (LoadLibrary/GetProcAddress)
- HIP PGO documentation in clang/docs/HIPSupport.rst
---
clang/docs/HIPSupport.rst | 77 +-
clang/lib/Driver/ToolChains/Clang.cpp | 12 +
clang/lib/Driver/ToolChains/HIPAMD.cpp | 8 +
compiler-rt/include/profile/InstrProfData.inc | 23 +-
compiler-rt/lib/profile/CMakeLists.txt | 31 +
compiler-rt/lib/profile/InstrProfiling.h | 42 +-
compiler-rt/lib/profile/InstrProfilingFile.c | 96 +-
compiler-rt/lib/profile/InstrProfilingGPU.c | 69 ++
.../lib/profile/InstrProfilingInternal.h | 7 +
.../lib/profile/InstrProfilingPlatformROCm.c | 805 +++++++++++++++
llvm/include/llvm/IR/FixedMetadataKinds.def | 1 +
llvm/include/llvm/ProfileData/InstrProf.h | 55 +-
.../llvm/ProfileData/InstrProfData.inc | 23 +-
.../llvm/ProfileData/InstrProfWriter.h | 1 +
.../llvm/Transforms/Instrumentation/CFGMST.h | 26 +-
llvm/lib/Passes/StandardInstrumentations.cpp | 10 +-
llvm/lib/ProfileData/InstrProf.cpp | 141 ++-
llvm/lib/ProfileData/InstrProfCorrelator.cpp | 3 +
llvm/lib/ProfileData/InstrProfReader.cpp | 61 +-
llvm/lib/ProfileData/InstrProfWriter.cpp | 78 +-
.../Instrumentation/InstrProfiling.cpp | 951 +++++++++++++++++-
.../Instrumentation/PGOInstrumentation.cpp | 88 +-
.../InstrProfiling/amdgpu-3d-grid.ll | 27 +
.../amdgpu-contiguous-counters.ll | 44 +
.../InstrProfiling/amdgpu-instrumentation.ll | 57 ++
.../InstrProfiling/amdgpu-uniform-counters.ll | 21 +
.../InstrProfiling/amdgpu-wave32.ll | 29 +
.../InstrProfiling/amdgpu-wave64.ll | 28 +
.../InstrProfiling/coverage.ll | 8 +-
.../InstrProfiling/gpu-weak.ll | 38 +
.../InstrProfiling/inline-data-var-create.ll | 23 +-
.../InstrProfiling/platform.ll | 12 +
.../thinlto_indirect_call_promotion.profraw | Bin 544 -> 576 bytes
.../amdgpu-disable-value-profiling.ll | 23 +
.../Transforms/PGOProfile/comdat_internal.ll | 4 +-
.../instrprof_burst_sampling_fast.ll | 2 +-
.../Transforms/PGOProfile/vtable_profile.ll | 2 +-
.../llvm-profdata/Inputs/c-general.profraw | Bin 2032 -> 2128 bytes
.../llvm-profdata/Inputs/compressed.profraw | Bin 1984 -> 2080 bytes
.../llvm-profdata/binary-ids-padding.test | 6 +-
.../llvm-profdata/large-binary-id-size.test | 2 +-
...alformed-not-space-for-another-header.test | 3 +-
.../malformed-num-counters-zero.test | 5 +-
.../malformed-ptr-to-counter-array.test | 3 +-
.../misaligned-binary-ids-size.test | 2 +-
.../tools/llvm-profdata/profile-version.test | 2 +-
.../tools/llvm-profdata/raw-32-bits-be.test | 10 +-
.../tools/llvm-profdata/raw-32-bits-le.test | 10 +-
.../tools/llvm-profdata/raw-64-bits-be.test | 10 +-
.../tools/llvm-profdata/raw-64-bits-le.test | 10 +-
.../tools/llvm-profdata/raw-two-profiles.test | 6 +-
llvm/tools/llvm-profdata/llvm-profdata.cpp | 106 ++
.../common/include/GlobalHandler.h | 8 +-
.../common/src/GlobalHandler.cpp | 6 +-
54 files changed, 2948 insertions(+), 167 deletions(-)
create mode 100644 compiler-rt/lib/profile/InstrProfilingGPU.c
create mode 100644 compiler-rt/lib/profile/InstrProfilingPlatformROCm.c
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-3d-grid.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-contiguous-counters.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-uniform-counters.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-wave32.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/amdgpu-wave64.ll
create mode 100644 llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
create mode 100644 llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index e7f00202c34e7..ea46a2c3afcdb 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -53,20 +53,20 @@ To compile a HIP program, use the following command:
.. code-block:: shell
- clang++ -c --offload-arch=gfx906 -xhip sample.cpp -o sample.o
+ clang++ -c --offload-arch=gfx1200 -xhip sample.cpp -o sample.o
The ``-xhip`` option indicates that the source is a HIP program. If the file has a ``.hip`` extension,
Clang will automatically recognize it as a HIP program:
.. code-block:: shell
- clang++ -c --offload-arch=gfx906 sample.hip -o sample.o
+ clang++ -c --offload-arch=gfx1200 sample.hip -o sample.o
To link a HIP program, use this command:
.. code-block:: shell
- clang++ --hip-link --offload-arch=gfx906 sample.o -o sample
+ clang++ --hip-link --offload-arch=gfx1200 sample.o -o sample
In the above command, the ``--hip-link`` flag instructs Clang to link the HIP runtime library. However,
the use of this flag is unnecessary if a HIP input file is already present in your program.
@@ -75,9 +75,9 @@ For convenience, Clang also supports compiling and linking in a single step:
.. code-block:: shell
- clang++ --offload-arch=gfx906 -xhip sample.cpp -o sample
+ clang++ --offload-arch=gfx1200 -xhip sample.cpp -o sample
-In the above commands, ``gfx906`` is the GPU architecture that the code is being compiled for. The supported GPU
+In the above commands, ``gfx1200`` is the GPU architecture that the code is being compiled for. The supported GPU
architectures can be found in the `AMDGPU Processor Table <https://llvm.org/docs/AMDGPUUsage.html#processors>`_.
Alternatively, you can use the ``amdgpu-arch`` tool that comes with Clang to list the GPU architecture on your system:
@@ -412,6 +412,73 @@ Example Usage
__host__ __device__ int Four(void) __attribute__((weak, alias("_Z6__Fourv")));
__host__ __device__ float Four(float f) __attribute__((weak, alias("_Z6__Fourf")));
+Profile Guided Optimization (PGO)
+=================================
+
+Clang supports Profile Guided Optimization (PGO) for HIP, enabling optimization
+of both host and device code based on runtime execution profiles.
+
+Workflow
+--------
+
+The PGO workflow consists of four phases:
+
+1. **Instrumented Build**: Compile with ``-fprofile-generate`` to create an
+ instrumented binary that collects execution profiles:
+
+ .. code-block:: shell
+
+ clang++ -O2 -fprofile-generate --offload-arch=gfx1200 -xhip app.hip -o app_instrumented
+
+2. **Profile Collection**: Run the instrumented binary with representative
+ workloads. This generates separate profile files for host and each GPU
+ architecture:
+
+ .. code-block:: shell
+
+ ./app_instrumented
+ # Creates: default_<id>.profraw (host)
+ # default_<id>.<arch>.<tu>.profraw (device, e.g. default_12345.gfx1200.0.profraw)
+
+3. **Merge Profiles**: Use ``llvm-profdata`` to merge the raw profiles:
+
+ .. code-block:: shell
+
+ # Merge host profiles
+ llvm-profdata merge -o app.profdata default_*_0.profraw
+
+ # Merge device profiles (use the GPU arch name from your target)
+ llvm-profdata merge -o app.device.profdata \
+ default_*.gfx1200.*.profraw
+
+4. **Optimized Build**: Rebuild with ``-fprofile-use``, specifying separate
+ profile files for host and device using ``-Xarch_host`` and ``-Xarch_device``:
+
+ .. code-block:: shell
+
+ clang++ -O2 --offload-arch=gfx1200 -xhip app.hip -o app_optimized \
+ -Xarch_host -fprofile-use=app.profdata \
+ -Xarch_device -fprofile-use=app.device.profdata
+
+Debug Output
+------------
+
+Set ``LLVM_PROFILE_VERBOSE=1`` to see diagnostic messages during profile
+collection:
+
+.. code-block:: shell
+
+ LLVM_PROFILE_VERBOSE=1 ./app_instrumented
+
+This shows information about profile data registration, device memory
+operations, and profile file creation.
+
+Limitations
+-----------
+
+- Device PGO is supported only on AMD GPUs with HIP.
+- Value profiling is not supported for device code.
+
C++17 Class Template Argument Deduction (CTAD) Support
======================================================
diff --git a/clang/lib/Driver/ToolChains/Clang.cpp b/clang/lib/Driver/ToolChains/Clang.cpp
index 3b852528d92c4..9b1456c88687a 100644
--- a/clang/lib/Driver/ToolChains/Clang.cpp
+++ b/clang/lib/Driver/ToolChains/Clang.cpp
@@ -9429,6 +9429,18 @@ void LinkerWrapper::ConstructJob(Compilation &C, const JobAction &JA,
CmdArgs.push_back(Args.MakeArgString(
"--device-linker=" + TC->getTripleString() + "=" + Arg));
+ if (TC->getTriple().isAMDGPU() &&
+ Args.getLastArg(options::OPT_fprofile_generate,
+ options::OPT_fprofile_generate_EQ)) {
+ SmallString<128> ProfileBCPath(C.getDriver().ResourceDir);
+ llvm::sys::path::append(ProfileBCPath, "amdgcn", "bitcode",
+ "profile.bc");
+ if (C.getDriver().getVFS().exists(ProfileBCPath))
+ CmdArgs.push_back(Args.MakeArgString(
+ "--device-linker=" + TC->getTripleString() + "=" +
+ ProfileBCPath));
+ }
+
// Forward the LTO mode relying on the Driver's parsing.
if (C.getDriver().getOffloadLTOMode() == LTOK_Full)
CmdArgs.push_back(Args.MakeArgString(
diff --git a/clang/lib/Driver/ToolChains/HIPAMD.cpp b/clang/lib/Driver/ToolChains/HIPAMD.cpp
index 5b1bc6d8b6fd7..2de02ddcf90e8 100644
--- a/clang/lib/Driver/ToolChains/HIPAMD.cpp
+++ b/clang/lib/Driver/ToolChains/HIPAMD.cpp
@@ -140,6 +140,14 @@ void AMDGCN::Linker::constructLldCommand(Compilation &C, const JobAction &JA,
AddStaticDeviceLibsLinking(C, *this, JA, Inputs, Args, LldArgs, "amdgcn",
TargetID, /*IsBitCodeSDL=*/true);
+ if (Args.getLastArg(options::OPT_fprofile_generate,
+ options::OPT_fprofile_generate_EQ)) {
+ SmallString<128> ProfileBCPath(TC.getDriver().ResourceDir);
+ llvm::sys::path::append(ProfileBCPath, "amdgcn", "bitcode", "profile.bc");
+ if (llvm::sys::fs::exists(ProfileBCPath))
+ LldArgs.push_back(Args.MakeArgString(ProfileBCPath));
+ }
+
LldArgs.push_back("--no-whole-archive");
const char *Lld = Args.MakeArgString(getToolChain().GetProgramPath("lld"));
diff --git a/compiler-rt/include/profile/InstrProfData.inc b/compiler-rt/include/profile/InstrProfData.inc
index 46d6bb5bd8896..4645b89420501 100644
--- a/compiler-rt/include/profile/InstrProfData.inc
+++ b/compiler-rt/include/profile/InstrProfData.inc
@@ -89,9 +89,17 @@ INSTR_PROF_DATA(IntPtrT, llvm::PointerType::getUnqual(Ctx), Values, \
ValuesPtrExpr)
INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumCounters, \
ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumCounters))
-INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last+1], \
- ConstantArray::get(Int16ArrayTy, Int16ArrayVals)) \
-INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes, \
+INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last + 1],
+ ConstantArray::get(Int16ArrayTy, Int16ArrayVals))
+INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx),
+ NumOffloadProfilingThreads,
+ ConstantInt::get(llvm::Type::getInt16Ty(Ctx),
+ NumOffloadProfilingThreadsVal))
+INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx),
+ OffloadDeviceWaveSize,
+ ConstantInt::get(llvm::Type::getInt16Ty(Ctx),
+ OffloadDeviceWaveSizeVal))
+INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes,
ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumBitmapBytes))
#undef INSTR_PROF_DATA
/* INSTR_PROF_DATA end. */
@@ -324,6 +332,9 @@ INSTR_PROF_SECT_ENTRY(IPSK_data, \
INSTR_PROF_SECT_ENTRY(IPSK_cnts, \
INSTR_PROF_QUOTE(INSTR_PROF_CNTS_COMMON), \
INSTR_PROF_CNTS_COFF, "__DATA,")
+INSTR_PROF_SECT_ENTRY(IPSK_ucnts, \
+ INSTR_PROF_QUOTE(INSTR_PROF_UCNTS_COMMON), \
+ INSTR_PROF_UCNTS_COFF, "__DATA,")
INSTR_PROF_SECT_ENTRY(IPSK_bitmap, \
INSTR_PROF_QUOTE(INSTR_PROF_BITS_COMMON), \
INSTR_PROF_BITS_COFF, "__DATA,")
@@ -720,9 +731,9 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
(uint64_t)'f' << 16 | (uint64_t)'R' << 8 | (uint64_t)129
/* Raw profile format version (start from 1). */
-#define INSTR_PROF_RAW_VERSION 10
+#define INSTR_PROF_RAW_VERSION 11
/* Indexed profile format version (start from 1). */
-#define INSTR_PROF_INDEX_VERSION 13
+#define INSTR_PROF_INDEX_VERSION 14
/* Coverage mapping format version (start from 0). */
#define INSTR_PROF_COVMAP_VERSION 6
@@ -767,6 +778,7 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
#define INSTR_PROF_NAME_COMMON __llvm_prf_names
#define INSTR_PROF_VNAME_COMMON __llvm_prf_vns
#define INSTR_PROF_CNTS_COMMON __llvm_prf_cnts
+#define INSTR_PROF_UCNTS_COMMON __llvm_prf_ucnts
#define INSTR_PROF_BITS_COMMON __llvm_prf_bits
#define INSTR_PROF_VALS_COMMON __llvm_prf_vals
#define INSTR_PROF_VNODES_COMMON __llvm_prf_vnds
@@ -784,6 +796,7 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
#define INSTR_PROF_NAME_COFF ".lprfn$M"
#define INSTR_PROF_VNAME_COFF ".lprfvn$M"
#define INSTR_PROF_CNTS_COFF ".lprfc$M"
+#define INSTR_PROF_UCNTS_COFF ".lprfuc$M"
#define INSTR_PROF_BITS_COFF ".lprfb$M"
#define INSTR_PROF_VALS_COFF ".lprfv$M"
#define INSTR_PROF_VNODES_COFF ".lprfnd$M"
diff --git a/compiler-rt/lib/profile/CMakeLists.txt b/compiler-rt/lib/profile/CMakeLists.txt
index 4cc2610cec870..b850ae6a2d094 100644
--- a/compiler-rt/lib/profile/CMakeLists.txt
+++ b/compiler-rt/lib/profile/CMakeLists.txt
@@ -63,6 +63,7 @@ set(PROFILE_SOURCES
InstrProfiling.c
InstrProfilingInternal.c
InstrProfilingBuffer.c
+ InstrProfilingGPU.c
InstrProfilingMerge.c
InstrProfilingMergeFile.c
InstrProfilingNameVar.c
@@ -73,6 +74,7 @@ set(PROFILE_SOURCES
InstrProfilingPlatformFuchsia.c
InstrProfilingPlatformLinux.c
InstrProfilingPlatformOther.c
+ InstrProfilingPlatformROCm.c
InstrProfilingPlatformWindows.c
)
@@ -201,3 +203,32 @@ else()
ADDITIONAL_HEADERS ${PROFILE_HEADERS}
PARENT_TARGET profile)
endif()
+
+# Build GPU profile bitcode when compiler-rt itself is not being built for a
+# GPU target. This compiles InstrProfilingGPU.c to bitcode using the just-built
+# clang and installs it alongside device-libs for the driver to link.
+if(NOT "${COMPILER_RT_DEFAULT_TARGET_ARCH}" MATCHES "amdgcn|nvptx")
+ set(_GPU_PROFILE_SRC "${CMAKE_CURRENT_SOURCE_DIR}/InstrProfilingGPU.c")
+ if(TARGET clang AND EXISTS "${_GPU_PROFILE_SRC}")
+ set(_GPU_PROFILE_BC "${CMAKE_CURRENT_BINARY_DIR}/InstrProfilingGPU.bc")
+ set(_GPU_PROFILE_INSTALL_DIR "lib/clang/${LLVM_VERSION_MAJOR}/amdgcn/bitcode")
+ add_custom_command(
+ OUTPUT "${_GPU_PROFILE_BC}"
+ COMMAND $<TARGET_FILE:clang>
+ --target=amdgcn-amd-amdhsa
+ -nogpulib -ffreestanding -flto
+ -Xclang -mcode-object-version=none
+ -O2 -emit-llvm -c
+ "${_GPU_PROFILE_SRC}"
+ -o "${_GPU_PROFILE_BC}"
+ DEPENDS clang "${_GPU_PROFILE_SRC}"
+ COMMENT "Building GPU profile bitcode"
+ )
+ add_custom_target(gpu-profile-bc DEPENDS "${_GPU_PROFILE_BC}")
+ add_dependencies(profile gpu-profile-bc)
+ install(FILES "${_GPU_PROFILE_BC}"
+ DESTINATION "${_GPU_PROFILE_INSTALL_DIR}"
+ RENAME "profile.bc"
+ COMPONENT gpu-profile-bc)
+ endif()
+endif()
diff --git a/compiler-rt/lib/profile/InstrProfiling.h b/compiler-rt/lib/profile/InstrProfiling.h
index 187ef55ef3784..453b57241a4e0 100644
--- a/compiler-rt/lib/profile/InstrProfiling.h
+++ b/compiler-rt/lib/profile/InstrProfiling.h
@@ -308,14 +308,28 @@ void __llvm_profile_set_dumped(void);
/*!
* \brief Write custom target-specific profiling data to a separate file.
- * Used by offload PGO.
+ * Used by offload PGO (HIP and OpenMP).
+ *
+ * \param Target Target triple (e.g., "amdgcn-amd-amdhsa")
+ * \param TUSuffix TU index suffix (e.g., "0", "1") or NULL for no suffix
+ * \param DataBegin Start of profile data records
+ * \param DataEnd End of profile data records
+ * \param CountersBegin Start of counter data
+ * \param CountersEnd End of counter data
+ * \param UniformCountersBegin Start of uniform counters (NULL if not used)
+ * \param UniformCountersEnd End of uniform counters (NULL if not used)
+ * \param NamesBegin Start of names data
+ * \param NamesEnd End of names data
+ * \param VersionOverride Profile version override (NULL to use default)
*/
-int __llvm_write_custom_profile(const char *Target,
+int __llvm_write_custom_profile(const char *Target, const char *TUSuffix,
const __llvm_profile_data *DataBegin,
const __llvm_profile_data *DataEnd,
const char *CountersBegin,
- const char *CountersEnd, const char *NamesBegin,
- const char *NamesEnd,
+ const char *CountersEnd,
+ const char *UniformCountersBegin,
+ const char *UniformCountersEnd,
+ const char *NamesBegin, const char *NamesEnd,
const uint64_t *VersionOverride);
/*!
@@ -349,4 +363,24 @@ extern char INSTR_PROF_PROFILE_NAME_VAR[1]; /* __llvm_profile_filename. */
const __llvm_gcov_init_func_struct *__llvm_profile_begin_covinit();
const __llvm_gcov_init_func_struct *__llvm_profile_end_covinit();
+
+/* A struct to hold the device pointers and sizes for the profile sections. */
+typedef struct OffloadProfileSectionInfo {
+ void *CountersBegin;
+ size_t CountersSize;
+ void *DataBegin;
+ size_t DataSize;
+ void *NamesBegin;
+ size_t NamesSize;
+} OffloadProfileSectionInfo;
+
+/*!
+ * \brief Register an offload module's device-side profile data sections.
+ *
+ * This function is called by the host-side instrumentation code to provide
+ * the runtime with the necessary information to collect profile data from
+ * the device.
+ */
+void __llvm_profile_offload_register_module(OffloadProfileSectionInfo *Info);
+
#endif /* PROFILE_INSTRPROFILING_H_ */
diff --git a/compiler-rt/lib/profile/InstrProfilingFile.c b/compiler-rt/lib/profile/InstrProfilingFile.c
index 71127b05aafb8..45262d7808982 100644
--- a/compiler-rt/lib/profile/InstrProfilingFile.c
+++ b/compiler-rt/lib/profile/InstrProfilingFile.c
@@ -1198,6 +1198,8 @@ int __llvm_profile_write_file(void) {
if (rc)
PROF_ERR("Failed to write file \"%s\": %s\n", Filename, strerror(errno));
+ __llvm_profile_hip_collect_device_data();
+
// Restore SIGKILL.
if (PDeathSig == 1)
lprofRestoreSigKill();
@@ -1282,14 +1284,16 @@ COMPILER_RT_VISIBILITY int __llvm_profile_set_file_object(FILE *File,
}
#ifndef __APPLE__
-int __llvm_write_custom_profile(const char *Target,
+int __llvm_write_custom_profile(const char *Target, const char *TUSuffix,
const __llvm_profile_data *DataBegin,
const __llvm_profile_data *DataEnd,
const char *CountersBegin,
- const char *CountersEnd, const char *NamesBegin,
- const char *NamesEnd,
+ const char *CountersEnd,
+ const char *UniformCountersBegin,
+ const char *UniformCountersEnd,
+ const char *NamesBegin, const char *NamesEnd,
const uint64_t *VersionOverride) {
- int ReturnValue = 0, FilenameLength, TargetLength;
+ int ReturnValue = 0, FilenameLength, TargetLength, TUSuffixLength;
char *FilenameBuf, *TargetFilename;
const char *Filename;
@@ -1307,7 +1311,8 @@ int __llvm_write_custom_profile(const char *Target,
}
/* Check if there is llvm/runtime version mismatch. */
- if (GET_VERSION(__llvm_profile_get_version()) != INSTR_PROF_RAW_VERSION) {
+ if (VersionOverride == NULL &&
+ GET_VERSION(__llvm_profile_get_version()) != INSTR_PROF_RAW_VERSION) {
PROF_ERR("Runtime and instrumentation version mismatch : "
"expected %d, but get %d\n",
INSTR_PROF_RAW_VERSION,
@@ -1331,9 +1336,12 @@ int __llvm_write_custom_profile(const char *Target,
}
/* Allocate new space for our target-specific PGO filename */
+ /* Format: <dir>/<basename_without_ext>.<target>.<TUSuffix>.<ext> */
+ /* This matches the HIP convention for backward compatibility */
TargetLength = strlen(Target);
- TargetFilename =
- (char *)COMPILER_RT_ALLOCA(FilenameLength + TargetLength + 2);
+ TUSuffixLength = TUSuffix ? strlen(TUSuffix) : 0;
+ TargetFilename = (char *)COMPILER_RT_ALLOCA(FilenameLength + TargetLength +
+ TUSuffixLength + 3);
/* Find file basename and path sizes */
int32_t DirEnd = FilenameLength - 1;
@@ -1342,15 +1350,34 @@ int __llvm_write_custom_profile(const char *Target,
}
uint32_t DirSize = DirEnd + 1, BaseSize = FilenameLength - DirSize;
- /* Prepend "TARGET." to current filename */
+ /* Find extension within basename */
+ const char *Basename = Filename + DirSize;
+ const char *Extension = strrchr(Basename, '.');
+ uint32_t BasenameNoExtSize =
+ Extension ? (uint32_t)(Extension - Basename) : BaseSize;
+ uint32_t ExtSize = Extension ? (uint32_t)(BaseSize - BasenameNoExtSize) : 0;
+
+ /* Build filename: <dir>/<basename_without_ext>.<target>.<TUSuffix>.<ext> */
+ char *p = TargetFilename;
if (DirSize > 0) {
- memcpy(TargetFilename, Filename, DirSize);
+ memcpy(p, Filename, DirSize);
+ p += DirSize;
+ }
+ memcpy(p, Basename, BasenameNoExtSize);
+ p += BasenameNoExtSize;
+ *p++ = '.';
+ memcpy(p, Target, TargetLength);
+ p += TargetLength;
+ if (TUSuffixLength > 0) {
+ *p++ = '.';
+ memcpy(p, TUSuffix, TUSuffixLength);
+ p += TUSuffixLength;
}
- memcpy(TargetFilename + DirSize, Target, TargetLength);
- TargetFilename[TargetLength + DirSize] = '.';
- memcpy(TargetFilename + DirSize + 1 + TargetLength, Filename + DirSize,
- BaseSize);
- TargetFilename[FilenameLength + 1 + TargetLength] = 0;
+ if (ExtSize > 0) {
+ memcpy(p, Extension, ExtSize);
+ p += ExtSize;
+ }
+ *p = '\0';
/* Open and truncate target-specific PGO file */
FILE *OutputFile = fopen(TargetFilename, "w");
@@ -1381,6 +1408,47 @@ int __llvm_write_custom_profile(const char *Target,
NULL, NULL, NULL, NamesBegin, NamesEnd, 0, Version);
closeFileObject(OutputFile);
+ /* Write uniform counters to a separate file if provided */
+ if (ReturnValue == 0 && UniformCountersBegin && UniformCountersEnd &&
+ UniformCountersEnd > UniformCountersBegin) {
+ size_t UniformCountersSize = UniformCountersEnd - UniformCountersBegin;
+
+ /* Create uniform counters filename by replacing extension with .unifcnts */
+ size_t TargetFilenameLen = strlen(TargetFilename);
+ char *UniformFilename = (char *)COMPILER_RT_ALLOCA(TargetFilenameLen + 10);
+ strcpy(UniformFilename, TargetFilename);
+
+ /* Find and replace extension */
+ char *ext = strrchr(UniformFilename, '.');
+ if (ext) {
+ strcpy(ext, ".unifcnts");
+ } else {
+ strcat(UniformFilename, ".unifcnts");
+ }
+
+ FILE *UniformFile = fopen(UniformFilename, "wb");
+ if (UniformFile) {
+ /* Write a simple header: magic, version, num_counters, counters_size */
+ uint64_t UniformMagic = 0x55434E5450524F46ULL; /* "UCNTPROF" in ASCII */
+ uint64_t UniformVersion = 1;
+ uint64_t NumUniformCounters = UniformCountersSize / sizeof(uint64_t);
+
+ if (fwrite(&UniformMagic, sizeof(uint64_t), 1, UniformFile) != 1 ||
+ fwrite(&UniformVersion, sizeof(uint64_t), 1, UniformFile) != 1 ||
+ fwrite(&NumUniformCounters, sizeof(uint64_t), 1, UniformFile) != 1 ||
+ fwrite(&UniformCountersSize, sizeof(uint64_t), 1, UniformFile) != 1 ||
+ fwrite(UniformCountersBegin, 1, UniformCountersSize, UniformFile) !=
+ UniformCountersSize) {
+ PROF_WARN("Failed to write uniform counters to %s\n", UniformFilename);
+ ReturnValue = -1;
+ }
+ fclose(UniformFile);
+ } else {
+ PROF_WARN("Failed to open %s for writing uniform counters\n",
+ UniformFilename);
+ }
+ }
+
// Restore SIGKILL.
if (PDeathSig == 1)
lprofRestoreSigKill();
diff --git a/compiler-rt/lib/profile/InstrProfilingGPU.c b/compiler-rt/lib/profile/InstrProfilingGPU.c
new file mode 100644
index 0000000000000..328eb0ba089bf
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingGPU.c
@@ -0,0 +1,69 @@
+/*===- InstrProfilingGPU.c - GPU profile counter functions ----------------===*\
+|*
+|* Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+|* See https://llvm.org/LICENSE.txt for license information.
+|* SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+|*
+\*===----------------------------------------------------------------------===*/
+
+#if defined(__AMDGPU__) || defined(__NVPTX__)
+
+#include <gpuintrin.h>
+#include <stdint.h>
+
+#define ATOMIC_ADD(ptr, val) \
+ __scoped_atomic_fetch_add(ptr, val, __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE)
+
+/*
+ * Check if this block is sampled (PatternOverflow mode).
+ * Samples by matching lower bits of flat block ID to 0.
+ *
+ * sampling_bits: 0 = all blocks (100%)
+ * 1 = even blocks (50%)
+ * 2 = every 4th block (25%)
+ * 3 = every 8th block (12.5%)
+ */
+__attribute__((visibility("hidden"), used))
+int __gpu_pgo_is_sampled(uint32_t sampling_bits) {
+ if (sampling_bits == 0)
+ return 1;
+
+ uint32_t gdx = __gpu_num_blocks_x();
+ uint32_t gdy = __gpu_num_blocks_y();
+ uint32_t block_id = __gpu_block_id_x() + __gpu_block_id_y() * gdx +
+ __gpu_block_id_z() * gdx * gdy;
+
+ uint32_t mask = (1u << sampling_bits) - 1;
+ return (block_id & mask) == 0;
+}
+
+typedef uint64_t __attribute__((address_space(1))) *global_u64_ptr;
+
+/* Full wave mask: all lanes active */
+#define FULL_WAVE_MASK \
+ ((__gpu_num_lanes() == 64) ? ~0ULL : 0xFFFFFFFFULL)
+
+/*
+ * Per-BB warp-aggregate counter increment using atomic add.
+ * Elects one leader lane per wave, counts active lanes, leader atomically
+ * adds (step * active_lanes). Also updates uniform counter when all lanes
+ * in the wave are active.
+ */
+__attribute__((visibility("hidden"), used))
+void __gpu_pgo_increment(global_u64_ptr counter, global_u64_ptr uniform_counter,
+ int64_t step) {
+ uint64_t lane_mask = __gpu_lane_mask();
+ uint64_t active = __gpu_ballot(lane_mask, 1);
+ if (__gpu_is_first_in_lane(lane_mask)) {
+ int64_t count = (int64_t)__builtin_popcountg(active) * step;
+ ATOMIC_ADD(counter, count);
+ if (uniform_counter && active == FULL_WAVE_MASK)
+ ATOMIC_ADD(uniform_counter, count);
+ }
+}
+
+#if defined(__AMDGPU__)
+__attribute__((weak)) const int __oclc_ABI_version = 600;
+#endif
+
+#endif /* __AMDGPU__ || __NVPTX__ */
diff --git a/compiler-rt/lib/profile/InstrProfilingInternal.h b/compiler-rt/lib/profile/InstrProfilingInternal.h
index 5647782527eb7..be6d2627dd100 100644
--- a/compiler-rt/lib/profile/InstrProfilingInternal.h
+++ b/compiler-rt/lib/profile/InstrProfilingInternal.h
@@ -212,5 +212,12 @@ int __llvm_write_binary_ids(ProfDataWriter *Writer);
int lprofWriteOneBinaryId(ProfDataWriter *Writer, uint64_t BinaryIdLen,
const uint8_t *BinaryIdData,
uint64_t BinaryIdPadding);
+#ifdef __cplusplus
+extern "C" {
+#endif
+COMPILER_RT_VISIBILITY int __llvm_profile_hip_collect_device_data(void);
+#ifdef __cplusplus
+}
+#endif
#endif
diff --git a/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c
new file mode 100644
index 0000000000000..e0356f3c03596
--- /dev/null
+++ b/compiler-rt/lib/profile/InstrProfilingPlatformROCm.c
@@ -0,0 +1,805 @@
+//===- InstrProfilingPlatformROCm.c - Profile data ROCm platform ---------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include "InstrProfiling.h"
+#include "InstrProfilingInternal.h"
+#include "InstrProfilingPort.h"
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+/* -------------------------------------------------------------------------- */
+/* Platform abstraction for dynamic library loading */
+/* -------------------------------------------------------------------------- */
+
+#ifdef _WIN32
+#define WIN32_LEAN_AND_MEAN
+#include <windows.h>
+
+typedef HMODULE DylibHandle;
+
+static DylibHandle DylibOpen(const char *Name) { return LoadLibraryA(Name); }
+
+static void *DylibSym(DylibHandle H, const char *Sym) {
+ return (void *)(uintptr_t)GetProcAddress(H, Sym);
+}
+
+static int DylibAvailable(void) { return 1; }
+
+#else /* POSIX */
+#include <dlfcn.h>
+
+/* Use weak references for dl* functions to avoid requiring -ldl at link time.
+ *
+ * The profile runtime is a static library, so its dependencies must be
+ * explicitly linked by the user. Unlike sanitizer runtimes (which are often
+ * shared libraries with their own dependencies), adding -ldl globally would
+ * affect all profiling users, including those not using HIP/ROCm.
+ *
+ * With weak references:
+ * - Programs without -ldl link successfully (dl* resolve to NULL)
+ * - HIP programs get -ldl from the HIP runtime, so dl* work normally
+ * - OpenMP offload programs without HIP gracefully skip device profiling
+ */
+#pragma weak dlopen
+#pragma weak dlsym
+#pragma weak dlerror
+
+typedef void *DylibHandle;
+
+static DylibHandle DylibOpen(const char *Name) {
+ return dlopen(Name, RTLD_LAZY | RTLD_LOCAL);
+}
+
+static void *DylibSym(DylibHandle H, const char *Sym) { return dlsym(H, Sym); }
+
+static int DylibAvailable(void) { return dlopen != NULL; }
+
+#endif /* _WIN32 */
+
+static int ProcessDeviceOffloadPrf(void *DeviceOffloadPrf, int TUIndex,
+ const char *Target);
+
+static int IsVerboseMode(void) {
+ static int IsVerbose = -1;
+ if (IsVerbose == -1)
+ IsVerbose = getenv("LLVM_PROFILE_VERBOSE") != NULL;
+ return IsVerbose;
+}
+
+/* -------------------------------------------------------------------------- */
+/* Dynamic loading of HIP runtime symbols */
+/* -------------------------------------------------------------------------- */
+
+typedef int (*hipGetSymbolAddressTy)(void **, const void *);
+typedef int (*hipMemcpyTy)(void *, void *, size_t, int);
+typedef int (*hipModuleGetGlobalTy)(void **, size_t *, void *, const char *);
+typedef int (*hipGetDeviceCountTy)(int *);
+typedef int (*hipGetDeviceTy)(int *);
+typedef int (*hipSetDeviceTy)(int);
+
+/* hipDeviceProp_t layout for HIP 6.x+ (R0600).
+ * We only need gcnArchName at offset 1160. Pad to 4096 to safely
+ * accommodate future struct growth without recompilation. */
+typedef struct {
+ char padding[1160];
+ char gcnArchName[256];
+ char tail_padding[2680];
+} HipDevicePropMinimal;
+typedef int (*hipGetDevicePropertiesTy)(HipDevicePropMinimal *, int);
+
+static hipGetSymbolAddressTy pHipGetSymbolAddress = NULL;
+static hipMemcpyTy pHipMemcpy = NULL;
+static hipModuleGetGlobalTy pHipModuleGetGlobal = NULL;
+static hipGetDeviceCountTy pHipGetDeviceCount = NULL;
+static hipGetDeviceTy pHipGetDevice = NULL;
+static hipSetDeviceTy pHipSetDevice = NULL;
+static hipGetDevicePropertiesTy pHipGetDeviceProperties = NULL;
+
+#define MAX_DEVICES 16
+static int NumDevices = 0;
+static char DeviceArchNames[MAX_DEVICES][256];
+
+/* -------------------------------------------------------------------------- */
+/* Device-to-host copies */
+/* Keep HIP-only to avoid an HSA dependency. */
+/* -------------------------------------------------------------------------- */
+
+static void EnsureHipLoaded(void) {
+ static int Initialized = 0;
+ if (Initialized)
+ return;
+ Initialized = 1;
+
+ if (!DylibAvailable()) {
+ if (IsVerboseMode())
+ PROF_NOTE("%s", "Dynamic library loading not available - "
+ "HIP profiling disabled\n");
+ return;
+ }
+
+#ifdef _WIN32
+ static const char HipLibName[] = "amdhip64.dll";
+#else
+ static const char HipLibName[] = "libamdhip64.so";
+#endif
+
+ DylibHandle Handle = DylibOpen(HipLibName);
+ if (!Handle) {
+#ifndef _WIN32
+ if (dlerror)
+ fprintf(stderr, "compiler-rt: failed to open %s: %s\n", HipLibName,
+ dlerror());
+#endif
+ return;
+ }
+
+ pHipGetSymbolAddress =
+ (hipGetSymbolAddressTy)DylibSym(Handle, "hipGetSymbolAddress");
+ pHipMemcpy = (hipMemcpyTy)DylibSym(Handle, "hipMemcpy");
+ pHipModuleGetGlobal =
+ (hipModuleGetGlobalTy)DylibSym(Handle, "hipModuleGetGlobal");
+ pHipGetDeviceCount =
+ (hipGetDeviceCountTy)DylibSym(Handle, "hipGetDeviceCount");
+ pHipGetDevice = (hipGetDeviceTy)DylibSym(Handle, "hipGetDevice");
+ pHipSetDevice = (hipSetDeviceTy)DylibSym(Handle, "hipSetDevice");
+ pHipGetDeviceProperties =
+ (hipGetDevicePropertiesTy)DylibSym(Handle, "hipGetDevicePropertiesR0600");
+ if (!pHipGetDeviceProperties)
+ pHipGetDeviceProperties =
+ (hipGetDevicePropertiesTy)DylibSym(Handle, "hipGetDeviceProperties");
+
+ if (pHipGetDeviceCount && pHipGetDeviceProperties) {
+ int Count = 0;
+ if (pHipGetDeviceCount(&Count) == 0) {
+ if (Count > MAX_DEVICES)
+ Count = MAX_DEVICES;
+ HipDevicePropMinimal Prop;
+ for (int i = 0; i < Count; ++i) {
+ memset(&Prop, 0, sizeof(Prop));
+ if (pHipGetDeviceProperties(&Prop, i) == 0) {
+ strncpy(DeviceArchNames[i], Prop.gcnArchName,
+ sizeof(DeviceArchNames[i]) - 1);
+ DeviceArchNames[i][sizeof(DeviceArchNames[i]) - 1] = '\0';
+ if (IsVerboseMode())
+ PROF_NOTE("Device %d arch: %s\n", i, DeviceArchNames[i]);
+ }
+ }
+ NumDevices = Count;
+ }
+ }
+}
+
+/* -------------------------------------------------------------------------- */
+/* Public wrappers that forward to the loaded HIP symbols */
+/* -------------------------------------------------------------------------- */
+
+static int hipGetSymbolAddress(void **devPtr, const void *symbol) {
+ EnsureHipLoaded();
+ return pHipGetSymbolAddress ? pHipGetSymbolAddress(devPtr, symbol) : -1;
+}
+
+static int hipMemcpy(void *dest, void *src, size_t len, int kind /*2=DToH*/) {
+ EnsureHipLoaded();
+ return pHipMemcpy ? pHipMemcpy(dest, src, len, kind) : -1;
+}
+
+/* Copy from device to host using HIP.
+ * This requires that the device section symbols are registered with CLR,
+ * otherwise hipMemcpy may attempt a CPU path and crash. */
+static int memcpyDeviceToHost(void *Dst, void *Src, size_t Size) {
+ return hipMemcpy(Dst, Src, Size, 2 /* DToH */);
+}
+
+static int hipModuleGetGlobal(void **DevPtr, size_t *Bytes, void *Module,
+ const char *Name) {
+ EnsureHipLoaded();
+ return pHipModuleGetGlobal ? pHipModuleGetGlobal(DevPtr, Bytes, Module, Name)
+ : -1;
+}
+
+static int hipGetDevice(int *DeviceId) {
+ EnsureHipLoaded();
+ return pHipGetDevice ? pHipGetDevice(DeviceId) : -1;
+}
+
+static int hipSetDevice(int DeviceId) {
+ EnsureHipLoaded();
+ return pHipSetDevice ? pHipSetDevice(DeviceId) : -1;
+}
+
+static const char *getDeviceArchName(int DeviceId) {
+ if (DeviceId < 0 || DeviceId >= NumDevices || !DeviceArchNames[DeviceId][0])
+ return "amdgpu";
+ return DeviceArchNames[DeviceId];
+}
+
+/* -------------------------------------------------------------------------- */
+/* Dynamic module tracking */
+/* -------------------------------------------------------------------------- */
+
+/* Per-TU profile entry inside a dynamic module.
+ * A single dynamic module may contain multiple TUs (e.g. -fgpu-rdc). */
+typedef struct {
+ void *DeviceVar; /* device address of __llvm_offload_prf_<CUID> */
+ int Processed; /* 0 = not yet collected, 1 = data already copied */
+} OffloadDynamicTUInfo;
+
+/* One entry per hipModuleLoad call. */
+typedef struct {
+ void *ModulePtr; /* hipModule_t handle */
+ OffloadDynamicTUInfo *TUs; /* array of per-TU entries */
+ int NumTUs;
+ int CapTUs;
+} OffloadDynamicModuleInfo;
+
+static OffloadDynamicModuleInfo *DynamicModules = NULL;
+static int NumDynamicModules = 0;
+static int CapDynamicModules = 0;
+
+/* -------------------------------------------------------------------------- */
+/* ELF symbol enumeration (Linux only) */
+/* */
+/* AMDGPU code objects are always ELF, but <elf.h> is a Linux system header. */
+/* Dynamic module profiling (hipModuleLoadData) is currently Linux-only. */
+/* -------------------------------------------------------------------------- */
+
+#if defined(__linux__)
+#include <elf.h>
+
+/* Callback invoked for every matching symbol name found in the ELF image.
+ * Return 0 to continue iteration, non-zero to stop. */
+typedef int (*SymbolCallback)(const char *Name, void *UserData);
+
+/* If Image is a clang offload bundle (__CLANG_OFFLOAD_BUNDLE__), find the
+ * first embedded code object that is a valid ELF and return a pointer to it.
+ * Otherwise return Image unchanged. Returns NULL if no ELF is found. */
+static const void *UnwrapOffloadBundle(const void *Image) {
+ static const char BundleMagic[] = "__CLANG_OFFLOAD_BUNDLE__";
+ if (memcmp(Image, BundleMagic, 24) != 0)
+ return Image; /* Not a bundle, return as-is. */
+
+ const char *Buf = (const char *)Image;
+ uint64_t NumEntries;
+ memcpy(&NumEntries, Buf + 24, sizeof(uint64_t));
+
+ /* Walk the entry table (starts at offset 32). */
+ const char *Cursor = Buf + 32;
+ for (uint64_t I = 0; I < NumEntries; ++I) {
+ uint64_t EntryOffset, EntrySize, IDSize;
+ memcpy(&EntryOffset, Cursor, 8);
+ Cursor += 8;
+ memcpy(&EntrySize, Cursor, 8);
+ Cursor += 8;
+ memcpy(&IDSize, Cursor, 8);
+ Cursor += 8;
+ /* Skip the entry ID string. */
+ Cursor += IDSize;
+
+ /* Check if this entry contains an ELF. */
+ if (EntrySize >= sizeof(Elf64_Ehdr)) {
+ const Elf64_Ehdr *E = (const Elf64_Ehdr *)(Buf + EntryOffset);
+ if (E->e_ident[EI_MAG0] == ELFMAG0 && E->e_ident[EI_MAG1] == ELFMAG1 &&
+ E->e_ident[EI_MAG2] == ELFMAG2 && E->e_ident[EI_MAG3] == ELFMAG3) {
+ if (IsVerboseMode())
+ PROF_NOTE("Unwrapped offload bundle: entry %lu at offset %lu "
+ "(size %lu)\n",
+ (unsigned long)I, (unsigned long)EntryOffset,
+ (unsigned long)EntrySize);
+ return (const void *)(Buf + EntryOffset);
+ }
+ }
+ }
+
+ PROF_WARN("%s", "Offload bundle contains no valid ELF entries\n");
+ return NULL;
+}
+
+/* Parse an AMDGPU code-object ELF and invoke CB for every global symbol whose
+ * name starts with Prefix. Image may be NULL (e.g. hipModuleLoad from file),
+ * in which case enumeration is skipped, or a clang offload bundle containing
+ * an ELF, in which case the bundle is unwrapped first. */
+static void EnumerateElfSymbols(const void *Image, const char *Prefix,
+ SymbolCallback CB, void *UserData) {
+ if (!Image)
+ return;
+
+ /* Handle clang offload bundle wrapping. */
+ Image = UnwrapOffloadBundle(Image);
+ if (!Image)
+ return;
+
+ const Elf64_Ehdr *Ehdr = (const Elf64_Ehdr *)Image;
+ if (Ehdr->e_ident[EI_MAG0] != ELFMAG0 || Ehdr->e_ident[EI_MAG1] != ELFMAG1 ||
+ Ehdr->e_ident[EI_MAG2] != ELFMAG2 || Ehdr->e_ident[EI_MAG3] != ELFMAG3) {
+ if (IsVerboseMode())
+ PROF_NOTE("%s", "Image is not a valid ELF, skipping enumeration\n");
+ return;
+ }
+
+ size_t PrefixLen = strlen(Prefix);
+ const char *Base = (const char *)Image;
+ const Elf64_Shdr *Shdrs = (const Elf64_Shdr *)(Base + Ehdr->e_shoff);
+
+ for (int i = 0; i < Ehdr->e_shnum; ++i) {
+ if (Shdrs[i].sh_type != SHT_SYMTAB)
+ continue;
+
+ const Elf64_Sym *Syms = (const Elf64_Sym *)(Base + Shdrs[i].sh_offset);
+ int NumSyms = Shdrs[i].sh_size / sizeof(Elf64_Sym);
+ /* String table is the section referenced by sh_link. */
+ const char *StrTab = Base + Shdrs[Shdrs[i].sh_link].sh_offset;
+
+ for (int j = 0; j < NumSyms; ++j) {
+ if (Syms[j].st_name == 0)
+ continue;
+ const char *Name = StrTab + Syms[j].st_name;
+ if (strncmp(Name, Prefix, PrefixLen) == 0) {
+ if (CB(Name, UserData))
+ return;
+ }
+ }
+ }
+}
+
+/* State passed through the enumeration callback. */
+typedef struct {
+ void *Module; /* hipModule_t */
+ OffloadDynamicModuleInfo *ModInfo;
+} EnumState;
+
+/* Grow the TU array inside a module entry and register one __llvm_offload_prf_*
+ * symbol. Also pre-registers the corresponding per-TU section symbols with CLR
+ * (needed so hipMemcpy can copy from those device addresses later). */
+static int RegisterPrfSymbol(const char *Name, void *UserData) {
+ EnumState *S = (EnumState *)UserData;
+ OffloadDynamicModuleInfo *MI = S->ModInfo;
+
+ /* Look up the profile structure symbol. */
+ void *DeviceVar = NULL;
+ size_t Bytes = 0;
+ if (hipModuleGetGlobal(&DeviceVar, &Bytes, S->Module, Name) != 0) {
+ PROF_WARN("Failed to get symbol %s for module %p\n", Name, S->Module);
+ return 0; /* continue */
+ }
+
+ if (IsVerboseMode())
+ PROF_NOTE("Module %p: found %s -> %p (%zu bytes)\n", S->Module, Name,
+ DeviceVar, Bytes);
+
+ /* Grow TU array if needed. */
+ if (MI->NumTUs >= MI->CapTUs) {
+ int NewCap = MI->CapTUs ? MI->CapTUs * 2 : 4;
+ OffloadDynamicTUInfo *New = (OffloadDynamicTUInfo *)realloc(
+ MI->TUs, NewCap * sizeof(OffloadDynamicTUInfo));
+ if (!New) {
+ PROF_ERR("%s\n", "Failed to grow TU array");
+ return 0;
+ }
+ MI->TUs = New;
+ MI->CapTUs = NewCap;
+ }
+ OffloadDynamicTUInfo *TU = &MI->TUs[MI->NumTUs++];
+ TU->DeviceVar = DeviceVar;
+ TU->Processed = 0;
+
+ /* Derive the CUID suffix from the symbol name. The name has the form
+ * "__llvm_offload_prf_<CUID>", so the suffix (including underscore) starts
+ * at offset strlen("__llvm_offload_prf"). */
+ const char *Suffix = Name + strlen("__llvm_offload_prf");
+
+ /* Pre-register per-TU section symbols with CLR memory tracking.
+ * The section symbol names use the same CUID suffix:
+ * __llvm_prf_c_<CUID>, __llvm_prf_d_<CUID>,
+ * __profu_all_<CUID>, __llvm_prf_nm_<CUID> */
+ static const char *SectionPrefixes[] = {"__llvm_prf_c", "__llvm_prf_d",
+ "__profu_all", "__llvm_prf_nm"};
+ for (int s = 0; s < 4; ++s) {
+ char SectionName[256];
+ snprintf(SectionName, sizeof(SectionName), "%s%s", SectionPrefixes[s],
+ Suffix);
+ void *Dummy = NULL;
+ size_t DummyBytes = 0;
+ hipModuleGetGlobal(&Dummy, &DummyBytes, S->Module, SectionName);
+ }
+
+ return 0; /* continue enumeration */
+}
+
+#endif /* defined(__linux__) */
+
+/* -------------------------------------------------------------------------- */
+/* Registration / un-registration helpers */
+/* -------------------------------------------------------------------------- */
+
+void __llvm_profile_offload_register_dynamic_module(int ModuleLoadRc,
+ void **Ptr,
+ const void *Image) {
+ if (IsVerboseMode())
+ PROF_NOTE("Registering loaded module %d: rc=%d, module=%p, image=%p\n",
+ NumDynamicModules, ModuleLoadRc, *Ptr, Image);
+
+ if (ModuleLoadRc)
+ return;
+
+ if (NumDynamicModules >= CapDynamicModules) {
+ int NewCap = CapDynamicModules ? CapDynamicModules * 2 : 64;
+ OffloadDynamicModuleInfo *New = (OffloadDynamicModuleInfo *)realloc(
+ DynamicModules, NewCap * sizeof(OffloadDynamicModuleInfo));
+ if (!New) {
+ PROF_ERR("%s\n", "Failed to grow dynamic modules array");
+ return;
+ }
+ DynamicModules = New;
+ CapDynamicModules = NewCap;
+ }
+
+ OffloadDynamicModuleInfo *MI = &DynamicModules[NumDynamicModules++];
+ MI->ModulePtr = *Ptr;
+ MI->TUs = NULL;
+ MI->NumTUs = 0;
+ MI->CapTUs = 0;
+
+ /* Enumerate all __llvm_offload_prf_<CUID> symbols in the ELF image.
+ * For each one, look it up via hipModuleGetGlobal (which also registers
+ * the device address with CLR for later hipMemcpy) and store the entry.
+ *
+ * ELF parsing requires <elf.h> which is Linux-only. On other platforms,
+ * dynamic module profiling is not yet supported. */
+#if defined(__linux__)
+ EnumState State = {*Ptr, MI};
+ EnumerateElfSymbols(Image, "__llvm_offload_prf_", RegisterPrfSymbol, &State);
+#else
+ (void)Image;
+ if (IsVerboseMode())
+ PROF_NOTE("%s",
+ "Dynamic module profiling not supported on this platform\n");
+#endif
+
+ if (MI->NumTUs == 0) {
+ PROF_WARN("No __llvm_offload_prf_* symbols found in module %p\n", *Ptr);
+ } else if (IsVerboseMode()) {
+ PROF_NOTE("Module %p: registered %d TU(s)\n", *Ptr, MI->NumTUs);
+ }
+}
+
+void __llvm_profile_offload_unregister_dynamic_module(void *Ptr) {
+ for (int i = 0; i < NumDynamicModules; ++i) {
+ OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+
+ if (MI->ModulePtr != Ptr)
+ continue;
+
+ if (IsVerboseMode())
+ PROF_NOTE("Unregistering module %p (%d TUs)\n", MI->ModulePtr,
+ MI->NumTUs);
+
+ /* Process every TU in this module. */
+ for (int t = 0; t < MI->NumTUs; ++t) {
+ OffloadDynamicTUInfo *TU = &MI->TUs[t];
+ if (TU->Processed) {
+ if (IsVerboseMode())
+ PROF_NOTE("Module %p TU %d already processed, skipping\n", Ptr, t);
+ continue;
+ }
+      /* Use a globally unique index as TU index for the output filename
+       * (assumes fewer than 1000 TUs per module). */
+      int TUIndex = i * 1000 + t;
+ if (TU->DeviceVar) {
+ int CurDev = 0;
+ hipGetDevice(&CurDev);
+ const char *ArchName = getDeviceArchName(CurDev);
+ if (ProcessDeviceOffloadPrf(TU->DeviceVar, TUIndex, ArchName) == 0)
+ TU->Processed = 1;
+ else
+ PROF_WARN("Failed to process profile data for module %p TU %d\n", Ptr,
+ t);
+ }
+ }
+ return;
+ }
+
+ if (IsVerboseMode())
+ PROF_WARN("Unregister called for unknown module %p\n", Ptr);
+}
+
+static void **OffloadShadowVariables = NULL;
+static int NumShadowVariables = 0;
+static int CapShadowVariables = 0;
+
+void __llvm_profile_offload_register_shadow_variable(void *ptr) {
+ if (NumShadowVariables >= CapShadowVariables) {
+ int NewCap = CapShadowVariables ? CapShadowVariables * 2 : 64;
+ void **New =
+ (void **)realloc(OffloadShadowVariables, NewCap * sizeof(void *));
+ if (!New) {
+ PROF_ERR("%s\n", "Failed to grow shadow variables array");
+ return;
+ }
+ OffloadShadowVariables = New;
+ CapShadowVariables = NewCap;
+ }
+ OffloadShadowVariables[NumShadowVariables++] = ptr;
+}
+
+static void **OffloadSectionShadowVariables = NULL;
+static int NumSectionShadowVariables = 0;
+static int CapSectionShadowVariables = 0;
+
+void __llvm_profile_offload_register_section_shadow_variable(void *ptr) {
+ if (NumSectionShadowVariables >= CapSectionShadowVariables) {
+ int NewCap = CapSectionShadowVariables ? CapSectionShadowVariables * 2 : 64;
+ void **New = (void **)realloc(OffloadSectionShadowVariables,
+ NewCap * sizeof(void *));
+ if (!New) {
+ PROF_ERR("%s\n", "Failed to grow section shadow variables array");
+ return;
+ }
+ OffloadSectionShadowVariables = New;
+ CapSectionShadowVariables = NewCap;
+ }
+ OffloadSectionShadowVariables[NumSectionShadowVariables++] = ptr;
+}
+
+static int ProcessDeviceOffloadPrf(void *DeviceOffloadPrf, int TUIndex,
+ const char *Target) {
+ void *HostOffloadPrf[8];
+
+ if (hipMemcpy(HostOffloadPrf, DeviceOffloadPrf, sizeof(HostOffloadPrf),
+ 2 /*DToH*/) != 0) {
+ PROF_ERR("%s\n", "Failed to copy offload prf structure from device");
+ return -1;
+ }
+
+ void *DevCntsBegin = HostOffloadPrf[0];
+ void *DevDataBegin = HostOffloadPrf[1];
+ void *DevNamesBegin = HostOffloadPrf[2];
+ void *DevUniformCntsBegin = HostOffloadPrf[3];
+ void *DevCntsEnd = HostOffloadPrf[4];
+ void *DevDataEnd = HostOffloadPrf[5];
+ void *DevNamesEnd = HostOffloadPrf[6];
+ void *DevUniformCntsEnd = HostOffloadPrf[7];
+
+ size_t CountersSize = (char *)DevCntsEnd - (char *)DevCntsBegin;
+ size_t DataSize = (char *)DevDataEnd - (char *)DevDataBegin;
+ size_t NamesSize = (char *)DevNamesEnd - (char *)DevNamesBegin;
+ size_t UniformCountersSize =
+ (char *)DevUniformCntsEnd - (char *)DevUniformCntsBegin;
+
+ if (IsVerboseMode())
+ PROF_NOTE("Section sizes: Counters=%zu, Data=%zu, Names=%zu, "
+ "UniformCounters=%zu\n",
+ CountersSize, DataSize, NamesSize, UniformCountersSize);
+
+ if (CountersSize == 0 || DataSize == 0) {
+ if (IsVerboseMode())
+ PROF_NOTE("%s\n", "Counters or Data section has zero size. No profile "
+ "data to collect.");
+ return 0;
+ }
+
+ // Pre-register device section symbols with CLR memory tracking.
+ // This makes the section base pointers (and sub-pointers) safe for hipMemcpy.
+ if (IsVerboseMode())
+ PROF_NOTE("Pre-registering %d section symbols\n",
+ NumSectionShadowVariables);
+ for (int i = 0; i < NumSectionShadowVariables; ++i) {
+ void *DevPtr = NULL;
+ (void)hipGetSymbolAddress(&DevPtr, OffloadSectionShadowVariables[i]);
+ }
+
+ int ret = -1;
+
+ // Allocate host memory for the device sections
+ char *HostCountersBegin = (char *)malloc(CountersSize);
+ char *HostDataBegin = (char *)malloc(DataSize);
+ char *HostNamesBegin = (char *)malloc(NamesSize);
+ char *HostUniformCountersBegin =
+ (UniformCountersSize > 0) ? (char *)malloc(UniformCountersSize) : NULL;
+
+ if (!HostCountersBegin || !HostDataBegin ||
+ (NamesSize > 0 && !HostNamesBegin) ||
+ (UniformCountersSize > 0 && !HostUniformCountersBegin)) {
+ PROF_ERR("%s\n", "Failed to allocate host memory for device sections");
+ goto cleanup;
+ }
+
+ // Copy data from device to host using HIP.
+ if (memcpyDeviceToHost(HostCountersBegin, DevCntsBegin, CountersSize) != 0 ||
+ memcpyDeviceToHost(HostDataBegin, DevDataBegin, DataSize) != 0 ||
+ (NamesSize > 0 &&
+ memcpyDeviceToHost(HostNamesBegin, DevNamesBegin, NamesSize) != 0) ||
+ (UniformCountersSize > 0 &&
+ memcpyDeviceToHost(HostUniformCountersBegin, DevUniformCntsBegin,
+ UniformCountersSize) != 0)) {
+ PROF_ERR("%s\n", "Failed to copy profile sections from device");
+ goto cleanup;
+ }
+
+ if (IsVerboseMode())
+ PROF_NOTE("Copied device sections: Counters=%zu, Data=%zu, Names=%zu, "
+ "UniformCounters=%zu\n",
+ CountersSize, DataSize, NamesSize, UniformCountersSize);
+
+ // Compute padding sizes for proper buffer layout
+ // lprofWriteDataImpl computes CountersDelta = CountersBegin - DataBegin
+ // We need to arrange our buffer so this matches the expected file layout
+ const uint64_t NumData = DataSize / sizeof(__llvm_profile_data);
+ const uint64_t NumBitmapBytes = 0;
+ const uint64_t VTableSectionSize = 0;
+ const uint64_t VNamesSize = 0;
+ uint64_t PaddingBytesBeforeCounters, PaddingBytesAfterCounters,
+ PaddingBytesAfterBitmapBytes, PaddingBytesAfterNames,
+ PaddingBytesAfterVTable, PaddingBytesAfterVNames;
+
+ if (__llvm_profile_get_padding_sizes_for_counters(
+ DataSize, CountersSize, NumBitmapBytes, NamesSize, VTableSectionSize,
+ VNamesSize, &PaddingBytesBeforeCounters, &PaddingBytesAfterCounters,
+ &PaddingBytesAfterBitmapBytes, &PaddingBytesAfterNames,
+ &PaddingBytesAfterVTable, &PaddingBytesAfterVNames) != 0) {
+ PROF_ERR("%s\n", "Failed to get padding sizes");
+ goto cleanup;
+ }
+
+ // Create contiguous buffer with layout: [Data][Padding][Counters][Names]
+ // This ensures CountersBegin - DataBegin = DataSize +
+ // PaddingBytesBeforeCounters
+ size_t ContiguousBufferSize =
+ DataSize + PaddingBytesBeforeCounters + CountersSize + NamesSize;
+ char *ContiguousBuffer = (char *)malloc(ContiguousBufferSize);
+ if (!ContiguousBuffer) {
+ PROF_ERR("%s\n", "Failed to allocate contiguous buffer");
+ goto cleanup;
+ }
+ memset(ContiguousBuffer, 0, ContiguousBufferSize);
+
+ // Set up pointers into the contiguous buffer
+ char *BufDataBegin = ContiguousBuffer;
+ char *BufCountersBegin =
+ ContiguousBuffer + DataSize + PaddingBytesBeforeCounters;
+ char *BufNamesBegin = BufCountersBegin + CountersSize;
+
+ // Copy data into contiguous buffer
+ memcpy(BufDataBegin, HostDataBegin, DataSize);
+ memcpy(BufCountersBegin, HostCountersBegin, CountersSize);
+  if (NamesSize > 0)
+    memcpy(BufNamesBegin, HostNamesBegin, NamesSize);
+
+ // Relocate CounterPtr in data records for file layout
+ // CounterPtr is device-relative offset; we need to adjust for file layout
+ // where Data section comes first, then Counters section
+ __llvm_profile_data *RelocatedData = (__llvm_profile_data *)BufDataBegin;
+ for (uint64_t i = 0; i < NumData; ++i) {
+ if (RelocatedData[i].CounterPtr) {
+ ptrdiff_t DeviceCounterPtrOffset = (ptrdiff_t)RelocatedData[i].CounterPtr;
+ void *DeviceDataStructAddr =
+ (char *)DevDataBegin + (i * sizeof(__llvm_profile_data));
+ void *DeviceCountersAddr =
+ (char *)DeviceDataStructAddr + DeviceCounterPtrOffset;
+ ptrdiff_t OffsetIntoCountersSection =
+ (char *)DeviceCountersAddr - (char *)DevCntsBegin;
+
+ // New offset: from this data record to its counters in file layout
+ // CountersDelta = BufCountersBegin - BufDataBegin = DataSize + Padding
+ // CounterPtr = CountersDelta + OffsetIntoCounters - (i * sizeof)
+ ptrdiff_t NewRelativeOffset = DataSize + PaddingBytesBeforeCounters +
+ OffsetIntoCountersSection -
+ (i * sizeof(__llvm_profile_data));
+ memcpy((char *)RelocatedData + i * sizeof(__llvm_profile_data) +
+ offsetof(__llvm_profile_data, CounterPtr),
+ &NewRelativeOffset, sizeof(NewRelativeOffset));
+ }
+    // Zero out unused fields. BitmapPtr, FunctionPointer, and Values are
+    // laid out consecutively in __llvm_profile_data, so one memset covers
+    // all three.
+ memset((char *)RelocatedData + i * sizeof(__llvm_profile_data) +
+ offsetof(__llvm_profile_data, BitmapPtr),
+ 0,
+ sizeof(RelocatedData[i].BitmapPtr) +
+ sizeof(RelocatedData[i].FunctionPointer) +
+ sizeof(RelocatedData[i].Values));
+ }
+
+ // Build TU suffix string for filename
+ char TUIndexStr[16] = "";
+ if (TUIndex >= 0) {
+ snprintf(TUIndexStr, sizeof(TUIndexStr), "%d", TUIndex);
+ }
+
+ ret = __llvm_write_custom_profile(
+ Target, TUIndex >= 0 ? TUIndexStr : NULL,
+ (__llvm_profile_data *)BufDataBegin,
+ (__llvm_profile_data *)(BufDataBegin + DataSize), BufCountersBegin,
+ BufCountersBegin + CountersSize, HostUniformCountersBegin,
+ HostUniformCountersBegin ? HostUniformCountersBegin + UniformCountersSize
+ : NULL,
+ BufNamesBegin, BufNamesBegin + NamesSize, NULL);
+
+ free(ContiguousBuffer);
+
+ if (ret != 0) {
+ PROF_ERR("%s\n", "Failed to write device profile using shared API");
+ } else if (IsVerboseMode()) {
+ PROF_NOTE("%s\n", "Successfully wrote device profile using shared API");
+ }
+
+cleanup:
+ free(HostCountersBegin);
+ free(HostDataBegin);
+ free(HostNamesBegin);
+ free(HostUniformCountersBegin);
+ return ret;
+}
+
+static int ProcessShadowVariable(void *ShadowVar, int TUIndex,
+ const char *Target) {
+ void *DeviceOffloadPrf = NULL;
+ if (hipGetSymbolAddress(&DeviceOffloadPrf, ShadowVar) != 0) {
+ PROF_WARN("Failed to get symbol address for shadow variable %p\n",
+ ShadowVar);
+ return -1;
+ }
+ return ProcessDeviceOffloadPrf(DeviceOffloadPrf, TUIndex, Target);
+}
+
+/* Check if HIP runtime is available and loaded */
+static int IsHipAvailable(void) {
+ EnsureHipLoaded();
+ return pHipMemcpy != NULL && pHipGetSymbolAddress != NULL;
+}
+
+/* -------------------------------------------------------------------------- */
+/* Collect device-side profile data */
+/* -------------------------------------------------------------------------- */
+
+int __llvm_profile_hip_collect_device_data(void) {
+ if (NumShadowVariables == 0 && NumDynamicModules == 0)
+ return 0;
+
+ if (!IsHipAvailable())
+ return 0;
+
+ int Ret = 0;
+
+ /* Shadow variables (static-linked kernels).
+ * Iterate over all devices to collect profile data from each GPU. */
+ if (NumShadowVariables > 0) {
+ int OrigDevice = -1;
+ hipGetDevice(&OrigDevice);
+
+ for (int Dev = 0; Dev < NumDevices; ++Dev) {
+ if (hipSetDevice(Dev) != 0) {
+ if (IsVerboseMode())
+ PROF_NOTE("Failed to set device %d, skipping\n", Dev);
+ continue;
+ }
+ const char *ArchName = getDeviceArchName(Dev);
+ if (IsVerboseMode())
+ PROF_NOTE("Collecting static profile data from device %d (%s)\n", Dev,
+ ArchName);
+ for (int i = 0; i < NumShadowVariables; ++i) {
+ if (ProcessShadowVariable(OffloadShadowVariables[i], i, ArchName) != 0)
+ Ret = -1;
+ }
+ }
+
+ if (OrigDevice >= 0)
+ hipSetDevice(OrigDevice);
+ }
+
+  /* Dynamically loaded modules: warn about any unprocessed TUs. */
+ for (int i = 0; i < NumDynamicModules; ++i) {
+ OffloadDynamicModuleInfo *MI = &DynamicModules[i];
+ for (int t = 0; t < MI->NumTUs; ++t) {
+ if (!MI->TUs[t].Processed) {
+ PROF_WARN("Dynamic module %p TU %d was not processed before exit\n",
+ MI->ModulePtr, t);
+ Ret = -1;
+ }
+ }
+ }
+
+ return Ret;
+}
diff --git a/llvm/include/llvm/IR/FixedMetadataKinds.def b/llvm/include/llvm/IR/FixedMetadataKinds.def
index 0d79677d7079e..d238c81ecc152 100644
--- a/llvm/include/llvm/IR/FixedMetadataKinds.def
+++ b/llvm/include/llvm/IR/FixedMetadataKinds.def
@@ -60,3 +60,4 @@ LLVM_FIXED_MD_KIND(MD_alloc_token, "alloc_token", 45)
LLVM_FIXED_MD_KIND(MD_implicit_ref, "implicit.ref", 46)
LLVM_FIXED_MD_KIND(MD_nofpclass, "nofpclass", 47)
LLVM_FIXED_MD_KIND(MD_call_target, "call_target", 48)
+LLVM_FIXED_MD_KIND(MD_block_uniformity_profile, "block-uniformity-profile", 49)
diff --git a/llvm/include/llvm/ProfileData/InstrProf.h b/llvm/include/llvm/ProfileData/InstrProf.h
index f742476ac854a..1923b553b82bc 100644
--- a/llvm/include/llvm/ProfileData/InstrProf.h
+++ b/llvm/include/llvm/ProfileData/InstrProf.h
@@ -894,15 +894,31 @@ struct InstrProfValueSiteRecord {
struct InstrProfRecord {
std::vector<uint64_t> Counts;
std::vector<uint8_t> BitmapBytes;
+ /// For AMDGPU offload profiling: 1 bit per basic block indicating whether
+ /// the block is entered via a wave-uniform branch. Set during merge when
+ /// per-slot counters are reduced. If a counter value is a multiple of the
+ /// wave size, the branch is considered wave-uniform.
+ std::vector<uint8_t> UniformityBits;
+ uint16_t NumOffloadProfilingThreads = 0;
+ uint16_t OffloadDeviceWaveSize = 0;
InstrProfRecord() = default;
InstrProfRecord(std::vector<uint64_t> Counts) : Counts(std::move(Counts)) {}
+ InstrProfRecord(std::vector<uint64_t> Counts,
+ uint16_t NumOffloadProfilingThreads,
+ uint16_t OffloadDeviceWaveSize = 0)
+ : Counts(std::move(Counts)),
+ NumOffloadProfilingThreads(NumOffloadProfilingThreads),
+ OffloadDeviceWaveSize(OffloadDeviceWaveSize) {}
InstrProfRecord(std::vector<uint64_t> Counts,
std::vector<uint8_t> BitmapBytes)
: Counts(std::move(Counts)), BitmapBytes(std::move(BitmapBytes)) {}
InstrProfRecord(InstrProfRecord &&) = default;
InstrProfRecord(const InstrProfRecord &RHS)
: Counts(RHS.Counts), BitmapBytes(RHS.BitmapBytes),
+ UniformityBits(RHS.UniformityBits),
+ NumOffloadProfilingThreads(RHS.NumOffloadProfilingThreads),
+ OffloadDeviceWaveSize(RHS.OffloadDeviceWaveSize),
ValueData(RHS.ValueData
? std::make_unique<ValueProfData>(*RHS.ValueData)
: nullptr) {}
@@ -910,6 +926,9 @@ struct InstrProfRecord {
InstrProfRecord &operator=(const InstrProfRecord &RHS) {
Counts = RHS.Counts;
BitmapBytes = RHS.BitmapBytes;
+ UniformityBits = RHS.UniformityBits;
+ NumOffloadProfilingThreads = RHS.NumOffloadProfilingThreads;
+ OffloadDeviceWaveSize = RHS.OffloadDeviceWaveSize;
if (!RHS.ValueData) {
ValueData = nullptr;
return *this;
@@ -921,6 +940,17 @@ struct InstrProfRecord {
return *this;
}
+ /// Check if a basic block is entered via a wave-uniform branch.
+ /// Returns true if uniform (safe for PGO spill optimization) or if no
+ /// uniformity data is available (conservative default).
+ bool isBlockUniform(unsigned BlockIdx) const {
+ if (UniformityBits.empty())
+ return true; // No uniformity data, assume uniform (conservative)
+ if (BlockIdx / 8 >= UniformityBits.size())
+ return true; // Out of range, assume uniform
+ return (UniformityBits[BlockIdx / 8] >> (BlockIdx % 8)) & 1;
+ }
+
/// Return the number of value profile kinds with non-zero number
/// of profile sites.
inline uint32_t getNumValueKinds() const;
@@ -945,8 +975,12 @@ struct InstrProfRecord {
/// Merge the counts in \p Other into this one.
/// Optionally scale merged counts by \p Weight.
+ /// If \p WaveSize is non-zero and Other has offload profiling slots,
+ /// compute uniformity bits based on whether counter values are multiples
+ /// of WaveSize.
LLVM_ABI void merge(InstrProfRecord &Other, uint64_t Weight,
- function_ref<void(instrprof_error)> Warn);
+ function_ref<void(instrprof_error)> Warn,
+ unsigned WaveSize = 0);
/// Scale up profile counts (including value profile data) by
/// a factor of (N / D).
@@ -1066,11 +1100,26 @@ struct NamedInstrProfRecord : InstrProfRecord {
NamedInstrProfRecord(StringRef Name, uint64_t Hash,
std::vector<uint64_t> Counts)
: InstrProfRecord(std::move(Counts)), Name(Name), Hash(Hash) {}
+ NamedInstrProfRecord(StringRef Name, uint64_t Hash,
+ std::vector<uint64_t> Counts,
+ uint16_t NumOffloadProfilingThreads,
+ uint16_t OffloadDeviceWaveSize = 0)
+ : InstrProfRecord(std::move(Counts), NumOffloadProfilingThreads,
+ OffloadDeviceWaveSize),
+ Name(Name), Hash(Hash) {}
NamedInstrProfRecord(StringRef Name, uint64_t Hash,
std::vector<uint64_t> Counts,
std::vector<uint8_t> BitmapBytes)
: InstrProfRecord(std::move(Counts), std::move(BitmapBytes)), Name(Name),
Hash(Hash) {}
+ NamedInstrProfRecord(StringRef Name, uint64_t Hash,
+ std::vector<uint64_t> Counts,
+ std::vector<uint8_t> BitmapBytes,
+ std::vector<uint8_t> UniformityBits)
+ : InstrProfRecord(std::move(Counts), std::move(BitmapBytes)), Name(Name),
+ Hash(Hash) {
+ this->UniformityBits = std::move(UniformityBits);
+ }
static bool hasCSFlagInHash(uint64_t FuncHash) {
return ((FuncHash >> CS_FLAG_IN_FUNC_HASH) & 1);
@@ -1177,7 +1226,9 @@ enum ProfVersion {
Version12 = 12,
// In this version, the frontend PGO stable hash algorithm defaults to V4.
Version13 = 13,
- // The current version is 13.
+ // UniformityBits added for AMDGPU offload profiling divergence detection.
+ Version14 = 14,
+ // The current version is 14.
CurrentVersion = INSTR_PROF_INDEX_VERSION
};
const uint64_t Version = ProfVersion::CurrentVersion;
diff --git a/llvm/include/llvm/ProfileData/InstrProfData.inc b/llvm/include/llvm/ProfileData/InstrProfData.inc
index 46d6bb5bd8896..4645b89420501 100644
--- a/llvm/include/llvm/ProfileData/InstrProfData.inc
+++ b/llvm/include/llvm/ProfileData/InstrProfData.inc
@@ -89,9 +89,17 @@ INSTR_PROF_DATA(IntPtrT, llvm::PointerType::getUnqual(Ctx), Values, \
ValuesPtrExpr)
INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumCounters, \
ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumCounters))
-INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last+1], \
- ConstantArray::get(Int16ArrayTy, Int16ArrayVals)) \
-INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes, \
+INSTR_PROF_DATA(const uint16_t, Int16ArrayTy, NumValueSites[IPVK_Last + 1], \
+                ConstantArray::get(Int16ArrayTy, Int16ArrayVals))
+INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx), \
+                NumOffloadProfilingThreads, \
+                ConstantInt::get(llvm::Type::getInt16Ty(Ctx), \
+                                 NumOffloadProfilingThreadsVal))
+INSTR_PROF_DATA(const uint16_t, llvm::Type::getInt16Ty(Ctx), \
+                OffloadDeviceWaveSize, \
+                ConstantInt::get(llvm::Type::getInt16Ty(Ctx), \
+                                 OffloadDeviceWaveSizeVal))
+INSTR_PROF_DATA(const uint32_t, llvm::Type::getInt32Ty(Ctx), NumBitmapBytes, \
ConstantInt::get(llvm::Type::getInt32Ty(Ctx), NumBitmapBytes))
#undef INSTR_PROF_DATA
/* INSTR_PROF_DATA end. */
@@ -324,6 +332,9 @@ INSTR_PROF_SECT_ENTRY(IPSK_data, \
INSTR_PROF_SECT_ENTRY(IPSK_cnts, \
INSTR_PROF_QUOTE(INSTR_PROF_CNTS_COMMON), \
INSTR_PROF_CNTS_COFF, "__DATA,")
+INSTR_PROF_SECT_ENTRY(IPSK_ucnts, \
+ INSTR_PROF_QUOTE(INSTR_PROF_UCNTS_COMMON), \
+ INSTR_PROF_UCNTS_COFF, "__DATA,")
INSTR_PROF_SECT_ENTRY(IPSK_bitmap, \
INSTR_PROF_QUOTE(INSTR_PROF_BITS_COMMON), \
INSTR_PROF_BITS_COFF, "__DATA,")
@@ -720,9 +731,9 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
(uint64_t)'f' << 16 | (uint64_t)'R' << 8 | (uint64_t)129
/* Raw profile format version (start from 1). */
-#define INSTR_PROF_RAW_VERSION 10
+#define INSTR_PROF_RAW_VERSION 11
/* Indexed profile format version (start from 1). */
-#define INSTR_PROF_INDEX_VERSION 13
+#define INSTR_PROF_INDEX_VERSION 14
/* Coverage mapping format version (start from 0). */
#define INSTR_PROF_COVMAP_VERSION 6
@@ -767,6 +778,7 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
#define INSTR_PROF_NAME_COMMON __llvm_prf_names
#define INSTR_PROF_VNAME_COMMON __llvm_prf_vns
#define INSTR_PROF_CNTS_COMMON __llvm_prf_cnts
+#define INSTR_PROF_UCNTS_COMMON __llvm_prf_ucnts
#define INSTR_PROF_BITS_COMMON __llvm_prf_bits
#define INSTR_PROF_VALS_COMMON __llvm_prf_vals
#define INSTR_PROF_VNODES_COMMON __llvm_prf_vnds
@@ -784,6 +796,7 @@ serializeValueProfDataFrom(ValueProfRecordClosure *Closure,
#define INSTR_PROF_NAME_COFF ".lprfn$M"
#define INSTR_PROF_VNAME_COFF ".lprfvn$M"
#define INSTR_PROF_CNTS_COFF ".lprfc$M"
+#define INSTR_PROF_UCNTS_COFF ".lprfuc$M"
#define INSTR_PROF_BITS_COFF ".lprfb$M"
#define INSTR_PROF_VALS_COFF ".lprfv$M"
#define INSTR_PROF_VNODES_COFF ".lprfnd$M"
diff --git a/llvm/include/llvm/ProfileData/InstrProfWriter.h b/llvm/include/llvm/ProfileData/InstrProfWriter.h
index 1b24425e68a9e..93212c19769e0 100644
--- a/llvm/include/llvm/ProfileData/InstrProfWriter.h
+++ b/llvm/include/llvm/ProfileData/InstrProfWriter.h
@@ -215,6 +215,7 @@ class InstrProfWriter {
MemProfVersionRequested = Version;
}
void setMemProfFullSchema(bool Full) { MemProfFullSchema = Full; }
+
// Compute the overlap b/w this object and Other. Program level result is
// stored in Overlap and function level result is stored in FuncLevelOverlap.
LLVM_ABI void overlapRecord(NamedInstrProfRecord &&Other,
diff --git a/llvm/include/llvm/Transforms/Instrumentation/CFGMST.h b/llvm/include/llvm/Transforms/Instrumentation/CFGMST.h
index 6b93b6cb83b4e..efef78eadd31e 100644
--- a/llvm/include/llvm/Transforms/Instrumentation/CFGMST.h
+++ b/llvm/include/llvm/Transforms/Instrumentation/CFGMST.h
@@ -286,12 +286,30 @@ template <class Edge, class BBInfo> class CFGMST {
if (!Message.str().empty())
OS << Message << "\n";
OS << " Number of Basic Blocks: " << BBInfos.size() << "\n";
- for (auto &BI : BBInfos) {
- const BasicBlock *BB = BI.first;
+ // Collect and sort BBInfos deterministically by their assigned Index.
+ std::vector<std::pair<const BasicBlock *, const BBInfo *>> SortedBBInfos;
+ SortedBBInfos.reserve(BBInfos.size());
+ for (const auto &BI : BBInfos)
+ SortedBBInfos.emplace_back(BI.first, BI.second.get());
+
+ llvm::sort(SortedBBInfos, [](const auto &A, const auto &B) {
+ // Primary key: BBInfo Index
+ if (A.second->Index != B.second->Index)
+ return A.second->Index < B.second->Index;
+ // Secondary key: name string to keep a stable order even if
+ // indices tie (ties shouldn't happen, but this makes ordering
+ // explicit).
+ StringRef NameA = A.first ? A.first->getName() : StringRef("FakeNode");
+ StringRef NameB = B.first ? B.first->getName() : StringRef("FakeNode");
+ return NameA < NameB;
+ });
+
+ for (const auto &P : SortedBBInfos) {
+ const BasicBlock *BB = P.first;
+ const BBInfo *Info = P.second;
OS << " BB: " << (BB == nullptr ? "FakeNode" : BB->getName()) << " "
- << BI.second->infoString() << "\n";
+ << Info->infoString() << "\n";
}
-
OS << " Number of Edges: " << AllEdges.size()
<< " (*: Instrument, C: CriticalEdge, -: Removed)\n";
uint32_t Count = 0;
diff --git a/llvm/lib/Passes/StandardInstrumentations.cpp b/llvm/lib/Passes/StandardInstrumentations.cpp
index 19e72a8612c4a..7b9cd55c62f09 100644
--- a/llvm/lib/Passes/StandardInstrumentations.cpp
+++ b/llvm/lib/Passes/StandardInstrumentations.cpp
@@ -2031,7 +2031,8 @@ DotCfgDiff::DotCfgDiff(StringRef Title, const FuncDataT<DCData> &Before,
assert(NodePosition.count(Source) == 1 && "Expected to find node.");
DotCfgDiffNode &SourceNode = Nodes[NodePosition[Source]];
- assert(NodePosition.count(Sink) == 1 && "Expected to find node.");
+ if (NodePosition.count(Sink) == 0)
+ continue;
unsigned SinkNode = NodePosition[Sink];
StringRef Colour = E.second;
@@ -2247,7 +2248,12 @@ void DotCfgChangeReporter::handleFunctionCompare(
// Use the before entry block if the after entry block was removed.
if (EntryBlockName == "")
EntryBlockName = Before.getEntryBlockName();
- assert(EntryBlockName != "" && "Expected to find entry block");
+
+ if (EntryBlockName.empty()) {
+ errs() << "Warning: could not find entry block for function " << Name
+ << ", skipping dot-cfg output for pass " << PassID << ".\n";
+ return;
+ }
DotCfgDiffDisplayGraph DG = Diff.createDisplayGraph(Text, EntryBlockName);
DG.generateDotFile(DotFile);
diff --git a/llvm/lib/ProfileData/InstrProf.cpp b/llvm/lib/ProfileData/InstrProf.cpp
index 82469481881c0..19ab6591db8cc 100644
--- a/llvm/lib/ProfileData/InstrProf.cpp
+++ b/llvm/lib/ProfileData/InstrProf.cpp
@@ -957,7 +957,8 @@ void InstrProfRecord::mergeValueProfData(
}
void InstrProfRecord::merge(InstrProfRecord &Other, uint64_t Weight,
- function_ref<void(instrprof_error)> Warn) {
+ function_ref<void(instrprof_error)> Warn,
+ unsigned WaveSize) {
// If the number of counters doesn't match we either have bad data
// or a hash collision.
if (Counts.size() != Other.Counts.size()) {
@@ -965,24 +966,94 @@ void InstrProfRecord::merge(InstrProfRecord &Other, uint64_t Weight,
return;
}
- // Special handling of the first count as the PseudoCount.
- CountPseudoKind OtherKind = Other.getCountPseudoKind();
- CountPseudoKind ThisKind = getCountPseudoKind();
- if (OtherKind != NotPseudo || ThisKind != NotPseudo) {
- // We don't allow the merge of a profile with pseudo counts and
- // a normal profile (i.e. without pesudo counts).
- // Profile supplimenation should be done after the profile merge.
- if (OtherKind == NotPseudo || ThisKind == NotPseudo) {
- Warn(instrprof_error::count_mismatch);
- return;
+ if (Other.NumOffloadProfilingThreads > 0) {
+ uint64_t NumThreads = Other.NumOffloadProfilingThreads;
+ uint64_t NumCounters = Other.Counts.size() / (NumThreads + 1);
+ std::vector<uint64_t> NewCounts(NumCounters, 0);
+
+ // If WaveSize is specified, compute uniformity bits for each block.
+ // A block is considered wave-uniform if all its per-slot counter values
+ // are multiples of WaveSize (meaning all lanes were active when executed).
+ //
+ // However, if Other.UniformityBits is already set (e.g., from .unifcnts
+ // file), use that instead of the WaveSize-modulo heuristic, as the
+ // .unifcnts-based detection is more accurate for data-dependent divergence.
+ std::vector<uint8_t> NewUniformityBits;
+ bool UseExistingUniformity = !Other.UniformityBits.empty();
+ if (UseExistingUniformity) {
+ // Use the uniformity bits already computed from .unifcnts
+ NewUniformityBits = Other.UniformityBits;
+ } else if (WaveSize > 0) {
+ NewUniformityBits.resize((NumCounters + 7) / 8, 0xFF); // Default: uniform
}
- if (OtherKind == PseudoHot || ThisKind == PseudoHot)
- setPseudoCount(PseudoHot);
- else
- setPseudoCount(PseudoWarm);
+
+ for (size_t I = 0; I < NumCounters; ++I) {
+ uint64_t Sum = 0;
+ bool IsUniform = true;
+
+ for (size_t J = 0; J < NumThreads; ++J) {
+ uint64_t RawCount = Other.Counts[I * (NumThreads + 1) + J];
+
+ // Check uniformity: if count is non-zero and not a multiple of
+ // WaveSize, the block was entered via a divergent branch.
+ // Skip this check if we're using existing uniformity bits from
+ // .unifcnts.
+ if (!UseExistingUniformity && WaveSize > 0 && RawCount != 0 &&
+ (RawCount % WaveSize) != 0) {
+ IsUniform = false;
+ }
+
+ bool Overflowed;
+ uint64_t Value =
+ SaturatingMultiplyAdd(RawCount, Weight, uint64_t(0), &Overflowed);
+ if (Value > getInstrMaxCountValue()) {
+ Value = getInstrMaxCountValue();
+ Overflowed = true;
+ }
+ Sum += Value;
+ if (Overflowed)
+ Warn(instrprof_error::counter_overflow);
+ }
+ NewCounts[I] = Sum;
+
+ // Update uniformity bit for this block (only if not using existing bits)
+ if (!UseExistingUniformity && WaveSize > 0 && !IsUniform) {
+ // Clear the bit for non-uniform blocks
+ NewUniformityBits[I / 8] &= ~(1 << (I % 8));
+ }
+ }
+ Counts = NewCounts;
+ if (UseExistingUniformity || WaveSize > 0) {
+ UniformityBits = std::move(NewUniformityBits);
+ }
+ NumOffloadProfilingThreads = 0;
+ OffloadDeviceWaveSize = 0;
+
+ // Early return: offload data has been processed and reduced.
+ // Don't fall through to the regular merge loop which expects matching
+ // sizes.
return;
+ } else {
+ // Special handling of the first count as the PseudoCount.
+ CountPseudoKind OtherKind = Other.getCountPseudoKind();
+ CountPseudoKind ThisKind = getCountPseudoKind();
+ if (OtherKind != NotPseudo || ThisKind != NotPseudo) {
+ // We don't allow the merge of a profile with pseudo counts and
+      //  a normal profile (i.e. without pseudo counts).
+      //  Profile supplementation should be done after the profile merge.
+ if (OtherKind == NotPseudo || ThisKind == NotPseudo) {
+ Warn(instrprof_error::count_mismatch);
+ return;
+ }
+ if (OtherKind == PseudoHot || ThisKind == PseudoHot)
+ setPseudoCount(PseudoHot);
+ else
+ setPseudoCount(PseudoWarm);
+ return;
+ }
}
-
+ NumOffloadProfilingThreads = Other.NumOffloadProfilingThreads;
+ OffloadDeviceWaveSize = Other.OffloadDeviceWaveSize;
for (size_t I = 0, E = Other.Counts.size(); I < E; ++I) {
bool Overflowed;
uint64_t Value =
@@ -1022,15 +1093,32 @@ void InstrProfRecord::scaleValueProfData(
void InstrProfRecord::scale(uint64_t N, uint64_t D,
function_ref<void(instrprof_error)> Warn) {
assert(D != 0 && "D cannot be 0");
- for (auto &Count : this->Counts) {
- bool Overflowed;
- Count = SaturatingMultiply(Count, N, &Overflowed) / D;
- if (Count > getInstrMaxCountValue()) {
- Count = getInstrMaxCountValue();
- Overflowed = true;
+ if (NumOffloadProfilingThreads > 0) {
+ uint64_t NumThreads = NumOffloadProfilingThreads;
+ for (size_t I = 0, E = Counts.size(); I < E; I += NumThreads + 1) {
+ for (size_t J = 0; J < NumThreads; ++J) {
+ bool Overflowed;
+ uint64_t &Count = this->Counts[I + J];
+ Count = SaturatingMultiply(Count, N, &Overflowed) / D;
+ if (Count > getInstrMaxCountValue()) {
+ Count = getInstrMaxCountValue();
+ Overflowed = true;
+ }
+ if (Overflowed)
+ Warn(instrprof_error::counter_overflow);
+ }
+ }
+ } else {
+ for (auto &Count : this->Counts) {
+ bool Overflowed;
+ Count = SaturatingMultiply(Count, N, &Overflowed) / D;
+ if (Count > getInstrMaxCountValue()) {
+ Count = getInstrMaxCountValue();
+ Overflowed = true;
+ }
+ if (Overflowed)
+ Warn(instrprof_error::counter_overflow);
}
- if (Overflowed)
- Warn(instrprof_error::counter_overflow);
}
for (uint32_t Kind = IPVK_First; Kind <= IPVK_Last; ++Kind)
scaleValueProfData(Kind, N, D, Warn);
@@ -1692,7 +1780,7 @@ Expected<Header> Header::readFromBuffer(const unsigned char *Buffer) {
IndexedInstrProf::ProfVersion::CurrentVersion)
return make_error<InstrProfError>(instrprof_error::unsupported_version);
- static_assert(IndexedInstrProf::ProfVersion::CurrentVersion == Version13,
+ static_assert(IndexedInstrProf::ProfVersion::CurrentVersion == Version14,
"Please update the reader as needed when a new field is added "
"or when indexed profile version gets bumped.");
@@ -1725,10 +1813,11 @@ size_t Header::size() const {
// of the header, and byte offset of existing fields shouldn't change when
// indexed profile version gets incremented.
static_assert(
- IndexedInstrProf::ProfVersion::CurrentVersion == Version13,
+ IndexedInstrProf::ProfVersion::CurrentVersion == Version14,
"Please update the size computation below if a new field has "
"been added to the header; for a version bump without new "
"fields, add a case statement to fall through to the latest version.");
+ case 14ull: // UniformityBits added in record data, no header change
case 13ull:
case 12ull:
return 72;
diff --git a/llvm/lib/ProfileData/InstrProfCorrelator.cpp b/llvm/lib/ProfileData/InstrProfCorrelator.cpp
index b38189de31606..68f0a0c68015b 100644
--- a/llvm/lib/ProfileData/InstrProfCorrelator.cpp
+++ b/llvm/lib/ProfileData/InstrProfCorrelator.cpp
@@ -318,6 +318,9 @@ void InstrProfCorrelatorImpl<IntPtrT>::addDataProbe(uint64_t NameRef,
/*ValuesPtr=*/maybeSwap<IntPtrT>(0),
maybeSwap<uint32_t>(NumCounters),
/*NumValueSites=*/{maybeSwap<uint16_t>(0), maybeSwap<uint16_t>(0)},
+ // Offload profiling not used in correlation mode.
+ /*NumOffloadProfilingThreads=*/maybeSwap<uint16_t>(0),
+ /*OffloadDeviceWaveSize=*/maybeSwap<uint16_t>(0),
// TODO: MC/DC is not yet supported.
/*NumBitmapBytes=*/maybeSwap<uint32_t>(0),
});
diff --git a/llvm/lib/ProfileData/InstrProfReader.cpp b/llvm/lib/ProfileData/InstrProfReader.cpp
index 8147ee8d0e816..ad79581b205c1 100644
--- a/llvm/lib/ProfileData/InstrProfReader.cpp
+++ b/llvm/lib/ProfileData/InstrProfReader.cpp
@@ -723,6 +723,14 @@ Error RawInstrProfReader<IntPtrT>::readRawCounts(
if (NumCounters == 0)
return error(instrprof_error::malformed, "number of counters is zero");
+ // For GPU profiles with per-slot counters, the actual number of counter
+ // entries in the file is NumCounters * (NumOffloadProfilingThreads + 1).
+ // NumCounters in the data structure stores the base count (number of blocks),
+ // while the file contains expanded slots for wave-level profiling.
+ uint16_t NumOffloadThreads = swap(Data->NumOffloadProfilingThreads);
+ if (NumOffloadThreads > 0)
+ NumCounters *= (NumOffloadThreads + 1);
+
ptrdiff_t CounterBaseOffset = swap(Data->CounterPtr) - CountersDelta;
if (CounterBaseOffset < 0)
return error(
@@ -873,6 +881,9 @@ Error RawInstrProfReader<IntPtrT>::readNextRecord(NamedInstrProfRecord &Record)
if (Error E = readFuncHash(Record))
return error(std::move(E));
+ Record.NumOffloadProfilingThreads = swap(Data->NumOffloadProfilingThreads);
+ Record.OffloadDeviceWaveSize = swap(Data->OffloadDeviceWaveSize);
+
// Read raw counts and set Record.
if (Error E = readRawCounts(Record))
return error(std::move(E));
@@ -945,11 +956,12 @@ data_type InstrProfLookupTrait::ReadData(StringRef K, const unsigned char *D,
DataBuffer.clear();
std::vector<uint64_t> CounterBuffer;
std::vector<uint8_t> BitmapByteBuffer;
+ std::vector<uint8_t> UniformityBitsBuffer;
const unsigned char *End = D + N;
while (D < End) {
// Read hash.
- if (D + sizeof(uint64_t) >= End)
+ if (D + sizeof(uint64_t) > End)
return data_type();
uint64_t Hash = endian::readNext<uint64_t, llvm::endianness::little>(D);
@@ -977,18 +989,51 @@ data_type InstrProfLookupTrait::ReadData(StringRef K, const unsigned char *D,
if (D + sizeof(uint64_t) > End)
return data_type();
BitmapBytes = endian::readNext<uint64_t, llvm::endianness::little>(D);
- // Read bitmap byte values.
- if (D + BitmapBytes * sizeof(uint8_t) > End)
- return data_type();
BitmapByteBuffer.clear();
BitmapByteBuffer.reserve(BitmapBytes);
- for (uint64_t J = 0; J < BitmapBytes; ++J)
- BitmapByteBuffer.push_back(static_cast<uint8_t>(
- endian::readNext<uint64_t, llvm::endianness::little>(D)));
+
+ if (GET_VERSION(FormatVersion) >=
+ IndexedInstrProf::ProfVersion::Version14) {
+ // Version 14+: bitmap bytes stored as uint8_t with padding.
+ uint64_t PaddedSize = alignTo(BitmapBytes, sizeof(uint64_t));
+ if (D + PaddedSize > End)
+ return data_type();
+ for (uint64_t J = 0; J < BitmapBytes; ++J)
+ BitmapByteBuffer.push_back(
+ endian::readNext<uint8_t, llvm::endianness::little>(D));
+ for (uint64_t J = BitmapBytes; J < PaddedSize; ++J)
+ (void)endian::readNext<uint8_t, llvm::endianness::little>(D);
+
+ // Read uniformity bits (AMDGPU offload profiling).
+ uint64_t UniformityBitsSize = 0;
+ if (D + sizeof(uint64_t) > End)
+ return data_type();
+ UniformityBitsSize =
+ endian::readNext<uint64_t, llvm::endianness::little>(D);
+ uint64_t PaddedUniformitySize =
+ alignTo(UniformityBitsSize, sizeof(uint64_t));
+ if (D + PaddedUniformitySize > End)
+ return data_type();
+ UniformityBitsBuffer.clear();
+ UniformityBitsBuffer.reserve(UniformityBitsSize);
+ for (uint64_t J = 0; J < UniformityBitsSize; ++J)
+ UniformityBitsBuffer.push_back(
+ endian::readNext<uint8_t, llvm::endianness::little>(D));
+ for (uint64_t J = UniformityBitsSize; J < PaddedUniformitySize; ++J)
+ (void)endian::readNext<uint8_t, llvm::endianness::little>(D);
+ } else {
+ // Version 11-13: each bitmap byte stored as a uint64_t.
+ if (D + BitmapBytes * sizeof(uint64_t) > End)
+ return data_type();
+ for (uint64_t J = 0; J < BitmapBytes; ++J)
+ BitmapByteBuffer.push_back(static_cast<uint8_t>(
+ endian::readNext<uint64_t, llvm::endianness::little>(D)));
+ }
}
DataBuffer.emplace_back(K, Hash, std::move(CounterBuffer),
- std::move(BitmapByteBuffer));
+ std::move(BitmapByteBuffer),
+ std::move(UniformityBitsBuffer));
// Read value profiling data.
if (GET_VERSION(FormatVersion) > IndexedInstrProf::ProfVersion::Version2 &&
diff --git a/llvm/lib/ProfileData/InstrProfWriter.cpp b/llvm/lib/ProfileData/InstrProfWriter.cpp
index 0f15ca8ff6df7..7603cfbd5cb99 100644
--- a/llvm/lib/ProfileData/InstrProfWriter.cpp
+++ b/llvm/lib/ProfileData/InstrProfWriter.cpp
@@ -51,6 +51,7 @@ class InstrProfRecordWriterTrait {
llvm::endianness ValueProfDataEndianness = llvm::endianness::little;
InstrProfSummaryBuilder *SummaryBuilder;
InstrProfSummaryBuilder *CSSummaryBuilder;
+ bool WritePrevVersion = false;
InstrProfRecordWriterTrait() = default;
@@ -58,7 +59,7 @@ class InstrProfRecordWriterTrait {
return IndexedInstrProf::ComputeHash(K);
}
- static std::pair<offset_type, offset_type>
+ std::pair<offset_type, offset_type>
EmitKeyDataLength(raw_ostream &Out, key_type_ref K, data_type_ref V) {
using namespace support;
@@ -72,9 +73,22 @@ class InstrProfRecordWriterTrait {
const InstrProfRecord &ProfRecord = ProfileData.second;
M += sizeof(uint64_t); // The function hash
M += sizeof(uint64_t); // The size of the Counts vector
- M += ProfRecord.Counts.size() * sizeof(uint64_t);
+ size_t NumCounters = ProfRecord.Counts.size();
+ if (ProfRecord.NumOffloadProfilingThreads > 0) {
+ NumCounters /= (ProfRecord.NumOffloadProfilingThreads + 1);
+ }
+ M += NumCounters * sizeof(uint64_t);
M += sizeof(uint64_t); // The size of the Bitmap vector
- M += ProfRecord.BitmapBytes.size() * sizeof(uint64_t);
+ if (WritePrevVersion) {
+ // Version 13: each bitmap byte stored as a uint64_t.
+ M += ProfRecord.BitmapBytes.size() * sizeof(uint64_t);
+ } else {
+ // Version 14+: bitmap bytes as uint8_t with padding, plus
+ // uniformity bits.
+ M += alignTo(ProfRecord.BitmapBytes.size(), sizeof(uint64_t));
+ M += sizeof(uint64_t); // The size of the UniformityBits vector
+ M += alignTo(ProfRecord.UniformityBits.size(), sizeof(uint64_t));
+ }
// Value data
M += ValueProfData::getSize(ProfileData.second);
@@ -88,7 +102,8 @@ class InstrProfRecordWriterTrait {
Out.write(K.data(), N);
}
- void EmitData(raw_ostream &Out, key_type_ref, data_type_ref V, offset_type) {
+ void EmitData(raw_ostream &Out, key_type_ref K, data_type_ref V,
+ offset_type) {
using namespace support;
endian::Writer LE(Out, llvm::endianness::little);
@@ -100,13 +115,44 @@ class InstrProfRecordWriterTrait {
SummaryBuilder->addRecord(ProfRecord);
LE.write<uint64_t>(ProfileData.first); // Function hash
- LE.write<uint64_t>(ProfRecord.Counts.size());
- for (uint64_t I : ProfRecord.Counts)
- LE.write<uint64_t>(I);
+ if (ProfRecord.NumOffloadProfilingThreads > 0) {
+ uint64_t NumThreads = ProfRecord.NumOffloadProfilingThreads;
+ uint64_t NumCounters = ProfRecord.Counts.size() / (NumThreads + 1);
+ LE.write<uint64_t>(NumCounters);
+ for (size_t I = 0; I < NumCounters; ++I) {
+ uint64_t Sum = 0;
+ for (size_t J = 0; J < NumThreads; ++J)
+ Sum += ProfRecord.Counts[I * (NumThreads + 1) + J];
+ LE.write<uint64_t>(Sum);
+ }
+ } else {
+ LE.write<uint64_t>(ProfRecord.Counts.size());
+ for (uint64_t I : ProfRecord.Counts)
+ LE.write<uint64_t>(I);
+ }
LE.write<uint64_t>(ProfRecord.BitmapBytes.size());
- for (uint64_t I : ProfRecord.BitmapBytes)
- LE.write<uint64_t>(I);
+ if (WritePrevVersion) {
+ // Version 13: each bitmap byte stored as a uint64_t.
+ for (uint8_t I : ProfRecord.BitmapBytes)
+ LE.write<uint64_t>(I);
+ } else {
+ // Version 14+: bitmap bytes as uint8_t with padding.
+ for (uint8_t I : ProfRecord.BitmapBytes)
+ LE.write<uint8_t>(I);
+ for (size_t I = ProfRecord.BitmapBytes.size();
+ I < alignTo(ProfRecord.BitmapBytes.size(), sizeof(uint64_t)); ++I)
+ LE.write<uint8_t>(0);
+
+ // Write uniformity bits (AMDGPU offload profiling).
+ LE.write<uint64_t>(ProfRecord.UniformityBits.size());
+ for (uint8_t I : ProfRecord.UniformityBits)
+ LE.write<uint8_t>(I);
+ for (size_t I = ProfRecord.UniformityBits.size();
+ I < alignTo(ProfRecord.UniformityBits.size(), sizeof(uint64_t));
+ ++I)
+ LE.write<uint8_t>(0);
+ }
// Write value data
std::unique_ptr<ValueProfData> VDataPtr =
@@ -207,9 +253,18 @@ void InstrProfWriter::addRecord(StringRef Name, uint64_t Hash,
Dest = std::move(I);
if (Weight > 1)
Dest.scale(Weight, 1, MapWarn);
+ // For new records with offload profiling slots, compute uniformity bits
+ // using the wave size embedded in the raw profile data.
+ unsigned WaveSize = Dest.OffloadDeviceWaveSize;
+ if (WaveSize > 0 && Dest.NumOffloadProfilingThreads > 0) {
+ InstrProfRecord Temp;
+ Temp.Counts.resize(Dest.Counts.size());
+ Temp.merge(Dest, 1, MapWarn, WaveSize);
+ Dest = std::move(Temp);
+ }
} else {
// We're updating a function we've seen before.
- Dest.merge(I, Weight, MapWarn);
+ Dest.merge(I, Weight, MapWarn, I.OffloadDeviceWaveSize);
}
Dest.sortValueData();
@@ -524,6 +579,7 @@ Error InstrProfWriter::writeImpl(ProfOStream &OS) {
InfoObj->SummaryBuilder = &ISB;
InstrProfSummaryBuilder CSISB(ProfileSummaryBuilder::DefaultCutoffs);
InfoObj->CSSummaryBuilder = &CSISB;
+ InfoObj->WritePrevVersion = WritePrevVersion;
// Populate the hash table generator.
SmallVector<std::pair<StringRef, const ProfilingData *>> OrderedData;
@@ -542,7 +598,7 @@ Error InstrProfWriter::writeImpl(ProfOStream &OS) {
// The WritePrevVersion handling will either need to be removed or updated
// if the version is advanced beyond 12.
static_assert(IndexedInstrProf::ProfVersion::CurrentVersion ==
- IndexedInstrProf::ProfVersion::Version13);
+ IndexedInstrProf::ProfVersion::Version14);
if (static_cast<bool>(ProfileKind & InstrProfKind::IRInstrumentation))
Header.Version |= VARIANT_MASK_IR_PROF;
if (static_cast<bool>(ProfileKind & InstrProfKind::ContextSensitive))
diff --git a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
index 199b7357fa860..7ed84940f96f8 100644
--- a/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
+++ b/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp
@@ -33,12 +33,15 @@
#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
+#include "llvm/IR/GlobalAlias.h"
#include "llvm/IR/GlobalValue.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
+#include "llvm/IR/Intrinsics.h"
+#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/MDBuilder.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Type.h"
@@ -50,6 +53,7 @@
#include "llvm/Support/Compiler.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/ErrorHandling.h"
+#include "llvm/TargetParser/TargetParser.h"
#include "llvm/TargetParser/Triple.h"
#include "llvm/Transforms/Instrumentation/PGOInstrumentation.h"
#include "llvm/Transforms/Utils/BasicBlockUtils.h"
@@ -160,6 +164,14 @@ cl::opt<bool> SpeculativeCounterPromotionToLoop(
" update can be further/iteratively promoted into an acyclic "
" region."));
+static cl::opt<unsigned> OffloadPGOSampling(
+ "offload-pgo-sampling",
+ cl::desc("Log2 of the sampling period for offload PGO instrumentation. "
+ "Only 1 in every 2^N blocks is instrumented. "
+ "0 = all blocks, 1 = 50%, 2 = 25%, 3 = 12.5% (default). "
+ "Higher values reduce overhead at the cost of sparser profiles."),
+ cl::init(3));
+
cl::opt<bool> IterativeCounterPromotion(
"iterative-counter-promotion", cl::init(true),
cl::desc("Allow counter promotion across the whole loop nest."));
@@ -241,6 +253,20 @@ static bool profDataReferencedByCode(const Module &M) {
return enablesValueProfiling(M);
}
+// Extract CUID (Compilation Unit ID) from the module.
+// HIP/CUDA modules have a global variable __hip_cuid_<hash> that uniquely
+// identifies each translation unit. Returns empty string if not found.
+static std::string getCUIDFromModule(const Module &M) {
+ for (const GlobalVariable &GV : M.globals()) {
+ StringRef Name = GV.getName();
+ if (Name.starts_with("__hip_cuid_")) {
+ // Extract the hash suffix after "__hip_cuid_"
+ return Name.drop_front(strlen("__hip_cuid_")).str();
+ }
+ }
+ return "";
+}
+
class InstrLowerer final {
public:
InstrLowerer(Module &M, const InstrProfOptions &Options,
@@ -265,7 +291,8 @@ class InstrLowerer final {
struct PerFunctionProfileData {
uint32_t NumValueSites[IPVK_Last + 1] = {};
GlobalVariable *RegionCounters = nullptr;
- GlobalVariable *DataVar = nullptr;
+ GlobalVariable *UniformCounters = nullptr; // For AMDGPU divergence tracking
+ GlobalValue *DataVar = nullptr;
GlobalVariable *RegionBitmaps = nullptr;
uint32_t NumBitmapBytes = 0;
@@ -287,11 +314,40 @@ class InstrLowerer final {
GlobalVariable *NamesVar = nullptr;
size_t NamesSize = 0;
+ // For GPU targets: per-TU contiguous allocation of profile data.
+  // Instead of separate per-function counters (which the linker can reorder),
+ // we allocate one contiguous array for all counters in the TU.
+ GlobalVariable *ContiguousCnts = nullptr; // All counters in one array
+ GlobalVariable *ContiguousData =
+ nullptr; // All __llvm_profile_data in one array
+ GlobalVariable *ContiguousUCnts =
+ nullptr; // All uniform counters in one array
+ StructType *ProfileDataTy = nullptr;
+ SmallVector<Constant *, 16> ContiguousDataInits;
+ std::string CachedCUID; // CUID cached for consistent section naming
+
+ // Map from function name GlobalVariable to offset in contiguous arrays
+ DenseMap<GlobalVariable *, uint64_t> FunctionCounterOffsets;
+ DenseMap<GlobalVariable *, uint64_t> FunctionDataOffsets;
+ uint64_t TotalCounterSlots = 0; // Total slots across all functions
+ uint64_t TotalDataEntries = 0; // Total __llvm_profile_data entries
+
// vector of counter load/store pairs to be register promoted.
std::vector<LoadStorePair> PromotionCandidates;
int64_t TotalCountersPromoted = 0;
+ // Per-function cache of invariant values for AMDGPU PGO instrumentation.
+ // Computed once at the function entry and reused across all instrumentation
+ // points to avoid redundant IR and help the optimizer.
+ struct AMDGPUPGOInvariants {
+ Value *Matched = nullptr;
+ };
+ DenseMap<Function *, AMDGPUPGOInvariants> AMDGPUInvariantsCache;
+
+ /// Emit invariant PGO values at the function entry block and cache them.
+ AMDGPUPGOInvariants &getOrCreateAMDGPUInvariants(Function *F);
+
/// Lower instrumentation intrinsics in the function. Returns true if there
/// any lowering.
bool lowerIntrinsics(Function *F);
@@ -324,6 +380,9 @@ class InstrLowerer final {
/// Replace instrprof.increment with an increment of the appropriate value.
void lowerIncrement(InstrProfIncrementInst *Inc);
+ /// AMDGPU specific implementation of lowerIncrement.
+ void lowerIncrementAMDGPU(InstrProfIncrementInst *Inc);
+
/// Force emitting of name vars for unused functions.
void lowerCoverageData(GlobalVariable *CoverageNamesVar);
@@ -348,6 +407,10 @@ class InstrLowerer final {
/// referring to them will also be created.
GlobalVariable *getOrCreateRegionCounters(InstrProfCntrInstBase *Inc);
+ /// Get the uniform entry counters for AMDGPU divergence tracking.
+ /// These counters track how often blocks are entered with all lanes active.
+ GlobalVariable *getOrCreateUniformCounters(InstrProfCntrInstBase *Inc);
+
/// Create the region counters.
GlobalVariable *createRegionCounters(InstrProfCntrInstBase *Inc,
StringRef Name,
@@ -407,6 +470,30 @@ class InstrLowerer final {
/// Create a static initializer for our data, on platforms that need it,
/// and for any profile output file that was specified.
void emitInitialization();
+
+ /// For GPU targets: Collect all profiling intrinsics and allocate
+ /// contiguous arrays for counters, data, and uniform counters.
+ /// This avoids linker reordering issues with section boundaries.
+ void allocateContiguousProfileArrays();
+
+ /// Return the __llvm_profile_data struct type.
+ StructType *getProfileDataTy();
+
+ /// Finalize initializer for contiguous __llvm_profile_data array.
+ void finalizeContiguousProfileData();
+
+ /// Create __llvm_offload_prf structure for GPU targets.
+ /// Must be called AFTER contiguous arrays are allocated.
+ void createProfileSectionSymbols();
+
+ /// Create HIP device variable registration for profile symbols
+ void createHIPDeviceVariableRegistration();
+
+ /// Create HIP dynamic module registration call
+ void createHIPDynamicModuleRegistration();
+
+ /// Create HIP dynamic module unregistration call
+ void createHIPDynamicModuleUnregistration();
};
///
@@ -938,6 +1025,10 @@ bool InstrLowerer::lower() {
if (!ContainsProfiling && !CoverageNamesVar)
return MadeChange;
+ // For GPU targets: allocate contiguous arrays for all profile data.
+ // This avoids linker reordering issues with per-function arrays.
+ allocateContiguousProfileArrays();
+
// We did not know how many value sites there would be inside
// the instrumented function. This is counting the number of instrumented
// target value sites to enter it as field in the profile data variable.
@@ -982,10 +1073,22 @@ bool InstrLowerer::lower() {
if (!MadeChange)
return false;
+ finalizeContiguousProfileData();
+
emitVNodes();
emitNameData();
emitVTableNames();
+ // Create start/stop symbols for device code profile sections
+ createProfileSectionSymbols();
+
+ // Create host shadow variables and registration calls for HIP device profile
+ // symbols
+ createHIPDeviceVariableRegistration();
+
+ createHIPDynamicModuleRegistration();
+ createHIPDynamicModuleUnregistration();
+
// Emit runtime hook for the cases where the target does not unconditionally
// require pulling in profile runtime, and coverage is enabled on code that is
// not eliminated by the front-end, e.g. unused functions with internal
@@ -1045,7 +1148,7 @@ void InstrLowerer::lowerValueProfileInst(InstrProfValueProfileInst *Ind) {
assert(It != ProfileDataMap.end() && It->second.DataVar &&
"value profiling detected in function with no counter increment");
- GlobalVariable *DataVar = It->second.DataVar;
+ GlobalValue *DataVar = It->second.DataVar;
uint64_t ValueKind = Ind->getValueKind()->getZExtValue();
uint64_t Index = Ind->getIndex()->getZExtValue();
for (uint32_t Kind = IPVK_First; Kind < ValueKind; ++Kind)
@@ -1057,7 +1160,7 @@ void InstrLowerer::lowerValueProfileInst(InstrProfValueProfileInst *Ind) {
CallInst *Call = nullptr;
auto *TLI = &GetTLI(*Ind->getFunction());
auto *NormalizedDataVarPtr = ConstantExpr::getPointerBitCastOrAddrSpaceCast(
- DataVar, PointerType::get(M.getContext(), 0));
+ cast<Constant>(DataVar), PointerType::get(M.getContext(), 0));
// To support value profiling calls within Windows exception handlers, funclet
// information contained within operand bundles needs to be copied over to
@@ -1107,6 +1210,9 @@ GlobalVariable *InstrLowerer::getOrCreateBiasVar(StringRef VarName) {
}
Value *InstrLowerer::getCounterAddress(InstrProfCntrInstBase *I) {
+ // Note: For AMDGPU targets, lowerIncrementAMDGPU handles counter addressing
+ // directly using ContiguousCnts. This function is called for non-AMDGPU
+ // targets.
auto *Counters = getOrCreateRegionCounters(I);
IRBuilder<> Builder(I);
@@ -1189,6 +1295,10 @@ void InstrLowerer::lowerTimestamp(
}
void InstrLowerer::lowerIncrement(InstrProfIncrementInst *Inc) {
+ if (TT.isAMDGPU()) {
+ lowerIncrementAMDGPU(Inc);
+ return;
+ }
auto *Addr = getCounterAddress(Inc);
IRBuilder<> Builder(Inc);
@@ -1207,6 +1317,158 @@ void InstrLowerer::lowerIncrement(InstrProfIncrementInst *Inc) {
Inc->eraseFromParent();
}
+// Determine the wavefront size for an AMDGPU function.
+// Checks target-features attribute first (+wavefrontsize32/+wavefrontsize64),
+// then falls back to the default wavefront size for the target-cpu.
+// Returns 32 or 64. Defaults to 32 if undetermined.
+static unsigned getAMDGPUWavefrontSize(const Function &F) {
+ // Check target-features attribute for explicit wavefront size
+ StringRef Features = F.getFnAttribute("target-features").getValueAsString();
+ if (Features.contains("+wavefrontsize64"))
+ return 64;
+ if (Features.contains("+wavefrontsize32"))
+ return 32;
+
+ // Fall back to default wavefront size based on target-cpu
+ StringRef CPU = F.getFnAttribute("target-cpu").getValueAsString();
+ if (!CPU.empty()) {
+ AMDGPU::GPUKind Kind = AMDGPU::parseArchAMDGCN(CPU);
+ unsigned Features = AMDGPU::getArchAttrAMDGCN(Kind);
+ if (Features & AMDGPU::FEATURE_WAVE32)
+ return 32;
+ return 64; // gfx9 and older default to Wave64
+ }
+
+ return 32; // conservative default
+}
+
+InstrLowerer::AMDGPUPGOInvariants &
+InstrLowerer::getOrCreateAMDGPUInvariants(Function *F) {
+ auto It = AMDGPUInvariantsCache.find(F);
+ if (It != AMDGPUInvariantsCache.end())
+ return It->second;
+
+ LLVMContext &Context = M.getContext();
+
+ BasicBlock &EntryBB = F->getEntryBlock();
+ IRBuilder<> Builder(&*EntryBB.getFirstInsertionPt());
+
+ Value *Matched = ConstantInt::getTrue(Context);
+ if (OffloadPGOSampling > 0) {
+ auto *Int32Ty = Type::getInt32Ty(Context);
+ FunctionCallee IsSampledFn = M.getOrInsertFunction(
+ "__gpu_pgo_is_sampled", Int32Ty, Int32Ty);
+ Value *SampledInt = Builder.CreateCall(
+ IsSampledFn,
+ {ConstantInt::get(Int32Ty, OffloadPGOSampling)},
+ "pgo.sampled");
+ Matched = Builder.CreateICmpNE(
+ SampledInt, ConstantInt::get(Int32Ty, 0), "pgo.matched");
+ }
+
+ auto &Inv = AMDGPUInvariantsCache[F];
+ Inv.Matched = Matched;
+ return Inv;
+}
+
+void InstrLowerer::lowerIncrementAMDGPU(InstrProfIncrementInst *Inc) {
+ Function *F = Inc->getFunction();
+ auto &Inv = getOrCreateAMDGPUInvariants(F);
+
+ IRBuilder<> Builder(Inc);
+ LLVMContext &Context = M.getContext();
+ auto *Int32Ty = Type::getInt32Ty(Context);
+ auto *Int64Ty = Type::getInt64Ty(Context);
+
+ Value *Matched = Inv.Matched;
+
+ auto *CounterIdx = Inc->getIndex();
+
+ // --- Counter address ---
+ GlobalVariable *Counters = nullptr;
+ GlobalVariable *UniformCounters = nullptr;
+ Value *Addr = nullptr;
+ Value *UniformAddr = nullptr;
+
+ if (ContiguousCnts) {
+ GlobalVariable *NamePtr = Inc->getName();
+ uint64_t FuncOffset = FunctionCounterOffsets.lookup(NamePtr);
+
+ Value *OffsetCounterIdx = Builder.CreateAdd(
+ CounterIdx, Builder.getInt32(FuncOffset), "OffsetCounterIdx");
+
+ Counters = ContiguousCnts;
+ Value *Indices[] = {Builder.getInt32(0), OffsetCounterIdx};
+ Addr = Builder.CreateInBoundsGEP(Counters->getValueType(), Counters,
+ Indices, "ctr.addr");
+
+ if (ContiguousUCnts) {
+ UniformCounters = ContiguousUCnts;
+ Value *UniformIndices[] = {Builder.getInt32(0), OffsetCounterIdx};
+ UniformAddr = Builder.CreateInBoundsGEP(UniformCounters->getValueType(),
+ UniformCounters, UniformIndices,
+ "unifctr.addr");
+ }
+ } else {
+ Counters = getOrCreateRegionCounters(Inc);
+ Value *Indices[] = {Builder.getInt32(0), CounterIdx};
+ Addr = Builder.CreateInBoundsGEP(Counters->getValueType(), Counters,
+ Indices, "ctr.addr");
+
+ UniformCounters = getOrCreateUniformCounters(Inc);
+ if (UniformCounters) {
+ Value *UniformIndices[] = {Builder.getInt32(0), CounterIdx};
+ UniformAddr = Builder.CreateInBoundsGEP(UniformCounters->getValueType(),
+ UniformCounters, UniformIndices,
+ "unifctr.addr");
+ }
+ }
+
+ // Use addrspace(1) pointers directly for the library call to generate
+ // global_load/global_store instead of slower flat_load/flat_store.
+ auto *GlobalPtrTy = PointerType::get(Context, 1);
+ Value *UniformAddrArg =
+ UniformAddr ? UniformAddr
+ : ConstantPointerNull::get(cast<PointerType>(GlobalPtrTy));
+
+ Value *IncStep = Inc->getStep();
+ Value *StepI64 =
+ Builder.CreateZExtOrTrunc(IncStep, Int64Ty, "step.i64");
+
+ // --- Increment via library call ---
+ if (OffloadPGOSampling > 0) {
+ // Sampled mode: guard the call behind the sampling decision.
+ // Non-sampled blocks skip entirely.
+ BasicBlock *CurBB = Builder.GetInsertBlock();
+ BasicBlock *ContBB =
+ CurBB->splitBasicBlock(BasicBlock::iterator(Inc), "po_cont");
+ BasicBlock *ThenBB =
+ BasicBlock::Create(Context, "po_then", F);
+
+ CurBB->getTerminator()->eraseFromParent();
+ IRBuilder<> HeadBuilder(CurBB);
+ HeadBuilder.CreateCondBr(Matched, ThenBB, ContBB);
+
+ IRBuilder<> ThenBuilder(ThenBB);
+ FunctionCallee IncrFnPO = M.getOrInsertFunction(
+ "__gpu_pgo_increment",
+ Type::getVoidTy(Context), GlobalPtrTy, GlobalPtrTy, Int64Ty);
+ ThenBuilder.CreateCall(IncrFnPO,
+ {Addr, UniformAddrArg, StepI64});
+ ThenBuilder.CreateBr(ContBB);
+
+ Builder.SetInsertPoint(ContBB, ContBB->begin());
+ } else {
+ // No sampling: always call the increment function.
+ FunctionCallee IncrFn = M.getOrInsertFunction(
+ "__gpu_pgo_increment",
+ Type::getVoidTy(Context), GlobalPtrTy, GlobalPtrTy, Int64Ty);
+ Builder.CreateCall(IncrFn, {Addr, UniformAddrArg, StepI64});
+ }
+
+ Inc->eraseFromParent();
+}
+
void InstrLowerer::lowerCoverageData(GlobalVariable *CoverageNamesVar) {
ConstantArray *Names =
cast<ConstantArray>(CoverageNamesVar->getInitializer());
@@ -1389,6 +1651,12 @@ static inline Constant *getFuncAddrForProfData(Function *Fn) {
if (shouldUsePublicSymbol(Fn))
return Fn;
+  // For GPU targets, weak functions cannot use private aliases because
+  // LTO may pick a different TU's copy, leaving the alias undefined.
+ if (isGPUProfTarget(*Fn->getParent()) &&
+ GlobalValue::isWeakForLinker(Fn->getLinkage()))
+ return Fn;
+
// When possible use a private alias to avoid symbolic relocations.
auto *GA = GlobalAlias::create(GlobalValue::LinkageTypes::PrivateLinkage,
Fn->getName() + ".local", Fn);
@@ -1610,7 +1878,15 @@ GlobalVariable *InstrLowerer::setupProfileSection(InstrProfInstBase *Inc,
Ptr->setVisibility(Visibility);
// Put the counters and bitmaps in their own sections so linkers can
// remove unneeded sections.
- Ptr->setSection(getInstrProfSectionName(IPSK, TT.getObjectFormat()));
+ // For GPU targets, use per-TU sections with CUID suffix for proper
+ // memory tracking via anchor variable registration.
+ std::string SectionName = getInstrProfSectionName(IPSK, TT.getObjectFormat());
+ if (isGPUProfTarget(M)) {
+ std::string CUID = getCUIDFromModule(M);
+ if (!CUID.empty())
+ SectionName = SectionName + "_" + CUID;
+ }
+ Ptr->setSection(SectionName);
Ptr->setLinkage(Linkage);
maybeSetComdat(Ptr, Fn, VarName);
return Ptr;
@@ -1675,6 +1951,18 @@ InstrLowerer::getOrCreateRegionCounters(InstrProfCntrInstBase *Inc) {
if (PD.RegionCounters)
return PD.RegionCounters;
+  // For GPU targets with contiguous allocation, use the contiguous array
+  // instead of creating a per-function array.
+ if (ContiguousCnts) {
+    // Store the contiguous array as RegionCounters for this function; the
+    // actual per-function offset is applied in lowerIncrementAMDGPU.
+ PD.RegionCounters = ContiguousCnts;
+
+ // Still create the data variable (it will point to the right offset)
+ createDataVariable(Inc);
+ return PD.RegionCounters;
+ }
+
// If RegionCounters doesn't already exist, create it by first setting up
// the corresponding profile section.
auto *CounterPtr = setupProfileSection(Inc, IPSK_cnts);
@@ -1722,6 +2010,55 @@ InstrLowerer::getOrCreateRegionCounters(InstrProfCntrInstBase *Inc) {
return PD.RegionCounters;
}
+GlobalVariable *
+InstrLowerer::getOrCreateUniformCounters(InstrProfCntrInstBase *Inc) {
+ // Only create uniform counters for AMDGPU targets
+ if (!TT.isAMDGPU())
+ return nullptr;
+
+ GlobalVariable *NamePtr = Inc->getName();
+ auto &PD = ProfileDataMap[NamePtr];
+ if (PD.UniformCounters)
+ return PD.UniformCounters;
+
+ // For contiguous allocation, use the contiguous uniform counter array
+ if (ContiguousUCnts) {
+ PD.UniformCounters = ContiguousUCnts;
+ return PD.UniformCounters;
+ }
+
+ // Ensure RegionCounters exists first (we need the same size)
+ getOrCreateRegionCounters(Inc);
+
+ // Create uniform counters with the same size as region counters
+ uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
+
+ auto &Ctx = M.getContext();
+ auto *CounterTy = ArrayType::get(Type::getInt64Ty(Ctx), NumCounters);
+
+ // Use a different prefix for uniform counters
+ bool Renamed;
+ std::string VarName = getVarName(Inc, "__llvm_prf_unifcnt_", Renamed);
+
+ auto *GV = new GlobalVariable(M, CounterTy, false, NamePtr->getLinkage(),
+ Constant::getNullValue(CounterTy), VarName);
+ GV->setAlignment(Align(8));
+ GV->setVisibility(NamePtr->getVisibility());
+
+ // For GPU targets, use per-TU sections with CUID suffix
+ std::string SectionName =
+ getInstrProfSectionName(IPSK_ucnts, TT.getObjectFormat());
+ std::string CUID = getCUIDFromModule(M);
+ if (!CUID.empty())
+ SectionName = SectionName + "_" + CUID;
+ GV->setSection(SectionName);
+
+ PD.UniformCounters = GV;
+ CompilerUsedVars.push_back(GV);
+
+ return PD.UniformCounters;
+}
+
void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
// When debug information is correlated to profile data, a data variable
// is not needed.
@@ -1783,8 +2120,25 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
ValuesVar, PointerType::get(Fn->getContext(), 0));
}
+ // NumCounters in __llvm_profile_data is the ORIGINAL counter count,
+ // not the expanded count with slots. The expansion factor is stored
+ // separately in NumOffloadProfilingThreads.
uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
- auto *CounterPtr = PD.RegionCounters;
+
+ // For contiguous allocation, CounterPtr should point to this function's
+ // offset within the contiguous array
+ Constant *CounterPtr;
+ if (ContiguousCnts && PD.RegionCounters == ContiguousCnts) {
+ uint64_t FuncOffset = FunctionCounterOffsets.lookup(NamePtr);
+ // Create a GEP to the function's counter offset
+ CounterPtr = ConstantExpr::getInBoundsGetElementPtr(
+ ContiguousCnts->getValueType(), ContiguousCnts,
+ ArrayRef<Constant *>{
+ ConstantInt::get(Type::getInt64Ty(Ctx), 0),
+ ConstantInt::get(Type::getInt64Ty(Ctx), FuncOffset)});
+ } else {
+ CounterPtr = PD.RegionCounters;
+ }
uint64_t NumBitmapBytes = PD.NumBitmapBytes;
@@ -1792,11 +2146,7 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
auto *IntPtrTy = M.getDataLayout().getIntPtrType(M.getContext());
auto *Int16Ty = Type::getInt16Ty(Ctx);
auto *Int16ArrayTy = ArrayType::get(Int16Ty, IPVK_Last + 1);
- Type *DataTypes[] = {
-#define INSTR_PROF_DATA(Type, LLVMType, Name, Init) LLVMType,
-#include "llvm/ProfileData/InstrProfData.inc"
- };
- auto *DataTy = StructType::get(Ctx, ArrayRef(DataTypes));
+ auto *DataTy = getProfileDataTy();
Constant *FunctionAddr = getFuncAddrForProfData(Fn);
@@ -1804,8 +2154,18 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
for (uint32_t Kind = IPVK_First; Kind <= IPVK_Last; ++Kind)
Int16ArrayVals[Kind] = ConstantInt::get(Int16Ty, PD.NumValueSites[Kind]);
+ uint16_t NumOffloadProfilingThreadsVal = 0;
+ uint16_t OffloadDeviceWaveSizeVal = 0;
+ if (TT.isAMDGPU())
+ OffloadDeviceWaveSizeVal = getAMDGPUWavefrontSize(*Fn);
+
if (isGPUProfTarget(M)) {
- Linkage = GlobalValue::ExternalLinkage;
+    // For GPU targets, weak functions need weak linkage for their profile data
+    // aliases to allow linker deduplication across TUs.
+ if (GlobalValue::isWeakForLinker(Fn->getLinkage()))
+ Linkage = Fn->getLinkage();
+ else
+ Linkage = GlobalValue::ExternalLinkage;
Visibility = GlobalValue::ProtectedVisibility;
}
// If the data variable is not referenced by code (if we don't emit
@@ -1825,8 +2185,25 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
Linkage = GlobalValue::PrivateLinkage;
Visibility = GlobalValue::DefaultVisibility;
}
- auto *Data =
- new GlobalVariable(M, DataTy, false, Linkage, nullptr, DataVarName);
+ GlobalValue *DataVar = nullptr;
+ Constant *DataAddr = nullptr;
+ uint64_t DataIndex = 0;
+ if (ContiguousData) {
+ DataIndex = FunctionDataOffsets.lookup(NamePtr);
+ assert(DataIndex < ContiguousDataInits.size() &&
+ "missing contiguous data slot");
+ DataAddr = ConstantExpr::getInBoundsGetElementPtr(
+ ContiguousData->getValueType(), ContiguousData,
+ ArrayRef<Constant *>{
+ ConstantInt::get(Type::getInt64Ty(Ctx), 0),
+ ConstantInt::get(Type::getInt64Ty(Ctx), DataIndex)});
+ } else {
+ auto *Data =
+ new GlobalVariable(M, DataTy, false, Linkage, nullptr, DataVarName);
+ DataVar = Data;
+ DataAddr = Data;
+ }
+
Constant *RelativeCounterPtr;
GlobalVariable *BitmapPtr = PD.RegionBitmaps;
Constant *RelativeBitmapPtr = ConstantInt::get(IntPtrTy, 0);
@@ -1844,29 +2221,48 @@ void InstrLowerer::createDataVariable(InstrProfCntrInstBase *Inc) {
DataSectionKind = IPSK_data;
RelativeCounterPtr =
ConstantExpr::getSub(ConstantExpr::getPtrToInt(CounterPtr, IntPtrTy),
- ConstantExpr::getPtrToInt(Data, IntPtrTy));
+ ConstantExpr::getPtrToInt(DataAddr, IntPtrTy));
if (BitmapPtr != nullptr)
RelativeBitmapPtr =
ConstantExpr::getSub(ConstantExpr::getPtrToInt(BitmapPtr, IntPtrTy),
- ConstantExpr::getPtrToInt(Data, IntPtrTy));
+ ConstantExpr::getPtrToInt(DataAddr, IntPtrTy));
}
Constant *DataVals[] = {
#define INSTR_PROF_DATA(Type, LLVMType, Name, Init) Init,
#include "llvm/ProfileData/InstrProfData.inc"
};
- Data->setInitializer(ConstantStruct::get(DataTy, DataVals));
-
- Data->setVisibility(Visibility);
- Data->setSection(
- getInstrProfSectionName(DataSectionKind, TT.getObjectFormat()));
- Data->setAlignment(Align(INSTR_PROF_DATA_ALIGNMENT));
- maybeSetComdat(Data, Fn, CntsVarName);
+ auto *DataInit = ConstantStruct::get(DataTy, DataVals);
+
+ if (ContiguousData) {
+ ContiguousDataInits[DataIndex] = DataInit;
+ auto *Alias = GlobalAlias::create(
+ DataTy, DataAddr->getType()->getPointerAddressSpace(), Linkage,
+ DataVarName, DataAddr, &M);
+ Alias->setVisibility(Visibility);
+ DataVar = Alias;
+ } else {
+ auto *DataGV = cast<GlobalVariable>(DataVar);
+ DataGV->setInitializer(DataInit);
+
+ DataGV->setVisibility(Visibility);
+ // For GPU targets, use per-TU sections with CUID suffix
+ std::string DataSectionName =
+ getInstrProfSectionName(DataSectionKind, TT.getObjectFormat());
+ if (isGPUProfTarget(M)) {
+ std::string CUID = getCUIDFromModule(M);
+ if (!CUID.empty())
+ DataSectionName = DataSectionName + "_" + CUID;
+ }
+ DataGV->setSection(DataSectionName);
+ DataGV->setAlignment(Align(INSTR_PROF_DATA_ALIGNMENT));
+ maybeSetComdat(DataGV, Fn, CntsVarName);
+ }
- PD.DataVar = Data;
+ PD.DataVar = DataVar;
// Mark the data variable as used so that it isn't stripped out.
- CompilerUsedVars.push_back(Data);
+ CompilerUsedVars.push_back(DataVar);
// Now that the linkage set by the FE has been passed to the data and counter
// variables, reset Name variable's linkage and visibility to private so that
// it can be removed later by the compiler.
@@ -1926,6 +2322,103 @@ void InstrLowerer::emitVNodes() {
UsedVars.push_back(VNodesVar);
}
+void InstrLowerer::createHIPDynamicModuleRegistration() {
+ if (isGPUProfTarget(M))
+ return;
+ StringRef FuncNames[] = {"hipModuleLoad", "hipModuleLoadData",
+ "hipModuleLoadDataEx"};
+ for (StringRef FuncName : FuncNames) {
+ Function *F = M.getFunction(FuncName);
+ if (!F)
+ continue;
+
+ for (User *U : F->users()) {
+ if (auto *CB = dyn_cast<CallBase>(U)) {
+ Instruction *InsertPt = nullptr;
+ // If the call is an invoke instruction, we should insert the
+ // registration call in the normal destination block.
+ if (auto *Invoke = dyn_cast<InvokeInst>(CB)) {
+ InsertPt = &*Invoke->getNormalDest()->getFirstInsertionPt();
+ } else if (CB->isTerminator()) {
+ // If it's another kind of terminator (e.g., callbr), we don't
+ // know the semantics of the successors, so we conservatively
+ // skip it. The hipModuleLoad* functions are not expected to be
+ // used in other terminator instructions.
+ continue;
+ } else {
+ // This is a normal call instruction, so we can insert after it.
+ InsertPt = CB->getNextNode();
+ }
+
+ // If there's no valid insertion point (e.g., a malformed block),
+ // skip.
+ if (!InsertPt)
+ continue;
+
+ IRBuilder<> Builder(InsertPt);
+ auto *VoidTy = Type::getVoidTy(M.getContext());
+ auto *VoidPtrTy = PointerType::getUnqual(M.getContext());
+ auto *Int32Ty = Type::getInt32Ty(M.getContext());
+ // register(int rc, void **modulePtr, const void *image)
+ auto *RegisterDynamicModuleTy =
+ FunctionType::get(VoidTy, {Int32Ty, VoidPtrTy, VoidPtrTy}, false);
+ FunctionCallee RegisterFunc = M.getOrInsertFunction(
+ "__llvm_profile_offload_register_dynamic_module",
+ RegisterDynamicModuleTy);
+
+ // Arg 0: return value of the hipModuleLoad* call (hipError_t / i32).
+ Value *ReturnValue = CB;
+ // Arg 1: module handle (out-parameter, hipModule_t*).
+ Value *ModuleHandle = CB->getArgOperand(0);
+ // Arg 2: code object image pointer.
+ // For hipModuleLoadData(module, image) and
+ // hipModuleLoadDataEx(module, image, ...), image is arg 1.
+ // For hipModuleLoad(module, fname), arg 1 is a filename — pass NULL.
+ Value *ImagePtr;
+ if (FuncName == "hipModuleLoad")
+ ImagePtr =
+ ConstantPointerNull::get(PointerType::getUnqual(M.getContext()));
+ else
+ ImagePtr = CB->getArgOperand(1);
+
+ Builder.CreateCall(RegisterFunc,
+ {ReturnValue, ModuleHandle, ImagePtr});
+ }
+ }
+ }
+}
+
+void InstrLowerer::createHIPDynamicModuleUnregistration() {
+ Function *F = M.getFunction("hipModuleUnload");
+ if (!F)
+ return;
+
+ for (User *U : F->users()) {
+ if (auto *CB = dyn_cast_or_null<CallBase>(U)) {
+ // The insertion point is right before the call to hipModuleUnload.
+ Instruction *InsertPt = CB;
+
+ IRBuilder<> Builder(InsertPt);
+ auto *VoidTy = Type::getVoidTy(M.getContext());
+ auto *VoidPtrTy = PointerType::getUnqual(M.getContext());
+
+ auto *UnregisterDynamicModuleTy =
+ FunctionType::get(VoidTy, {VoidPtrTy}, false);
+ FunctionCallee UnregisterFunc = M.getOrInsertFunction(
+ "__llvm_profile_offload_unregister_dynamic_module",
+ UnregisterDynamicModuleTy);
+
+ // The argument is the module handle, which is the first
+ // argument to the hipModuleUnload call.
+ Value *ModuleHandle = CB->getArgOperand(0);
+ Value *CastedModuleHandle =
+ Builder.CreatePointerCast(ModuleHandle, VoidPtrTy);
+
+ Builder.CreateCall(UnregisterFunc, {CastedModuleHandle});
+ }
+ }
+}
+
void InstrLowerer::emitNameData() {
if (ReferencedNames.empty())
return;
@@ -1939,9 +2432,15 @@ void InstrLowerer::emitNameData() {
auto &Ctx = M.getContext();
auto *NamesVal =
ConstantDataArray::getString(Ctx, StringRef(CompressedNameStr), false);
- NamesVar = new GlobalVariable(M, NamesVal->getType(), true,
- GlobalValue::PrivateLinkage, NamesVal,
- getInstrProfNamesVarName());
+ std::string NamesVarName = std::string(getInstrProfNamesVarName());
+ if (isGPUProfTarget(M)) {
+ std::string CUID = CachedCUID.empty() ? getCUIDFromModule(M) : CachedCUID;
+ if (!CUID.empty())
+ NamesVarName = NamesVarName + "_" + CUID;
+ }
+ NamesVar =
+ new GlobalVariable(M, NamesVal->getType(), true,
+ GlobalValue::PrivateLinkage, NamesVal, NamesVarName);
if (isGPUProfTarget(M)) {
NamesVar->setLinkage(GlobalValue::ExternalLinkage);
NamesVar->setVisibility(GlobalValue::ProtectedVisibility);
@@ -1949,10 +2448,17 @@ void InstrLowerer::emitNameData() {
NamesSize = CompressedNameStr.size();
setGlobalVariableLargeSection(TT, *NamesVar);
- NamesVar->setSection(
+ // For GPU targets, use per-TU sections with CUID suffix
+ std::string NamesSectionName =
ProfileCorrelate == InstrProfCorrelator::BINARY
? getInstrProfSectionName(IPSK_covname, TT.getObjectFormat())
- : getInstrProfSectionName(IPSK_name, TT.getObjectFormat()));
+ : getInstrProfSectionName(IPSK_name, TT.getObjectFormat());
+ if (isGPUProfTarget(M)) {
+ std::string CUID = getCUIDFromModule(M);
+ if (!CUID.empty())
+ NamesSectionName = NamesSectionName + "_" + CUID;
+ }
+ NamesVar->setSection(NamesSectionName);
// On COFF, it's important to reduce the alignment down to 1 to prevent the
// linker from inserting padding before the start of the names section or
// between names entries.
@@ -2159,3 +2665,390 @@ void createProfileSamplingVar(Module &M) {
appendToCompilerUsed(M, SamplingVar);
}
} // namespace llvm
+
+namespace {
+
+// Build (and cache) the __llvm_profile_data struct type from
+// InstrProfData.inc, shared by the contiguous and per-function paths.
+StructType *InstrLowerer::getProfileDataTy() {
+ if (ProfileDataTy)
+ return ProfileDataTy;
+
+ auto &Ctx = M.getContext();
+ auto *IntPtrTy = M.getDataLayout().getIntPtrType(M.getContext());
+ auto *Int16Ty = Type::getInt16Ty(Ctx);
+ auto *Int16ArrayTy = ArrayType::get(Int16Ty, IPVK_Last + 1);
+ Type *DataTypes[] = {
+#define INSTR_PROF_DATA(Type, LLVMType, Name, Init) LLVMType,
+#include "llvm/ProfileData/InstrProfData.inc"
+ };
+ ProfileDataTy = StructType::get(Ctx, ArrayRef(DataTypes));
+ return ProfileDataTy;
+}
+
+void InstrLowerer::finalizeContiguousProfileData() {
+ if (!ContiguousData || ContiguousDataInits.empty())
+ return;
+
+ auto *DataTy = getProfileDataTy();
+ for (auto &Entry : ContiguousDataInits)
+ if (!Entry)
+ Entry = Constant::getNullValue(DataTy);
+
+ auto *DataArrayTy = cast<ArrayType>(ContiguousData->getValueType());
+ ContiguousData->setInitializer(
+ ConstantArray::get(DataArrayTy, ContiguousDataInits));
+}
+
+// For GPU targets: allocate contiguous arrays for all profile data. This
+// sidesteps the linker reordering problem by emitting ONE symbol per section
+// type, so there is nothing for the linker to reorder.
+void InstrLowerer::allocateContiguousProfileArrays() {
+ if (!isGPUProfTarget(M))
+ return;
+
+ CachedCUID = getCUIDFromModule(M);
+ if (CachedCUID.empty())
+ return;
+
+ LLVM_DEBUG(llvm::dbgs() << "Allocating contiguous arrays for CUID="
+ << CachedCUID << "\n");
+
+ TotalCounterSlots = 0;
+ TotalDataEntries = 0;
+
+ for (Function &F : M) {
+ for (BasicBlock &BB : F) {
+ for (Instruction &I : BB) {
+ if (auto *Inc = dyn_cast<InstrProfIncrementInst>(&I)) {
+ GlobalVariable *NamePtr = Inc->getName();
+ if (FunctionCounterOffsets.count(NamePtr) == 0) {
+ uint64_t NumCounters = Inc->getNumCounters()->getZExtValue();
+
+ FunctionCounterOffsets[NamePtr] = TotalCounterSlots;
+ FunctionDataOffsets[NamePtr] = TotalDataEntries;
+ TotalCounterSlots += NumCounters;
+ TotalDataEntries++;
+ }
+ break;
+ }
+ if (auto *Cover = dyn_cast<InstrProfCoverInst>(&I)) {
+ GlobalVariable *NamePtr = Cover->getName();
+ if (FunctionCounterOffsets.count(NamePtr) == 0) {
+ uint64_t NumCounters = Cover->getNumCounters()->getZExtValue();
+ FunctionCounterOffsets[NamePtr] = TotalCounterSlots;
+ FunctionDataOffsets[NamePtr] = TotalDataEntries;
+ TotalCounterSlots += NumCounters;
+ TotalDataEntries++;
+ }
+ break;
+ }
+ }
+ }
+ }
+
+ if (TotalCounterSlots == 0)
+ return;
+
+ auto &Ctx = M.getContext();
+ auto *Int64Ty = Type::getInt64Ty(Ctx);
+
+ // Create contiguous counter array
+ auto *CntsArrayTy = ArrayType::get(Int64Ty, TotalCounterSlots);
+ std::string CntsSectionName = "__llvm_prf_cnts_" + CachedCUID;
+ ContiguousCnts = new GlobalVariable(
+ M, CntsArrayTy, /*isConstant=*/false, GlobalValue::ExternalLinkage,
+ Constant::getNullValue(CntsArrayTy), "__llvm_prf_c_" + CachedCUID);
+ ContiguousCnts->setSection(CntsSectionName);
+ ContiguousCnts->setAlignment(Align(8));
+ ContiguousCnts->setVisibility(GlobalValue::ProtectedVisibility);
+ CompilerUsedVars.push_back(ContiguousCnts);
+
+ // Create contiguous uniform counter array (for AMDGPU divergence tracking)
+ std::string UCntsSectionName = "__llvm_prf_ucnts_" + CachedCUID;
+ ContiguousUCnts = new GlobalVariable(
+ M, CntsArrayTy, /*isConstant=*/false, GlobalValue::ExternalLinkage,
+ Constant::getNullValue(CntsArrayTy), "__profu_all_" + CachedCUID);
+ ContiguousUCnts->setSection(UCntsSectionName);
+ ContiguousUCnts->setAlignment(Align(8));
+ ContiguousUCnts->setVisibility(GlobalValue::ProtectedVisibility);
+ CompilerUsedVars.push_back(ContiguousUCnts);
+
+ if (TotalDataEntries > 0) {
+ auto *DataTy = getProfileDataTy();
+ auto *DataArrayTy = ArrayType::get(DataTy, TotalDataEntries);
+ std::string DataSectionName = getInstrProfSectionName(
+ ProfileCorrelate == InstrProfCorrelator::BINARY ? IPSK_covdata
+ : IPSK_data,
+ TT.getObjectFormat());
+ DataSectionName = DataSectionName + "_" + CachedCUID;
+
+ ContiguousData = new GlobalVariable(M, DataArrayTy, /*isConstant=*/false,
+ GlobalValue::ExternalLinkage, nullptr,
+ "__llvm_prf_d_" + CachedCUID);
+ ContiguousData->setSection(DataSectionName);
+ ContiguousData->setAlignment(Align(INSTR_PROF_DATA_ALIGNMENT));
+ ContiguousData->setVisibility(GlobalValue::ProtectedVisibility);
+ CompilerUsedVars.push_back(ContiguousData);
+
+ ContiguousDataInits.assign(TotalDataEntries,
+ Constant::getNullValue(DataTy));
+ }
+}
+
+// Create __llvm_offload_prf structure for GPU targets.
+// Uses the contiguous arrays allocated by allocateContiguousProfileArrays().
+void InstrLowerer::createProfileSectionSymbols() {
+ LLVM_DEBUG(llvm::dbgs() << "createProfileSectionSymbols() called\n");
+
+ if (!isGPUProfTarget(M) || !ContiguousCnts)
+ return;
+
+ auto &Ctx = M.getContext();
+ auto *Int8Ty = Type::getInt8Ty(Ctx);
+ auto *Int64Ty = Type::getInt64Ty(Ctx);
+
+ // Get address space from the contiguous counters
+ unsigned AS = ContiguousCnts->getType()->getPointerAddressSpace();
+ auto *Int8PtrTy = PointerType::get(Ctx, AS);
+
+ // Calculate sizes
+ uint64_t CntsSize =
+ M.getDataLayout().getTypeAllocSize(ContiguousCnts->getValueType());
+ uint64_t UCntsSize =
+ M.getDataLayout().getTypeAllocSize(ContiguousUCnts->getValueType());
+
+ // Data section boundaries.
+ GlobalValue *DataStart = nullptr;
+ GlobalValue *DataEndBase = nullptr;
+ uint64_t DataSize = 0;
+ if (ContiguousData) {
+ DataStart = ContiguousData;
+ DataEndBase = ContiguousData;
+ DataSize =
+ M.getDataLayout().getTypeAllocSize(ContiguousData->getValueType());
+ } else {
+ // Legacy per-function data variables: best-effort by scanning.
+ GlobalVariable *FirstData = nullptr;
+ GlobalVariable *LastData = nullptr;
+ for (auto &PD : ProfileDataMap) {
+ if (auto *GV = dyn_cast_or_null<GlobalVariable>(PD.second.DataVar)) {
+ if (!FirstData)
+ FirstData = GV;
+ LastData = GV;
+ }
+ }
+ DataStart = FirstData;
+ DataEndBase = LastData;
+ if (LastData)
+ DataSize = M.getDataLayout().getTypeAllocSize(LastData->getValueType());
+ }
+
+ LLVM_DEBUG({
+ llvm::dbgs() << "Section sizes: Cnts=" << CntsSize << " UCnts=" << UCntsSize
+ << " Data=" << DataSize << " Names=" << NamesSize << "\n";
+ });
+
+ // Helper to get start pointer
+ auto getStartPtr = [&](GlobalValue *GV) -> Constant * {
+ if (!GV)
+ return Constant::getNullValue(Int8PtrTy);
+ return ConstantExpr::getPointerBitCastOrAddrSpaceCast(GV, Int8PtrTy);
+ };
+
+ // Helper to get end pointer (base + size)
+ auto getEndPtr = [&](GlobalValue *GV, uint64_t Size) -> Constant * {
+ if (!GV)
+ return Constant::getNullValue(Int8PtrTy);
+ auto *BasePtr =
+ ConstantExpr::getPointerBitCastOrAddrSpaceCast(GV, Int8PtrTy);
+ return ConstantExpr::getGetElementPtr(Int8Ty, BasePtr,
+ ConstantInt::get(Int64Ty, Size));
+ };
+
+ // Build the __llvm_offload_prf structure
+ // Order: cnts_start, data_start, names_start, ucnts_start, cnts_end,
+ // data_end, names_end, ucnts_end
+ std::vector<Type *> StructFields(8, Int8PtrTy);
+ std::vector<Constant *> StructValues = {
+ getStartPtr(ContiguousCnts), // cnts_start
+ getStartPtr(DataStart), // data_start
+ getStartPtr(NamesVar), // names_start
+ getStartPtr(ContiguousUCnts), // ucnts_start
+ getEndPtr(ContiguousCnts, CntsSize), // cnts_end
+ getEndPtr(DataEndBase, DataSize), // data_end
+ getEndPtr(NamesVar, NamesSize), // names_end
+ getEndPtr(ContiguousUCnts, UCntsSize) // ucnts_end
+ };
+
+ auto *UnifiedStructTy = StructType::get(Ctx, StructFields);
+ auto *UnifiedStructInit = ConstantStruct::get(UnifiedStructTy, StructValues);
+
+ // Use CUID-suffixed name to avoid symbol collision in multi-TU programs.
+ // For static modules, the host side registers each TU's shadow variable.
+ // For dynamic modules (hipModuleLoad), the runtime enumerates symbols
+ // matching __llvm_offload_prf_* by parsing the code object ELF.
+ std::string OffloadPrfName = "__llvm_offload_prf_" + CachedCUID;
+ auto *UnifiedStruct = new GlobalVariable(
+ M, UnifiedStructTy, /*isConstant=*/true, GlobalValue::ExternalLinkage,
+ UnifiedStructInit, OffloadPrfName);
+ UnifiedStruct->setVisibility(GlobalValue::DefaultVisibility);
+ CompilerUsedVars.push_back(UnifiedStruct);
+}
+
+void InstrLowerer::createHIPDeviceVariableRegistration() {
+ if (isGPUProfTarget(M))
+ return;
+
+ std::string CUID = getCUIDFromModule(M);
+ if (CUID.empty())
+ return;
+
+ Function *Ctor = M.getFunction("__hip_module_ctor");
+ if (!Ctor)
+ return;
+
+ // Locate the HIP fat-binary registration call and capture its return value
+ Value *Handle = nullptr;
+ for (BasicBlock &BB : *Ctor)
+ for (Instruction &I : BB)
+ if (auto *CB = dyn_cast<CallBase>(&I))
+ if (Function *Callee = CB->getCalledFunction())
+ if (Callee->getName() == "__hipRegisterFatBinary") {
+ Handle = &I; // call result
+ break;
+ }
+ if (!Handle)
+ return;
+ GlobalVariable *FatbinHandleGV = nullptr;
+ if (auto *HandleInst = dyn_cast<Instruction>(Handle))
+ for (Instruction *Cur = HandleInst->getNextNode(); Cur;
+ Cur = Cur->getNextNode()) {
+ auto *SI = dyn_cast<StoreInst>(Cur);
+ if (!SI || SI->getValueOperand() != Handle)
+ continue;
+ if (auto *GV = dyn_cast<GlobalVariable>(
+ SI->getPointerOperand()->stripPointerCasts())) {
+ FatbinHandleGV = GV;
+ break;
+ }
+ }
+
+ if (!FatbinHandleGV) {
+ LLVM_DEBUG(llvm::dbgs()
+ << "store of __hipRegisterFatBinary call not found\n");
+ }
+
+  // Insert the new registration just before the ctor's return.
+ ReturnInst *RetInst = nullptr;
+ for (auto &BB : llvm::reverse(*Ctor))
+ if ((RetInst = dyn_cast<ReturnInst>(BB.getTerminator())))
+ break;
+ if (!RetInst)
+ return;
+ IRBuilder<> Builder(RetInst);
+
+ LLVM_DEBUG(
+ llvm::dbgs() << "Found __hip_module_ctor, registering anchors for CUID="
+ << CUID << "\n");
+
+ // Get or create the __hipRegisterVar declaration
+ auto *VoidTy = Type::getVoidTy(M.getContext());
+ auto *VoidPtrTy = PointerType::getUnqual(M.getContext());
+ auto *Int32Ty = Type::getInt32Ty(M.getContext());
+ auto *Int64Ty = Type::getInt64Ty(M.getContext());
+
+ auto *RegisterVarTy =
+ FunctionType::get(VoidTy,
+ {VoidPtrTy, VoidPtrTy, VoidPtrTy, VoidPtrTy, Int32Ty,
+ Int64Ty, Int32Ty, Int32Ty},
+ false);
+ FunctionCallee RegisterVarFunc =
+ M.getOrInsertFunction("__hipRegisterVar", RegisterVarTy);
+
+ Value *HipHandle =
+ FatbinHandleGV ? Builder.CreateLoad(VoidPtrTy, FatbinHandleGV) : Handle;
+
+  // Create the __llvm_offload_prf_<CUID> shadow structure on the host; it is
+  // populated with section boundary addresses from the device. The
+  // CUID-suffixed name matches the device symbol and avoids multi-TU
+  // collisions.
+ std::string OffloadPrfName = "__llvm_offload_prf_" + CUID;
+ auto *Int8PtrTy = PointerType::get(M.getContext(), 0);
+ std::vector<Type *> StructFields(8, Int8PtrTy);
+ auto *StructTy = StructType::get(M.getContext(), StructFields);
+
+ auto *OffloadPrfShadow = new GlobalVariable(
+ M, StructTy, /*isConstant=*/false, GlobalValue::ExternalLinkage,
+ ConstantAggregateZero::get(StructTy), OffloadPrfName);
+ CompilerUsedVars.push_back(OffloadPrfShadow);
+
+ // Register the unified structure with HIP runtime
+ auto *UnifiedNameStr =
+ ConstantDataArray::getString(M.getContext(), OffloadPrfName, true);
+ auto *UnifiedNameGlobal = new GlobalVariable(
+ M, UnifiedNameStr->getType(), /*isConstant=*/true,
+ GlobalValue::PrivateLinkage, UnifiedNameStr, OffloadPrfName + ".name");
+
+ Builder.CreateCall(RegisterVarFunc,
+ {HipHandle,
+ Builder.CreatePointerCast(OffloadPrfShadow, VoidPtrTy),
+ Builder.CreatePointerCast(UnifiedNameGlobal, VoidPtrTy),
+ Builder.CreatePointerCast(UnifiedNameGlobal, VoidPtrTy),
+ Builder.getInt32(0), // extern = 0
+ Builder.getInt64(64), // size = 64 (8 pointers * 8 bytes)
+ Builder.getInt32(0), // constant = 0
+ Builder.getInt32(0)}); // global = 0
+
+ // Register with the profile runtime so it knows to collect data from this TU
+ auto *RegisterShadowVarTy = FunctionType::get(VoidTy, {VoidPtrTy}, false);
+ FunctionCallee RegisterShadowVarFunc = M.getOrInsertFunction(
+ "__llvm_profile_offload_register_shadow_variable", RegisterShadowVarTy);
+ Builder.CreateCall(RegisterShadowVarFunc,
+ {Builder.CreatePointerCast(OffloadPrfShadow, VoidPtrTy)});
+
+ // Register per-section device symbols so compiler-rt can pre-register them
+ // with CLR before doing hipMemcpy (avoids HSA dependency).
+ FunctionCallee RegisterSectionShadowVarFunc = M.getOrInsertFunction(
+ "__llvm_profile_offload_register_section_shadow_variable",
+ RegisterShadowVarTy);
+
+ auto registerSectionSymbol = [&](StringRef SymName) {
+ // Create a 1-byte shadow global. The type/size are only used as a handle.
+ auto *I8Ty = Type::getInt8Ty(M.getContext());
+ GlobalVariable *Shadow = M.getGlobalVariable(SymName);
+ if (!Shadow) {
+ Shadow = new GlobalVariable(M, I8Ty, /*isConstant=*/false,
+ GlobalValue::ExternalLinkage,
+ ConstantInt::get(I8Ty, 0), SymName);
+ CompilerUsedVars.push_back(Shadow);
+ }
+
+ auto *NameStr = ConstantDataArray::getString(M.getContext(), SymName, true);
+ auto *NameGlobal = new GlobalVariable(
+ M, NameStr->getType(), /*isConstant=*/true, GlobalValue::PrivateLinkage,
+ NameStr, (SymName + ".name").str());
+
+ Builder.CreateCall(RegisterVarFunc,
+ {HipHandle, Builder.CreatePointerCast(Shadow, VoidPtrTy),
+ Builder.CreatePointerCast(NameGlobal, VoidPtrTy),
+ Builder.CreatePointerCast(NameGlobal, VoidPtrTy),
+ Builder.getInt32(0), // extern = 0
+ Builder.getInt64(1), // size = 1 byte (handle only)
+ Builder.getInt32(0), // constant = 0
+ Builder.getInt32(0)} // global = 0
+ );
+
+ Builder.CreateCall(RegisterSectionShadowVarFunc,
+ {Builder.CreatePointerCast(Shadow, VoidPtrTy)});
+ };
+
+ // Per-TU contiguous symbols (device side).
+ std::string CntsSym = std::string("__llvm_prf_c_") + CUID;
+ std::string DataSym = std::string("__llvm_prf_d_") + CUID;
+ std::string UCntsSym = std::string("__profu_all_") + CUID;
+ std::string NamesSym = std::string(getInstrProfNamesVarName()) + "_" + CUID;
+ registerSectionSymbol(CntsSym);
+ registerSectionSymbol(DataSym);
+ registerSectionSymbol(UCntsSym);
+ registerSectionSymbol(NamesSym);
+}
+
+} // namespace
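The generated registration function above hands per-TU shadow-variable pointers to the profile runtime via `__llvm_profile_offload_register_shadow_variable`. A minimal host-side sketch of what such a registry could look like follows; the real implementation lives in compiler-rt's InstrProfilingPlatformROCm.c, and the fixed-capacity array and its size here are illustrative assumptions, not the actual runtime data structure:

```cpp
#include <cassert>
#include <cstddef>

namespace {
// Hypothetical registry capacity; the real runtime's storage strategy
// may differ.
constexpr std::size_t kMaxShadowVars = 1024;
const void *ShadowVars[kMaxShadowVars];
std::size_t NumShadowVars = 0;

// Sketch of the entry point called once per TU from the generated
// registration function at program startup.
void registerShadowVariable(const void *Shadow) {
  if (NumShadowVars < kMaxShadowVars)
    ShadowVars[NumShadowVars++] = Shadow;
}
} // namespace
```

At exit, the runtime can walk this list to locate each TU's device counter sections and copy them back for profile writing.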
diff --git a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
index 0232d45e5b7bb..b1a943450cdab 100644
--- a/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
+++ b/llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
@@ -118,6 +118,7 @@
#include <algorithm>
#include <cassert>
#include <cstdint>
+#include <cstdlib>
#include <memory>
#include <numeric>
#include <optional>
@@ -379,8 +380,15 @@ class FunctionInstrumenter final {
// values. Supporting other values is relatively straight-forward - just
// another counter range within the context.
bool isValueProfilingDisabled() const {
+ // Value profiling is disabled for GPU targets because the device-side
+ // profiling runtime does not yet implement
+ // __llvm_profile_instrument_target. The existing compiler-rt
+ // implementation uses a locked linked list with an eviction policy,
+ // which is inefficient for massively parallel GPU execution. A
+ // GPU-optimized implementation is left as future work.
return DisableValueProfiling ||
- InstrumentationType == PGOInstrumentationType::CTXPROF;
+ InstrumentationType == PGOInstrumentationType::CTXPROF ||
+ isGPUProfTarget(M);
}
bool shouldInstrumentEntryBB() const {
@@ -1201,6 +1209,9 @@ class PGOUseFunc {
// Annotate the irreducible loop header weights.
void annotateIrrLoopHeaderWeights();
+ // Annotate per-block uniformity info for offload profiling.
+ void setBlockUniformityAttribute();
+
// The hotness of the function from the profile count.
enum FuncFreqAttr { FFA_Normal, FFA_Cold, FFA_Hot };
@@ -1307,11 +1318,34 @@ bool PGOUseFunc::setInstrumentedCounts(
setupBBInfoEdges(FuncInfo);
- unsigned NumCounters =
- InstrumentBBs.size() + FuncInfo.SIVisitor.getNumOfSelectInsts();
+ unsigned NumInstrumentedBBs = InstrumentBBs.size();
+ unsigned NumSelects = FuncInfo.SIVisitor.getNumOfSelectInsts();
+ unsigned NumCounters = NumInstrumentedBBs + NumSelects;
// The number of counters here should match the number of counters
// in profile. Return if they mismatch.
if (NumCounters != CountFromProfile.size()) {
+ LLVM_DEBUG({
+ dbgs() << "PGO COUNTER MISMATCH for function " << F.getName() << ":\n";
+ dbgs() << " Expected counters: " << NumCounters << "\n";
+ dbgs() << " - From instrumented edges: " << NumInstrumentedBBs << "\n";
+ for (size_t i = 0; i < InstrumentBBs.size(); ++i) {
+ dbgs() << " " << i << ": " << InstrumentBBs[i]->getName() << "\n";
+ }
+ dbgs() << " - From select instructions: " << NumSelects << "\n";
+ dbgs() << " Actual counters from profile: " << CountFromProfile.size()
+ << "\n";
+
+ // Dump the module to aid debugging the counter mismatch.
+ std::error_code EC;
+ std::string Filename = "pgo_mismatch_" + F.getName().str() + ".ll";
+ raw_fd_ostream OS(Filename, EC);
+ if (!EC) {
+ dbgs() << "Dumping module to " << Filename << "\n";
+ M->print(OS, nullptr);
+ } else {
+ dbgs() << "Error opening file " << Filename << " for writing\n";
+ }
+ });
return false;
}
auto *FuncEntry = &*F.begin();
@@ -1319,6 +1353,7 @@ bool PGOUseFunc::setInstrumentedCounts(
// Set the profile count to the Instrumented BBs.
uint32_t I = 0;
for (BasicBlock *InstrBB : InstrumentBBs) {
+
uint64_t CountValue = CountFromProfile[I++];
PGOUseBBInfo &Info = getBBInfo(InstrBB);
// If we reach here, we know that we have some nonzero count
@@ -1764,6 +1799,46 @@ void PGOUseFunc::annotateIrrLoopHeaderWeights() {
}
}
+void PGOUseFunc::setBlockUniformityAttribute() {
+ if (ProfileRecord.UniformityBits.empty())
+ return;
+
+ // Annotate uniformity on each instrumented IR basic block so later codegen
+ // passes (MachineFunction) can consume it without relying on fragile block
+ // numbering heuristics.
+ //
+ // Metadata kind: LLVMContext::MD_block_uniformity_profile
+ // Payload: i1 (true = uniform, false = divergent)
+
+ std::vector<BasicBlock *> InstrumentBBs;
+ FuncInfo.getInstrumentBBs(InstrumentBBs);
+
+ LLVMContext &Ctx = F.getContext();
+ Type *Int1Ty = Type::getInt1Ty(Ctx);
+
+ for (size_t I = 0, E = InstrumentBBs.size(); I < E; ++I) {
+ BasicBlock *BB = InstrumentBBs[I];
+ if (!BB || !BB->getTerminator())
+ continue;
+ bool IsUniform = ProfileRecord.isBlockUniform(I);
+ auto *MD = MDNode::get(
+ Ctx, ConstantAsMetadata::get(ConstantInt::get(Int1Ty, IsUniform)));
+ BB->getTerminator()->setMetadata(LLVMContext::MD_block_uniformity_profile,
+ MD);
+ }
+
+ // Keep a function attribute for debugging / IR inspection.
+ // Format: "U" for uniform, "D" for divergent, one per instrumented block.
+ std::string UniformityStr;
+ UniformityStr.reserve(InstrumentBBs.size());
+ for (size_t I = 0, E = InstrumentBBs.size(); I < E; ++I)
+ UniformityStr += ProfileRecord.isBlockUniform(I) ? 'U' : 'D';
+ F.addFnAttr("block-uniformity-profile", UniformityStr);
+
+ LLVM_DEBUG(dbgs() << "PGO: Set block uniformity profile for " << F.getName()
+ << ": " << UniformityStr << "\n");
+}
+
void SelectInstVisitor::instrumentOneSelectInst(SelectInst &SI) {
Module *M = F.getParent();
IRBuilder<> Builder(&SI);
@@ -2275,6 +2350,7 @@ static bool annotateAllFunctions(
Func.setBranchWeights();
Func.annotateValueSites();
Func.annotateIrrLoopHeaderWeights();
+ Func.setBlockUniformityAttribute();
PGOUseFunc::FuncFreqAttr FreqAttr = Func.getFuncFreqAttr();
if (FreqAttr == PGOUseFunc::FFA_Cold)
ColdFunctions.push_back(&F);
@@ -2409,14 +2485,14 @@ void llvm::setProfMetadata(Instruction *TI, ArrayRef<uint64_t> EdgeCounts,
uint64_t MaxCount) {
auto Weights = downscaleWeights(EdgeCounts, MaxCount);
- LLVM_DEBUG(dbgs() << "Weight is: "; for (const auto &W
- : Weights) {
+ LLVM_DEBUG(dbgs() << "Weight is: "; for (const auto &W : Weights) {
dbgs() << W << " ";
- } dbgs() << "\n";);
+ } dbgs() << "\n");
misexpect::checkExpectAnnotations(*TI, Weights, /*IsFrontend=*/false);
setBranchWeights(*TI, Weights, /*IsExpected=*/false);
+
if (EmitBranchProbability) {
std::string BrCondStr = getBranchCondString(TI);
if (BrCondStr.empty())
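The `block-uniformity-profile` function attribute added by `setBlockUniformityAttribute` above is a plain string with one character per instrumented block, 'U' for uniform and 'D' for divergent. A small standalone sketch of that encoding and the corresponding decoder (helper names are illustrative, not part of the patch):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Encode one character per instrumented block, in instrumentation order:
// 'U' = uniform, 'D' = divergent.
std::string encodeUniformity(const std::vector<bool> &Uniform) {
  std::string S;
  S.reserve(Uniform.size());
  for (bool U : Uniform)
    S += U ? 'U' : 'D';
  return S;
}

// Decode the attribute string back into per-block flags, e.g. in a
// later pass or an offline IR-inspection tool.
std::vector<bool> decodeUniformity(const std::string &S) {
  std::vector<bool> Uniform;
  Uniform.reserve(S.size());
  for (char C : S)
    Uniform.push_back(C == 'U');
  return Uniform;
}
```

The per-terminator `MD_block_uniformity_profile` metadata remains the authoritative channel for codegen; the attribute is only a debugging aid.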
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-3d-grid.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-3d-grid.ll
new file mode 100644
index 0000000000000..efd91b08c52bd
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-3d-grid.ll
@@ -0,0 +1,27 @@
+;; Test that AMDGPU PGO instrumentation generates contiguous counter arrays
+;; and profile section symbols with CUID-based naming. The __gpu_pgo_is_sampled
+;; library function handles 3D block linearization internally.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+ at __hip_cuid_abcdef789 = addrspace(1) global i8 0
+ at __profn_kernel_3d = private constant [9 x i8] c"kernel_3d"
+
+define amdgpu_kernel void @kernel_3d() {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel_3d, i64 12345, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+;; Check contiguous counter array with CUID suffix
+; CHECK: @__llvm_prf_c_abcdef789 = protected addrspace(1) global [1 x i64] zeroinitializer
+
+;; Check uniform counter array
+; CHECK: @__profu_all_abcdef789 = protected addrspace(1) global [1 x i64] zeroinitializer
+
+;; Check profile section symbol
+; CHECK: @__llvm_offload_prf_abcdef789 = addrspace(1) constant
+
+;; Check sampling guard calls library function
+; CHECK: call i32 @__gpu_pgo_is_sampled(i32 3)
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-contiguous-counters.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-contiguous-counters.ll
new file mode 100644
index 0000000000000..65064183a1fd2
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-contiguous-counters.ll
@@ -0,0 +1,44 @@
+;; Test that AMDGPU targets use contiguous counter allocation with CUID-based naming.
+;; This avoids linker reordering issues where individual __profc_* symbols could be
+;; placed in any order within the section.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+;; Simulate a module with CUID (as generated by HIP compilation)
+ at __hip_cuid_abc123 = addrspace(1) global i8 0
+
+ at __profn_kernel1 = private constant [7 x i8] c"kernel1"
+ at __profn_kernel2 = private constant [7 x i8] c"kernel2"
+
+;; Check that contiguous counter array is created with CUID suffix
+; CHECK: @__llvm_prf_c_abc123 = protected addrspace(1) global [{{[0-9]+}} x i64] zeroinitializer, section "__llvm_prf_cnts_abc123", align 8
+
+;; Check that contiguous uniform counter array is created for divergence tracking
+; CHECK: @__profu_all_abc123 = protected addrspace(1) global [{{[0-9]+}} x i64] zeroinitializer, section "__llvm_prf_ucnts_abc123", align 8
+
+;; Check that contiguous data array is created with CUID suffix
+; CHECK: @__llvm_prf_d_abc123 = protected addrspace(1) global
+
+;; Check that individual __profc_kernel* symbols are NOT created (contiguous mode)
+; CHECK-NOT: @__profc_kernel1
+; CHECK-NOT: @__profc_kernel2
+
+define amdgpu_kernel void @kernel1() {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel1, i64 12345, i32 2, i32 0)
+ call void @llvm.instrprof.increment(ptr @__profn_kernel1, i64 12345, i32 2, i32 1)
+ ret void
+}
+
+define amdgpu_kernel void @kernel2() {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel2, i64 67890, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+;; Check that __llvm_offload_prf_<CUID> structure is created with 8 pointers
+; CHECK: @__llvm_offload_prf_abc123 = addrspace(1) constant { ptr addrspace(1), ptr addrspace(1), ptr addrspace(1), ptr addrspace(1), ptr addrspace(1), ptr addrspace(1), ptr addrspace(1), ptr addrspace(1) }
+
+;; Per-function data symbols are aliases into the contiguous __profd_all array
+; CHECK: @__profd_kernel1 = protected alias
+; CHECK: @__profd_kernel2 = protected alias
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
new file mode 100644
index 0000000000000..b823dc7fdf06b
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-instrumentation.ll
@@ -0,0 +1,57 @@
+;; Test AMDGPU PGO instrumentation lowering with multiple basic blocks.
+;; Verifies:
+;; 1. Sampling decision is computed once in the entry block.
+;; 2. Each instrumentation point calls __gpu_pgo_increment behind the
+;; sampling guard branch.
+;; 3. No-sampling mode (sampling=0) calls __gpu_pgo_increment unconditionally.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof \
+; RUN: -offload-pgo-sampling=3 -S \
+; RUN: | FileCheck %s --check-prefix=SAMPLED
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof \
+; RUN: -offload-pgo-sampling=0 -S \
+; RUN: | FileCheck %s --check-prefix=NOSAMPLE
+
+ at __hip_cuid_test01 = addrspace(1) global i8 0
+ at __profn_test_kernel = private constant [11 x i8] c"test_kernel"
+
+define amdgpu_kernel void @test_kernel(ptr addrspace(1) %out, i32 %n) {
+entry:
+ call void @llvm.instrprof.increment(ptr @__profn_test_kernel, i64 111, i32 4, i32 0)
+ %cmp = icmp sgt i32 %n, 0
+ br i1 %cmp, label %if.then, label %if.end
+
+if.then:
+ call void @llvm.instrprof.increment(ptr @__profn_test_kernel, i64 111, i32 4, i32 1)
+ store i32 1, ptr addrspace(1) %out
+ br label %if.end
+
+if.end:
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+;; ---- Sampled mode (sampling=3) ----
+
+;; Entry block: sampling decision computed once
+; SAMPLED-LABEL: define {{.*}} @test_kernel
+; SAMPLED: entry:
+; SAMPLED: %pgo.sampled = call i32 @__gpu_pgo_is_sampled(i32 3)
+; SAMPLED: %pgo.matched = icmp ne i32 %pgo.sampled, 0
+; SAMPLED: br i1 %pgo.matched, label %po_then, label %po_cont
+
+;; Second instrumentation point reuses same sampling decision
+; SAMPLED: br i1 %pgo.matched, label %po_then{{[0-9]+}}, label %po_cont{{[0-9]+}}
+
+;; Both instrumentation points call the library function
+; SAMPLED: call void @__gpu_pgo_increment(
+; SAMPLED: call void @__gpu_pgo_increment(
+
+;; ---- No-sampling mode (sampling=0) ----
+
+;; No sampling guard: the increment is called directly
+; NOSAMPLE-LABEL: define {{.*}} @test_kernel
+; NOSAMPLE: entry:
+; NOSAMPLE-NOT: @__gpu_pgo_is_sampled
+; NOSAMPLE: call void @__gpu_pgo_increment(
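The patch summary states that `-offload-pgo-sampling=N` defaults to 3, sampling 12.5% of blocks, i.e. 1 in 2^N. The following host-side model reproduces that arithmetic; the low-bits mask used to pick blocks is an assumption for illustration, since the actual selection happens inside `__gpu_pgo_is_sampled` in the GPU profile library:

```cpp
#include <cassert>

// Model of the sampling decision: keep 1 in 2^N blocks, so N=3 keeps
// 12.5%. N=0 disables sampling and counts every block.
bool isSampled(unsigned BlockId, unsigned N) {
  if (N == 0)
    return true;
  return (BlockId & ((1u << N) - 1)) == 0;
}

// Count how many of NumBlocks block ids would be sampled at rate N.
unsigned countSampled(unsigned NumBlocks, unsigned N) {
  unsigned Count = 0;
  for (unsigned I = 0; I < NumBlocks; ++I)
    Count += isSampled(I, N);
  return Count;
}
```

This matches the instrumentation shape checked above: the decision is made once per kernel launch point, then every instrumentation site reuses the same `%pgo.matched` flag.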
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-uniform-counters.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-uniform-counters.ll
new file mode 100644
index 0000000000000..e106752b730cb
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-uniform-counters.ll
@@ -0,0 +1,21 @@
+;; Test that AMDGPU targets generate uniform counter arrays alongside regular
+;; counters. The uniform counter is passed to __gpu_pgo_increment which
+;; updates it when all lanes in the wave are active.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+ at __hip_cuid_test123 = addrspace(1) global i8 0
+ at __profn_test_kernel = private constant [11 x i8] c"test_kernel"
+
+define amdgpu_kernel void @test_kernel() {
+ call void @llvm.instrprof.increment(ptr @__profn_test_kernel, i64 12345, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+;; Check that uniform counter array is created
+; CHECK: @__profu_all_test123 = protected addrspace(1) global
+
+;; Check that __gpu_pgo_increment receives both counter and uniform counter
+; CHECK: call void @__gpu_pgo_increment(ptr addrspace(1) @__llvm_prf_c_test123, ptr addrspace(1) @__profu_all_test123, i64 1)
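Per the patch summary, `__gpu_pgo_increment` performs a warp-aggregate update: one lane issues a single atomic add of the active-lane count instead of 32 or 64 separate atomics, and the uniform counter is advanced only when every lane in the wave is active (as the test comment above states). A host-side model of that behavior follows; the exact amount added to the uniform counter is an assumption, since only the "all lanes active" condition is stated:

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Simulate one wave executing an instrumented block. ActiveMask has one
// bit per lane; Counter models the regular block counter and
// UniformCounter the divergence-tracking counter.
void waveIncrement(std::uint64_t ActiveMask, unsigned WaveSize,
                   std::uint64_t &Counter, std::uint64_t &UniformCounter) {
  std::uint64_t Active = std::bitset<64>(ActiveMask).count();
  // Single aggregated update on behalf of the whole wave.
  Counter += Active;
  std::uint64_t FullMask =
      WaveSize == 64 ? ~0ull : ((1ull << WaveSize) - 1);
  // All lanes active: the block executed uniformly for this wave.
  if (ActiveMask == FullMask)
    UniformCounter += Active;
}
```

Comparing the two counters during PGO-use then indicates how often a block ran with full waves, which feeds the divergence-aware spill placement mentioned in the summary.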
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave32.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave32.ll
new file mode 100644
index 0000000000000..073b60b18a252
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave32.ll
@@ -0,0 +1,29 @@
+;; Test that AMDGPU PGO instrumentation generates library calls for Wave32.
+;; Verifies sampling guard and __gpu_pgo_increment call with correct
+;; addrspace(1) counter pointers.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+ at __hip_cuid_abcdef456 = addrspace(1) global i8 0
+ at __profn_kernel_w32 = private constant [10 x i8] c"kernel_w32"
+
+define amdgpu_kernel void @kernel_w32() #0 {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel_w32, i64 12345, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+attributes #0 = { "target-cpu"="gfx1100" }
+
+;; Check wave size 32 in profile data
+; CHECK: @__llvm_prf_d_abcdef456 = {{.*}}i16 0, i16 32, i32 0
+
+;; Check sampling guard (default sampling=3)
+; CHECK: %pgo.sampled = call i32 @__gpu_pgo_is_sampled(i32 3)
+; CHECK: %pgo.matched = icmp ne i32 %pgo.sampled, 0
+; CHECK: br i1 %pgo.matched, label %po_then, label %po_cont
+
+;; Check library call with addrspace(1) pointers
+; CHECK: po_then:
+; CHECK: call void @__gpu_pgo_increment(ptr addrspace(1) @__llvm_prf_c_abcdef456, ptr addrspace(1) @__profu_all_abcdef456, i64 1)
diff --git a/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave64.ll b/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave64.ll
new file mode 100644
index 0000000000000..2b3e902f17089
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/amdgpu-wave64.ll
@@ -0,0 +1,28 @@
+;; Test that AMDGPU PGO instrumentation generates library calls for Wave64.
+;; Wave64 targets (e.g., gfx908) should embed wave size 64 in profile data.
+
+; RUN: opt %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s
+
+ at __hip_cuid_abcdef123 = addrspace(1) global i8 0
+ at __profn_kernel_w64 = private constant [10 x i8] c"kernel_w64"
+
+define amdgpu_kernel void @kernel_w64() #0 {
+ call void @llvm.instrprof.increment(ptr @__profn_kernel_w64, i64 12345, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+
+attributes #0 = { "target-cpu"="gfx908" }
+
+;; Check wave size 64 in profile data
+; CHECK: @__llvm_prf_d_abcdef123 = {{.*}}i16 0, i16 64, i32 0
+
+;; Check sampling guard
+; CHECK: %pgo.sampled = call i32 @__gpu_pgo_is_sampled(i32 3)
+; CHECK: %pgo.matched = icmp ne i32 %pgo.sampled, 0
+; CHECK: br i1 %pgo.matched, label %po_then, label %po_cont
+
+;; Check library call
+; CHECK: po_then:
+; CHECK: call void @__gpu_pgo_increment(ptr addrspace(1) @__llvm_prf_c_abcdef123, ptr addrspace(1) @__profu_all_abcdef123, i64 1)
diff --git a/llvm/test/Instrumentation/InstrProfiling/coverage.ll b/llvm/test/Instrumentation/InstrProfiling/coverage.ll
index 08cbcaa962b76..880cd1731c00a 100644
--- a/llvm/test/Instrumentation/InstrProfiling/coverage.ll
+++ b/llvm/test/Instrumentation/InstrProfiling/coverage.ll
@@ -5,12 +5,12 @@ target triple = "aarch64-unknown-linux-gnu"
@__profn_foo = private constant [3 x i8] c"foo"
; CHECK: @__profc_foo = private global [1 x i8] c"\FF", section "__llvm_prf_cnts", comdat, align 1
-; CHECK: @__profd_foo = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 {{.*}}, i64 {{.*}}, i64 sub (i64 ptrtoint (ptr @__profc_foo to i64)
-; BINARY: @__profd_foo = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 {{.*}}, i64 {{.*}}, i64 ptrtoint (ptr @__profc_foo to i64),
+; CHECK: @__profd_foo = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 {{.*}}, i64 {{.*}}, i64 sub (i64 ptrtoint (ptr @__profc_foo to i64)
+; BINARY: @__profd_foo = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 {{.*}}, i64 {{.*}}, i64 ptrtoint (ptr @__profc_foo to i64),
@__profn_bar = private constant [3 x i8] c"bar"
; CHECK: @__profc_bar = private global [1 x i8] c"\FF", section "__llvm_prf_cnts", comdat, align 1
-; CHECK: @__profd_bar = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 {{.*}}, i64 {{.*}}, i64 sub (i64 ptrtoint (ptr @__profc_bar to i64)
-; BINARY: @__profd_bar = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 {{.*}}, i64 {{.*}}, i64 ptrtoint (ptr @__profc_bar to i64),
+; CHECK: @__profd_bar = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 {{.*}}, i64 {{.*}}, i64 sub (i64 ptrtoint (ptr @__profc_bar to i64)
+; BINARY: @__profd_bar = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 {{.*}}, i64 {{.*}}, i64 ptrtoint (ptr @__profc_bar to i64),
; CHECK: @__llvm_prf_nm = {{.*}} section "__llvm_prf_names"
; BINARY: @__llvm_prf_nm ={{.*}} section "__llvm_covnames"
diff --git a/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll b/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
new file mode 100644
index 0000000000000..894fe8130a32d
--- /dev/null
+++ b/llvm/test/Instrumentation/InstrProfiling/gpu-weak.ll
@@ -0,0 +1,38 @@
+; RUN: opt < %s -passes=instrprof -S | FileCheck %s
+
+; Test that weak functions on GPU targets get weak linkage for their
+; __profd_ aliases to allow linker deduplication across TUs.
+; Non-weak functions get external linkage (default for aliases).
+
+target triple = "amdgcn-amd-amdhsa"
+
+ at __hip_cuid_abc123 = addrspace(1) global i8 0
+
+; Weak function should have weak linkage on its profile data alias
+; CHECK: @__profd_weak_func = weak protected alias
+ at __profn_weak_func = private constant [9 x i8] c"weak_func"
+
+define weak void @weak_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_weak_func, i64 0, i32 1, i32 0)
+ ret void
+}
+
+; Weak ODR function should have weak_odr linkage on its profile data alias
+; CHECK: @__profd_weak_odr_func = weak_odr protected alias
+ at __profn_weak_odr_func = private constant [13 x i8] c"weak_odr_func"
+
+define weak_odr void @weak_odr_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_weak_odr_func, i64 0, i32 1, i32 0)
+ ret void
+}
+
+; Non-weak function should have external linkage (no linkage keyword shown)
+; CHECK: @__profd_normal_func = protected alias
+ at __profn_normal_func = private constant [11 x i8] c"normal_func"
+
+define void @normal_func() {
+ call void @llvm.instrprof.increment(ptr @__profn_normal_func, i64 0, i32 1, i32 0)
+ ret void
+}
+
+declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
diff --git a/llvm/test/Instrumentation/InstrProfiling/inline-data-var-create.ll b/llvm/test/Instrumentation/InstrProfiling/inline-data-var-create.ll
index 456103164378e..25411a88e0f99 100644
--- a/llvm/test/Instrumentation/InstrProfiling/inline-data-var-create.ll
+++ b/llvm/test/Instrumentation/InstrProfiling/inline-data-var-create.ll
@@ -7,17 +7,18 @@
target triple = "x86_64-unknown-linux-gnu"
-; INLINEFIRST: @__profd_foo = private global{{.*}}zeroinitializer, i32 21
-; INLINEFIRST: @__profd_bar = private global{{.*}}zeroinitializer, i32 23
-; INLINEFIRST: @__profd_foobar = private global{{.*}}zeroinitializer, i32 99
-
-; INLINEAFTER: @__profd_foobar = private global{{.*}}zeroinitializer, i32 99
-; INLINEAFTER: @__profd_foo = private global{{.*}}zeroinitializer, i32 21
-; INLINEAFTER: @__profd_bar = private global{{.*}}zeroinitializer, i32 23
-
-; NOINLINE: @__profd_foobar = private global{{.*}}zeroinitializer, i32 99
-; NOINLINE: @__profd_foo = private global{{.*}}zeroinitializer, i32 21
-; NOINLINE: @__profd_bar = private global{{.*}}zeroinitializer, i32 23
+;; Note: struct layout is { ..., i32, [3 x i16], i16, i16, i32 } where the two i16s are NumOffloadProfilingThreads and OffloadDeviceWaveSize
+; INLINEFIRST: @__profd_foo = private global{{.*}}i16 0, i32 21
+; INLINEFIRST: @__profd_bar = private global{{.*}}i16 0, i32 23
+; INLINEFIRST: @__profd_foobar = private global{{.*}}i16 0, i32 99
+
+; INLINEAFTER: @__profd_foobar = private global{{.*}}i16 0, i32 99
+; INLINEAFTER: @__profd_foo = private global{{.*}}i16 0, i32 21
+; INLINEAFTER: @__profd_bar = private global{{.*}}i16 0, i32 23
+
+; NOINLINE: @__profd_foobar = private global{{.*}}i16 0, i32 99
+; NOINLINE: @__profd_foo = private global{{.*}}i16 0, i32 21
+; NOINLINE: @__profd_bar = private global{{.*}}i16 0, i32 23
declare void @llvm.instrprof.increment(ptr %0, i64 %1, i32 %2, i32 %3)
declare void @llvm.instrprof.mcdc.parameters(ptr %0, i64 %1, i32 %2)
diff --git a/llvm/test/Instrumentation/InstrProfiling/platform.ll b/llvm/test/Instrumentation/InstrProfiling/platform.ll
index 9c76a5caf2a51..ac38071ae3718 100644
--- a/llvm/test/Instrumentation/InstrProfiling/platform.ll
+++ b/llvm/test/Instrumentation/InstrProfiling/platform.ll
@@ -9,26 +9,33 @@
; RUN: opt < %s -mtriple=x86_64-pc-windows -passes=instrprof -S | FileCheck %s -check-prefix=WINDOWS
; RUN: opt < %s -mtriple=powerpc64-ibm-aix-xcoff -passes=instrprof -S | FileCheck %s -check-prefix=AIX
; RUN: opt < %s -mtriple=arm-elf -passes=instrprof -S | FileCheck %s -check-prefix=BAREMETAL
+; RUN: opt < %s -mtriple=amdgcn-amd-amdhsa -passes=instrprof -S | FileCheck %s -check-prefix=AMDGPU
@__profn_foo = private constant [3 x i8] c"foo"
; MACHO-NOT: __profn_foo
; ELF-NOT: __profn_foo
; WINDOWS-NOT: __profn_foo
; AIX-NOT: __profn_foo
+; AMDGPU-NOT: __profn_foo
; MACHO: @__profc_foo = private global [1 x i64] zeroinitializer, section "__DATA,__llvm_prf_cnts", align 8
; ELF: @__profc_foo = private global [1 x i64] zeroinitializer, section "__llvm_prf_cnts", comdat, align 8
; WINDOWS: @__profc_foo = private global [1 x i64] zeroinitializer, section ".lprfc$M", align 8
; AIX: @__profc_foo = private global [1 x i64] zeroinitializer, section "__llvm_prf_cnts", align 8
+;; AMDGPU without CUID uses per-function allocation (like ELF) for OpenMP compatibility
+; AMDGPU: @__profc_foo = private addrspace(1) global [{{[0-9]+}} x i64] zeroinitializer, section "__llvm_prf_cnts", comdat, align 8
; MACHO: @__profd_foo = private {{.*}}, section "__DATA,__llvm_prf_data,regular,live_support", align 8
; ELF: @__profd_foo = private {{.*}}, section "__llvm_prf_data", comdat($__profc_foo), align 8
; WINDOWS: @__profd_foo = private global {{.*}}, section ".lprfd$M", align 8
; AIX: @__profd_foo = private {{.*}}, section "__llvm_prf_data", align 8
+;; AMDGPU without CUID uses per-function data (not alias)
+; AMDGPU: @__profd_foo = protected addrspace(1) global {{.*}}, section "__llvm_prf_data", comdat($__profc_foo), align 8
; ELF: @__llvm_prf_nm = private constant [{{.*}} x i8] c"{{.*}}", section "{{.*}}__llvm_prf_names"{{.*}}, align 1
; WINDOWS: @__llvm_prf_nm = private constant [{{.*}} x i8] c"{{.*}}", section "{{.*}}lprfn$M", align 1
; AIX: @__llvm_prf_nm = private constant [{{.*}} x i8] c"{{.*}}", section "{{.*}}__llvm_prf_names", align 1
+; AMDGPU: @__llvm_prf_nm = protected addrspace(1) constant [{{.*}} x i8] c"{{.*}}", section "__llvm_prf_names", align 1
define void @foo() {
call void @llvm.instrprof.increment(ptr @__profn_foo, i64 0, i32 1, i32 0)
@@ -37,6 +44,9 @@ define void @foo() {
declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
+;; AMDGPU without CUID uses standard per-function allocation (for OpenMP compatibility)
+;; Start/stop symbols behavior is platform-specific
+
;; Emit registration functions for platforms that don't find the
;; symbols by their sections.
@@ -48,6 +58,7 @@ declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
; WINDOWS-NOT: define internal void @__llvm_profile_register_functions
; AIX-NOT: define internal void @__llvm_profile_register_functions
; BAREMETAL-NOT: define internal void @__llvm_profile_register_functions
+; AMDGPU-NOT: define internal void @__llvm_profile_register_functions
;; PR38340: When dynamic registration is used, we had a bug where we'd register
;; something that's not a __profd_* variable.
@@ -60,3 +71,4 @@ declare void @llvm.instrprof.increment(ptr, i64, i32, i32)
; WINDOWS-NOT: define internal void @__llvm_profile_init
; AIX-NOT: define internal void @__llvm_profile_init
; BAREMETAL-NOT: define internal void @__llvm_profile_init
+; AMDGPU-NOT: define internal void @__llvm_profile_init
diff --git a/llvm/test/Transforms/PGOProfile/Inputs/thinlto_indirect_call_promotion.profraw b/llvm/test/Transforms/PGOProfile/Inputs/thinlto_indirect_call_promotion.profraw
index 3daa98f937b691880ffff203c9426bfacddf749d..e3383db3a31dc16b1309fc8aa005bf1a04673bbe 100644
GIT binary patch
delta 229
zcmZ3$a)5=iu_!ISs37M*_e4&A-iB#~2NhTS|9 at h3uid?gsn&u5CxDDl1~_1xXed7M
zfPmtH{}2F^P{5*Y5~F}>0-8DlsJaDNePDALfd(*5G!!=cb+z}oS<2MgK=to0bNO;9
ZplNKtV!`AF#s&qL%^;IN0On#C9{{Q*OtSz0
delta 179
zcmX at WvVeuNu_!ISs37M**F;W#-i%9+pGdFz|9^9xwDgmSsn)Ckt3Zsm(i2Yzh_3h#
z0Sr)(0F|38z$l<vfF at Z0k_=^Fn4u>FrWt{nm?rx&N{d>csXPEPW&vY^0 at zHTHjr`<
J0Nanm2LLB#KyUy6
diff --git a/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll b/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
new file mode 100644
index 0000000000000..fc1c3c227bd05
--- /dev/null
+++ b/llvm/test/Transforms/PGOProfile/amdgpu-disable-value-profiling.ll
@@ -0,0 +1,23 @@
+;; Test that value profiling (indirect call profiling) is disabled for GPU targets.
+;; The device-side profiling runtime does not implement
+;; __llvm_profile_instrument_target, so indirect call profiling must not be emitted.
+
+; RUN: opt < %s -passes=pgo-instr-gen -S | FileCheck %s
+
+target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-p9:192:256:256:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8:9"
+target triple = "amdgcn-amd-amdhsa"
+
+ at fptr = addrspace(1) global ptr null, align 8
+
+;; Verify that regular block instrumentation IS emitted
+; CHECK: call void @llvm.instrprof.increment
+
+;; Verify that value profiling for indirect calls is NOT emitted
+; CHECK-NOT: call void @llvm.instrprof.value.profile
+
+define amdgpu_kernel void @test_indirect_call() {
+entry:
+ %fp = load ptr, ptr addrspace(1) @fptr, align 8
+ call void %fp()
+ ret void
+}
diff --git a/llvm/test/Transforms/PGOProfile/comdat_internal.ll b/llvm/test/Transforms/PGOProfile/comdat_internal.ll
index 1bad0db1b4762..a5a80e985d15e 100644
--- a/llvm/test/Transforms/PGOProfile/comdat_internal.ll
+++ b/llvm/test/Transforms/PGOProfile/comdat_internal.ll
@@ -13,9 +13,9 @@ $foo = comdat any
; CHECK: @__llvm_profile_raw_version = hidden constant i64 {{[0-9]+}}, comdat
; CHECK-NOT: __profn__stdin__foo
; CHECK: @__profc__stdin__foo.[[#FOO_HASH]] = private global [1 x i64] zeroinitializer, section "__llvm_prf_cnts", comdat, align 8
-; CHECK: @__profd__stdin__foo.[[#FOO_HASH]] = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 {{.*}}, i64 [[#FOO_HASH]], i64 sub (i64 ptrtoint (ptr @__profc__stdin__foo.742261418966908927 to i64), i64 ptrtoint (ptr @__profd__stdin__foo.742261418966908927 to i64)), i64 0, ptr null
+; CHECK: @__profd__stdin__foo.[[#FOO_HASH]] = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 {{.*}}, i64 [[#FOO_HASH]], i64 sub (i64 ptrtoint (ptr @__profc__stdin__foo.742261418966908927 to i64), i64 ptrtoint (ptr @__profd__stdin__foo.742261418966908927 to i64)), i64 0, ptr null
; CHECK-NOT: @foo
-; CHECK-SAME: , ptr null, i32 1, [3 x i16] zeroinitializer, i32 0 }, section "__llvm_prf_data", comdat($__profc__stdin__foo.[[#FOO_HASH]]), align 8
+; CHECK-SAME: , ptr null, i32 1, [3 x i16] zeroinitializer, i16 0, i16 0, i32 0 }, section "__llvm_prf_data", comdat($__profc__stdin__foo.[[#FOO_HASH]]), align 8
; CHECK: @__llvm_prf_nm
; CHECK: @llvm.compiler.used
diff --git a/llvm/test/Transforms/PGOProfile/instrprof_burst_sampling_fast.ll b/llvm/test/Transforms/PGOProfile/instrprof_burst_sampling_fast.ll
index 56d8364d8f543..07856cd90d11f 100644
--- a/llvm/test/Transforms/PGOProfile/instrprof_burst_sampling_fast.ll
+++ b/llvm/test/Transforms/PGOProfile/instrprof_burst_sampling_fast.ll
@@ -14,7 +14,7 @@ $__llvm_profile_raw_version = comdat any
; SAMPLE-VAR: @__llvm_profile_sampling = thread_local global i16 0, comdat
; SAMPLE-VAR: @__profc_f = private global [1 x i64] zeroinitializer, section "__llvm_prf_cnts", comdat, align 8
-; SAMPLE-VAR: @__profd_f = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i32 } { i64 -3706093650706652785, i64 12884901887, i64 sub (i64 ptrtoint (ptr @__profc_f to i64), i64 ptrtoint (ptr @__profd_f to i64)), i64 0, ptr @f.local, ptr null, i32 1, [3 x i16] zeroinitializer, i32 0 }, section "__llvm_prf_data", comdat($__profc_f), align 8
+; SAMPLE-VAR: @__profd_f = private global { i64, i64, i64, i64, ptr, ptr, i32, [3 x i16], i16, i16, i32 } { i64 -3706093650706652785, i64 12884901887, i64 sub (i64 ptrtoint (ptr @__profc_f to i64), i64 ptrtoint (ptr @__profd_f to i64)), i64 0, ptr @f.local, ptr null, i32 1, [3 x i16] zeroinitializer, i16 0, i16 0, i32 0 }, section "__llvm_prf_data", comdat($__profc_f), align 8
; SAMPLE-VAR: @__llvm_prf_nm = private constant {{.*}}, section "__llvm_prf_names", align 1
; SAMPLE-VAR: @llvm.compiler.used = appending global [2 x ptr] [ptr @__llvm_profile_sampling, ptr @__profd_f], section "llvm.metadata"
; SAMPLE-VAR: @llvm.used = appending global [1 x ptr] [ptr @__llvm_prf_nm], section "llvm.metadata"
diff --git a/llvm/test/Transforms/PGOProfile/vtable_profile.ll b/llvm/test/Transforms/PGOProfile/vtable_profile.ll
index aae1e2d8b4e49..0c554db05cfb4 100644
--- a/llvm/test/Transforms/PGOProfile/vtable_profile.ll
+++ b/llvm/test/Transforms/PGOProfile/vtable_profile.ll
@@ -49,7 +49,7 @@ target triple = "x86_64-unknown-linux-gnu"
@llvm.compiler.used = appending global [1 x ptr] [ptr @_ZTV5Base1], section "llvm.metadata"
; GEN: __llvm_profile_raw_version = comdat any
-; GEN: __llvm_profile_raw_version = hidden constant i64 72057594037927946, comdat
+; GEN: __llvm_profile_raw_version = hidden constant i64 72057594037927947, comdat
; GEN: __profn__Z4funci = private constant [8 x i8] c"_Z4funci"
; LOWER: $__profvt__ZTV7Derived = comdat nodeduplicate
diff --git a/llvm/test/tools/llvm-profdata/Inputs/c-general.profraw b/llvm/test/tools/llvm-profdata/Inputs/c-general.profraw
index a3e884343942ebc70ba95ab4ee006630b6816d80..1ce3941ea176a82580b9bfacfb3ddff4fa0b6355 100644
GIT binary patch
delta 197
zcmeyse?fq=u_!ISs37M*_e4&AK8-4#Kg(AA|KH*HnK?3aavZbN#0H0n4jdB?@H4Rp
zOuVQMWKUw`XJR*)EXWArOm+bB7z`!{LU;{~0*o^zPlRv|0HxSoOq?h``2(W>NF$Q~
zqsC-GCWt;JekPuP$$=1F15m19@<a&d08om<Ve&;VZxXWrqr_xEW{4&r=LX2Y0%nkb
Jn;V!5SOD3zJBt7S
delta 173
zcmca0@PVJRu_!ISs37M**F;W#-Upg17^PPI|37zb_`jzUQ==!I;Ae7?m^e{>;tzqz

z0*w4j>JF0?8TBVe0GSs&fZ~%o7zHLTU=(27F?l0U`~^^ao4{m8Ci%${OahY)m;@Lj
zfIR)l6-@k0YdSy@Ad?ON#Wzg82o&dF=4Yyah;I;>9KbBVm@zq#S%2~jAj<%3&Ix9L
J%@3FhSO7}0IpY8T
diff --git a/llvm/test/tools/llvm-profdata/Inputs/compressed.profraw b/llvm/test/tools/llvm-profdata/Inputs/compressed.profraw
index e3f77e870d4d20828119348e70eb44e6d39e0ec0..f768cf4fd38c63a8977885739aa052074757fc4c 100644
GIT binary patch
delta 197
zcmX@Wzd(Squ_!ISs37M*_e4&AK8Y@`_=;8k|94n!UC<snIgVLsVuQm(2absc_?cJ)
zCSKGBvL`X}GqD>?7Gwl*COZIm3<i?}A-o1g0md1VCqg&}fKqHPCQg)}{DDybq>)L0
zQDd?o6GR^qKNC;D<Uk0o0Vq{4c_M^!04T-bF!>^wH;Gw*QDU+nGei@Ra|2{x0W-+J
J%?->BEC7JLI_m%c
delta 173
zcmZ1=aDbn)u_!ISs37M**F;W#-UpR`+ZL?)|9`G8Q)T?b)aZ#P_?cWJCQg)}_(Nc_
z03$z>y2E5eM*YbVK;{Jxp!nntMuEu-7zG%2Ox_3-e*qNVCNSBNNq({flfYyHCIQ9>
zAWwgC1rtBhnhuZz$fN^6@ePwN0>wF)`I#yp;u{1e2QUjTW=u|G)}K5B$T9$%bAnl5
J^8;oF762DgIa2@t
diff --git a/llvm/test/tools/llvm-profdata/binary-ids-padding.test b/llvm/test/tools/llvm-profdata/binary-ids-padding.test
index cc3e6c38e6907..dbbb3abc4186f 100644
--- a/llvm/test/tools/llvm-profdata/binary-ids-padding.test
+++ b/llvm/test/tools/llvm-profdata/binary-ids-padding.test
@@ -18,7 +18,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
// There will be 2 20-byte binary IDs, so the total Binary IDs size will be 64 bytes.
// 2 * 8 binary ID sizes
// + 2 * 20 binary IDs (of size 20)
@@ -66,15 +66,17 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\067\265\035\031\112\165\023\344' >> %t.profraw
RUN: printf '\02\0\0\0\0\0\0\0' >> %t.profraw
-RUN: printf '\310\377\3\0\1\0\0\0' >> %t.profraw
+RUN: printf '\300\377\3\0\1\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\02\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\023\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\067\0\0\0\0\0\0\0' >> %t.profraw
diff --git a/llvm/test/tools/llvm-profdata/large-binary-id-size.test b/llvm/test/tools/llvm-profdata/large-binary-id-size.test
index b62bdad4ddb21..4310e83c3524d 100644
--- a/llvm/test/tools/llvm-profdata/large-binary-id-size.test
+++ b/llvm/test/tools/llvm-profdata/large-binary-id-size.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\40\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
diff --git a/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test b/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
index 705e5efaf5875..f58714a818bd4 100644
--- a/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
+++ b/llvm/test/tools/llvm-profdata/malformed-not-space-for-another-header.test
@@ -18,7 +18,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
@@ -50,6 +50,7 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\023\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\3\0foo\0\0\0' >> %t.profraw
diff --git a/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test b/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
index 157c13b926a7e..34b23bcd9acfd 100644
--- a/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
+++ b/llvm/test/tools/llvm-profdata/malformed-num-counters-zero.test
@@ -18,7 +18,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
@@ -56,6 +56,7 @@ RUN: cp %t.profraw %t-good.profraw
// Make NumCounters = 0 so that we get "number of counters is zero" error message
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\023\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\3\0foo\0\0\0' >> %t.profraw
@@ -66,12 +67,14 @@ ZERO: malformed instrumentation profile data: number of counters is zero
// Test a counter value greater than 2^56.
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-bad.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bad.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bad.profraw
// Counter value is 72057594037927937
RUN: printf '\1\0\0\0\0\0\0\1' >> %t-bad.profraw
RUN: printf '\3\0foo\0\0\0' >> %t-bad.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-good.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-good.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t-good.profraw
// Counter value is 72057594037927937
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-good.profraw
RUN: printf '\3\0foo\0\0\0' >> %t-good.profraw
diff --git a/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test b/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
index 83cf76f68fb63..a18bf0d4bc439 100644
--- a/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
+++ b/llvm/test/tools/llvm-profdata/malformed-ptr-to-counter-array.test
@@ -18,7 +18,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
@@ -56,6 +56,7 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\02\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
// Counter Section
diff --git a/llvm/test/tools/llvm-profdata/misaligned-binary-ids-size.test b/llvm/test/tools/llvm-profdata/misaligned-binary-ids-size.test
index 0f20a1b0b369c..b43b2408869c2 100644
--- a/llvm/test/tools/llvm-profdata/misaligned-binary-ids-size.test
+++ b/llvm/test/tools/llvm-profdata/misaligned-binary-ids-size.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t.profraw
// We should fail on this because the binary IDs is not a multiple of 8 bytes.
RUN: printf '\77\0\0\0\0\0\0\0' >> %t.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t.profraw
diff --git a/llvm/test/tools/llvm-profdata/profile-version.test b/llvm/test/tools/llvm-profdata/profile-version.test
index e811699ac63ed..7f627cb89c7d0 100644
--- a/llvm/test/tools/llvm-profdata/profile-version.test
+++ b/llvm/test/tools/llvm-profdata/profile-version.test
@@ -2,7 +2,7 @@ Test the profile version.
RUN: llvm-profdata merge -o %t.profdata %p/Inputs/basic.proftext
RUN: llvm-profdata show --profile-version %t.profdata | FileCheck %s
-CHECK: Profile version: 13
+CHECK: Profile version: 14
RUN: llvm-profdata merge -o %t.prev.profdata %p/Inputs/basic.proftext --write-prev-version
RUN: llvm-profdata show --profile-version %t.prev.profdata | FileCheck %s --check-prefix=PREV
diff --git a/llvm/test/tools/llvm-profdata/raw-32-bits-be.test b/llvm/test/tools/llvm-profdata/raw-32-bits-be.test
index 6b3dc96b36270..c83b365724bb2 100644
--- a/llvm/test/tools/llvm-profdata/raw-32-bits-be.test
+++ b/llvm/test/tools/llvm-profdata/raw-32-bits-be.test
@@ -3,7 +3,7 @@
UNSUPPORTED: system-windows
// Header
RUN: printf '\377lprofR\201' > %t
-RUN: printf '\0\0\0\0\0\0\0\12' >> %t
+RUN: printf '\0\0\0\0\0\0\0\13' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\2' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
@@ -27,17 +27,21 @@ RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\1' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\3' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\344\023\165\112\031\035\265\067' >> %t
RUN: printf '\0\0\0\0\0\0\0\2' >> %t
-RUN: printf '\0\377\377\330' >> %t
-RUN: printf '\2\377\377\323' >> %t
+RUN: printf '\0\377\377\320' >> %t
+RUN: printf '\2\377\377\313' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\2' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\1' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\023' >> %t
RUN: printf '\0\0\0\0\0\0\0\067' >> %t
diff --git a/llvm/test/tools/llvm-profdata/raw-32-bits-le.test b/llvm/test/tools/llvm-profdata/raw-32-bits-le.test
index 95625565f5c0c..8424bf4fff570 100644
--- a/llvm/test/tools/llvm-profdata/raw-32-bits-le.test
+++ b/llvm/test/tools/llvm-profdata/raw-32-bits-le.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201Rforpl\377' > %t
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\2\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
@@ -26,17 +26,21 @@ RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\1\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\3\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\067\265\035\031\112\165\023\344' >> %t
RUN: printf '\02\0\0\0\0\0\0\0' >> %t
-RUN: printf '\330\377\377\0' >> %t
-RUN: printf '\323\377\377\2' >> %t
+RUN: printf '\320\377\377\0' >> %t
+RUN: printf '\313\377\377\2' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0' >> %t
RUN: printf '\2\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\1\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\023\0\0\0\0\0\0\0' >> %t
RUN: printf '\067\0\0\0\0\0\0\0' >> %t
diff --git a/llvm/test/tools/llvm-profdata/raw-64-bits-be.test b/llvm/test/tools/llvm-profdata/raw-64-bits-be.test
index 5316ef5a3e559..02d9b789ffedc 100644
--- a/llvm/test/tools/llvm-profdata/raw-64-bits-be.test
+++ b/llvm/test/tools/llvm-profdata/raw-64-bits-be.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\377lprofr\201' > %t
-RUN: printf '\0\0\0\0\0\0\0\12' >> %t
+RUN: printf '\0\0\0\0\0\0\0\13' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\2' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
@@ -26,17 +26,21 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\1' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\3' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\344\023\165\112\031\035\265\067' >> %t
RUN: printf '\0\0\0\0\0\0\0\02' >> %t
-RUN: printf '\0\0\0\1\0\3\377\310' >> %t
-RUN: printf '\0\0\0\3\0\3\377\303' >> %t
+RUN: printf '\0\0\0\1\0\3\377\300' >> %t
+RUN: printf '\0\0\0\3\0\3\377\273' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\02' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\1' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\023' >> %t
RUN: printf '\0\0\0\0\0\0\0\067' >> %t
diff --git a/llvm/test/tools/llvm-profdata/raw-64-bits-le.test b/llvm/test/tools/llvm-profdata/raw-64-bits-le.test
index 58f4da8cf0386..94c50430f2dc8 100644
--- a/llvm/test/tools/llvm-profdata/raw-64-bits-le.test
+++ b/llvm/test/tools/llvm-profdata/raw-64-bits-le.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\2\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
@@ -26,17 +26,21 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\1\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\3\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\067\265\035\031\112\165\023\344' >> %t
RUN: printf '\02\0\0\0\0\0\0\0' >> %t
-RUN: printf '\310\377\3\0\1\0\0\0' >> %t
-RUN: printf '\303\377\3\0\3\0\0\0' >> %t
+RUN: printf '\300\377\3\0\1\0\0\0' >> %t
+RUN: printf '\273\377\3\0\3\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
RUN: printf '\02\0\0\0' >> %t
RUN: printf '\0\0\0\0\0\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\1\0\0\0' >> %t
+RUN: printf '\0\0\0\0' >> %t
RUN: printf '\023\0\0\0\0\0\0\0' >> %t
RUN: printf '\067\0\0\0\0\0\0\0' >> %t
diff --git a/llvm/test/tools/llvm-profdata/raw-two-profiles.test b/llvm/test/tools/llvm-profdata/raw-two-profiles.test
index 47cc6fa4fd7fe..8c6c79746553b 100644
--- a/llvm/test/tools/llvm-profdata/raw-two-profiles.test
+++ b/llvm/test/tools/llvm-profdata/raw-two-profiles.test
@@ -2,7 +2,7 @@
// whenever \12 (LF) is in the input string.
UNSUPPORTED: system-windows
RUN: printf '\201rforpl\377' > %t-foo.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t-foo.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
@@ -26,12 +26,13 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\023\0\0\0\0\0\0\0' >> %t-foo.profraw
RUN: printf '\3\0foo\0\0\0' >> %t-foo.profraw
RUN: printf '\201rforpl\377' > %t-bar.profraw
-RUN: printf '\12\0\0\0\0\0\0\0' >> %t-bar.profraw
+RUN: printf '\13\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\1\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
@@ -55,6 +56,7 @@ RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\02\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
+RUN: printf '\0\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\067\0\0\0\0\0\0\0' >> %t-bar.profraw
RUN: printf '\101\0\0\0\0\0\0\0' >> %t-bar.profraw
diff --git a/llvm/tools/llvm-profdata/llvm-profdata.cpp b/llvm/tools/llvm-profdata/llvm-profdata.cpp
index ab67d75770fee..c40f6950d7d3c 100644
--- a/llvm/tools/llvm-profdata/llvm-profdata.cpp
+++ b/llvm/tools/llvm-profdata/llvm-profdata.cpp
@@ -30,6 +30,7 @@
#include "llvm/Support/BalancedPartitioning.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Discriminator.h"
+#include "llvm/Support/Endian.h"
#include "llvm/Support/Errc.h"
#include "llvm/Support/FileSystem.h"
#include "llvm/Support/Format.h"
@@ -690,6 +691,62 @@ static void overlapInput(const std::string &BaseFilename,
}
}
+/// Read uniform counters from a .unifcnts file.
+/// Returns true if the file was successfully read, false otherwise.
+/// The uniform counters are stored in UniformCounters vector.
+static bool readUniformCountersFile(StringRef ProfileFilename,
+ std::vector<uint64_t> &UniformCounters) {
+ // Construct the .unifcnts filename by replacing the extension
+ SmallString<256> UniformFilename(ProfileFilename);
+ sys::path::replace_extension(UniformFilename, "unifcnts");
+
+ // Try to open the file
+ auto BufferOrErr = MemoryBuffer::getFile(UniformFilename);
+ if (!BufferOrErr) {
+ // File doesn't exist or can't be read - this is not an error,
+ // just means no uniform counters are available
+ return false;
+ }
+
+ auto &Buffer = *BufferOrErr.get();
+ const char *Data = Buffer.getBufferStart();
+ size_t Size = Buffer.getBufferSize();
+
+ // Minimum size: 4 uint64_t header fields
+ if (Size < 4 * sizeof(uint64_t))
+ return false;
+
+ // Read header
+ uint64_t Magic = support::endian::read64le(Data);
+ uint64_t Version = support::endian::read64le(Data + 8);
+ uint64_t NumCounters = support::endian::read64le(Data + 16);
+ uint64_t CountersSize = support::endian::read64le(Data + 24);
+
+ // Verify magic number
+ const uint64_t ExpectedMagic = 0x55434E5450524F46ULL; // "UCNTPROF"
+ if (Magic != ExpectedMagic)
+ return false;
+
+ // Verify version
+ if (Version != 1)
+ return false;
+
+ // Verify size
+ size_t ExpectedSize = 4 * sizeof(uint64_t) + CountersSize;
+ if (Size < ExpectedSize)
+ return false;
+
+ // Read counters
+ UniformCounters.resize(NumCounters);
+ const char *CounterData = Data + 4 * sizeof(uint64_t);
+ for (uint64_t i = 0; i < NumCounters; ++i) {
+ UniformCounters[i] =
+ support::endian::read64le(CounterData + i * sizeof(uint64_t));
+ }
+
+ return true;
+}
+
/// Load an input into a writer context.
static void
loadInput(const WeightedFile &Input, SymbolRemapper *Remapper,
@@ -820,11 +877,50 @@ loadInput(const WeightedFile &Input, SymbolRemapper *Remapper,
return;
}
+ // Try to read uniform counters file for AMDGPU divergence tracking
+ std::vector<uint64_t> UniformCounters;
+ bool HasUniformCounters =
+ readUniformCountersFile(Input.Filename, UniformCounters);
+ size_t UniformCounterOffset = 0;
+
for (auto &I : *Reader) {
if (Remapper)
I.Name = (*Remapper)(I.Name);
const StringRef FuncName = I.Name;
bool Reported = false;
+
+ // If we have uniform counters and this is an offload profile, compute
+ // uniformity from the uniform/total counter ratio
+ if (HasUniformCounters && I.NumOffloadProfilingThreads > 0) {
+ size_t NumCounters = I.Counts.size();
+ if (UniformCounterOffset + NumCounters <= UniformCounters.size()) {
+ // Compute uniformity bits from uniform counter ratio
+ size_t NumBlocks = NumCounters / (I.NumOffloadProfilingThreads + 1);
+ I.UniformityBits.resize((NumBlocks + 7) / 8, 0xFF); // Default: uniform
+
+ for (size_t BlockIdx = 0; BlockIdx < NumBlocks; ++BlockIdx) {
+ uint64_t TotalCount = 0;
+ uint64_t UniformCount = 0;
+
+ // Sum across all slots for this block
+ for (size_t Slot = 0; Slot < I.NumOffloadProfilingThreads; ++Slot) {
+ size_t Idx = BlockIdx * (I.NumOffloadProfilingThreads + 1) + Slot;
+ TotalCount += I.Counts[Idx];
+ UniformCount += UniformCounters[UniformCounterOffset + Idx];
+ }
+
+ // Compute uniformity ratio (90% threshold)
+ bool IsUniform =
+ (TotalCount == 0) || ((double)UniformCount / TotalCount >= 0.9);
+ if (!IsUniform) {
+ I.UniformityBits[BlockIdx / 8] &= ~(1 << (BlockIdx % 8));
+ }
+ }
+
+ UniformCounterOffset += NumCounters;
+ }
+ }
+
WC->Writer.addRecord(std::move(I), Input.Weight, [&](Error E) {
if (Reported) {
consumeError(std::move(E));
@@ -2979,6 +3075,16 @@ static int showInstrProfile(ShowFormat SFormat, raw_fd_ostream &OS) {
OS << (I == Start ? "" : ", ") << Func.Counts[I];
}
OS << "]\n";
+
+ // Show uniformity bits if present
+ if (!Func.UniformityBits.empty()) {
+ OS << " Block uniformity: [";
+ for (size_t I = Start, E = Func.Counts.size(); I < E; ++I) {
+ bool IsUniform = Func.isBlockUniform(I);
+ OS << (I == Start ? "" : ", ") << (IsUniform ? "U" : "D");
+ }
+ OS << "]\n";
+ }
}
if (ShowIndirectCallTargets) {
diff --git a/offload/plugins-nextgen/common/include/GlobalHandler.h b/offload/plugins-nextgen/common/include/GlobalHandler.h
index af7dac66ca85d..0802b6f818910 100644
--- a/offload/plugins-nextgen/common/include/GlobalHandler.h
+++ b/offload/plugins-nextgen/common/include/GlobalHandler.h
@@ -67,9 +67,11 @@ struct __llvm_profile_data {
extern "C" {
extern int __attribute__((weak)) __llvm_write_custom_profile(
- const char *Target, const __llvm_profile_data *DataBegin,
- const __llvm_profile_data *DataEnd, const char *CountersBegin,
- const char *CountersEnd, const char *NamesBegin, const char *NamesEnd,
+ const char *Target, const char *TUSuffix,
+ const __llvm_profile_data *DataBegin, const __llvm_profile_data *DataEnd,
+ const char *CountersBegin, const char *CountersEnd,
+ const char *UniformCountersBegin, const char *UniformCountersEnd,
+ const char *NamesBegin, const char *NamesEnd,
const uint64_t *VersionOverride);
}
/// PGO profiling data extracted from a GPU device
diff --git a/offload/plugins-nextgen/common/src/GlobalHandler.cpp b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
index b92c606d14da1..09416c18a3974 100644
--- a/offload/plugins-nextgen/common/src/GlobalHandler.cpp
+++ b/offload/plugins-nextgen/common/src/GlobalHandler.cpp
@@ -311,9 +311,11 @@ Error GPUProfGlobals::write() const {
memcpy(NamesBegin, NamesData.data(), NamesData.size());
// Invoke compiler-rt entrypoint
+ // Pass NULL for TUSuffix and UniformCounters (not used by OpenMP)
int result = __llvm_write_custom_profile(
- TargetTriple.str().c_str(), DataBegin, DataEnd, CountersBegin,
- CountersEnd, NamesBegin, NamesEnd, &Version);
+ TargetTriple.str().c_str(), /*TUSuffix=*/nullptr, DataBegin, DataEnd,
+ CountersBegin, CountersEnd, /*UniformCountersBegin=*/nullptr,
+ /*UniformCountersEnd=*/nullptr, NamesBegin, NamesEnd, &Version);
if (result != 0)
return Plugin::error(ErrorCode::HOST_IO,
"error writing GPU PGO data to file");