[clang] [HIP] Allow partial linking for `-fgpu-rdc` (PR #81700)

Tue Feb 13 20:16:37 PST 2024

https://github.com/yxsamliu created https://github.com/llvm/llvm-project/pull/81700

`-fgpu-rdc` mode allows device functions call device functions in different TU. However, currently all device objects have to be linked together since only one fat binary is supported. This is time consuming for AMDGPU backend since it only supports LTO.

There are use cases that objects can be divided into groups in which device functions are self-contained but host functions are not. It is desirable to link/optimize/codegen the device code and generate a fatbin for each group, whereas partially link the host code with `ld -r` or generate a static library by using the `-emit-static-lib` option of clang. This avoids linking all device code together, therefore decreases the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin` for all objects for `-fgpu-rdc`. With this patch, clang emits an unique external symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a group of objects are linked together to generate a fatbin, the symbols are merged by alias and point to the same fat binary. Each group has its own fat binary. One executable or shared library can have multiple fat binaries. Device linking is done for undefined fab binary symbols only to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is introduced to facilitate debugging and tooling.

Fixes: https://github.com/llvm/llvm-project/issues/77018

>From 318117089831345caa13d8b4eeea23d0aa2c8588 Mon Sep 17 00:00:00 2001
From: "Yaxun (Sam) Liu" <yaxun.liu at amd.com>
Date: Tue, 13 Feb 2024 10:00:21 -0500
Subject: [PATCH] [HIP] Allow partial linking for `-fgpu-rdc`

`-fgpu-rdc` mode allows device functions call device functions
in different TU. However, currently all device objects
have to be linked together since only one fat binary
is supported. This is time consuming for AMDGPU backend
since it only supports LTO.

There are use cases that objects can be divided into groups
in which device functions are self-contained but host functions
are not. It is desirable to link/optimize/codegen the device
code and generate a fatbin for each group, whereas partially
link the host code with `ld -r` or generate a static library
by using the `-emit-static-lib` option of clang. This avoids
linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.

Previously, clang emits an external symbol `__hip_fatbin`
for all objects for `-fgpu-rdc`. With this patch, clang
emits an unique external symbol `__hip_fatbin_{cuid}`
for the fat binary for each object. When a group of objects
are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary.
Each group has its own fat binary. One executable or
shared library can have multiple fat binaries. Device
linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also
uniquefied and merged to avoid repeated registering.
Symbol `__hip_cuid_{cuid}` is introduced to facilitate
debugging and tooling.

Fixes: https://github.com/llvm/llvm-project/issues/77018
---
 clang/lib/CodeGen/CGCUDANV.cpp                |  22 +-
 clang/lib/CodeGen/CodeGenModule.cpp           |  10 +-
 clang/lib/Driver/ToolChains/HIPUtility.cpp    | 242 +++++++++++++++++-
 clang/test/CodeGenCUDA/device-stub.cu         |  10 +-
 .../test/CodeGenCUDA/host-used-device-var.cu  |   5 +-
 clang/test/Driver/hip-toolchain-rdc.hip       |  38 ++-
 6 files changed, 283 insertions(+), 44 deletions(-)

diff --git a/clang/lib/CodeGen/CGCUDANV.cpp b/clang/lib/CodeGen/CGCUDANV.cpp
index 5b43272bfa62f4..7d23f944732dbf 100644
--- a/clang/lib/CodeGen/CGCUDANV.cpp
+++ b/clang/lib/CodeGen/CGCUDANV.cpp
@@ -760,10 +760,10 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
       // to contain the fat binary but will be populated somewhere else,
       // e.g. by lld through link script.
       FatBinStr = new llvm::GlobalVariable(
-        CGM.getModule(), CGM.Int8Ty,
-        /*isConstant=*/true, llvm::GlobalValue::ExternalLinkage, nullptr,
-        "__hip_fatbin", nullptr,
-        llvm::GlobalVariable::NotThreadLocal);
+          CGM.getModule(), CGM.Int8Ty,
+          /*isConstant=*/true, llvm::GlobalValue::ExternalLinkage, nullptr,
+          "__hip_fatbin_" + CGM.getContext().getCUIDHash(), nullptr,
+          llvm::GlobalVariable::NotThreadLocal);
       cast<llvm::GlobalVariable>(FatBinStr)->setSection(FatbinConstantName);
     }
 
@@ -816,8 +816,8 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
   // thread safety of the loaded program. Therefore we can assume sequential
   // execution of constructor functions here.
   if (IsHIP) {
-    auto Linkage = CudaGpuBinary ? llvm::GlobalValue::InternalLinkage :
-        llvm::GlobalValue::LinkOnceAnyLinkage;
+    auto Linkage = CudaGpuBinary ? llvm::GlobalValue::InternalLinkage
+                                 : llvm::GlobalValue::ExternalLinkage;
     llvm::BasicBlock *IfBlock =
         llvm::BasicBlock::Create(Context, "if", ModuleCtorFunc);
     llvm::BasicBlock *ExitBlock =
@@ -826,11 +826,11 @@ llvm::Function *CGNVCUDARuntime::makeModuleCtorFunction() {
     // of HIP ABI.
     GpuBinaryHandle = new llvm::GlobalVariable(
         TheModule, PtrTy, /*isConstant=*/false, Linkage,
-        /*Initializer=*/llvm::ConstantPointerNull::get(PtrTy),
-        "__hip_gpubin_handle");
-    if (Linkage == llvm::GlobalValue::LinkOnceAnyLinkage)
-      GpuBinaryHandle->setComdat(
-          CGM.getModule().getOrInsertComdat(GpuBinaryHandle->getName()));
+        /*Initializer=*/
+            CudaGpuBinary ? llvm::ConstantPointerNull::get(PtrTy) : nullptr,
+        CudaGpuBinary
+            ? "__hip_gpubin_handle"
+            : "__hip_gpubin_handle_" + CGM.getContext().getCUIDHash());
     GpuBinaryHandle->setAlignment(CGM.getPointerAlign().getAsAlign());
     // Prevent the weak symbol in different shared libraries being merged.
     if (Linkage != llvm::GlobalValue::InternalLinkage)
diff --git a/clang/lib/CodeGen/CodeGenModule.cpp b/clang/lib/CodeGen/CodeGenModule.cpp
index c984260b082cd1..218066bced6c19 100644
--- a/clang/lib/CodeGen/CodeGenModule.cpp
+++ b/clang/lib/CodeGen/CodeGenModule.cpp
@@ -919,7 +919,15 @@ void CodeGenModule::Release() {
         llvm::ConstantArray::get(ATy, UsedArray), "__clang_gpu_used_external");
     addCompilerUsedGlobal(GV);
   }
-
+  if (LangOpts.HIP) {
+    // Emit a unique ID so that host and device binaries from the same
+    // compilation unit can be associated.
+    auto *GV = new llvm::GlobalVariable(
+        getModule(), Int8Ty, false, llvm::GlobalValue::ExternalLinkage,
+        llvm::Constant::getNullValue(Int8Ty),
+        "__hip_cuid_" + getContext().getCUIDHash());
+    addCompilerUsedGlobal(GV);
+  }
   emitLLVMUsed();
   if (SanStats)
     SanStats->finish();
diff --git a/clang/lib/Driver/ToolChains/HIPUtility.cpp b/clang/lib/Driver/ToolChains/HIPUtility.cpp
index f692458b775de2..4bd6926ec6e463 100644
--- a/clang/lib/Driver/ToolChains/HIPUtility.cpp
+++ b/clang/lib/Driver/ToolChains/HIPUtility.cpp
@@ -9,13 +9,24 @@
 #include "HIPUtility.h"
 #include "CommonArgs.h"
 #include "clang/Driver/Compilation.h"
+#include "clang/Driver/Options.h"
+#include "llvm/ADT/StringExtras.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/Object/Archive.h"
+#include "llvm/Object/ObjectFile.h"
+#include "llvm/Support/MD5.h"
+#include "llvm/Support/MemoryBuffer.h"
 #include "llvm/Support/Path.h"
+#include "llvm/Support/raw_ostream.h"
 #include "llvm/TargetParser/Triple.h"
+#include <deque>
+#include <set>
 
+using namespace clang;
 using namespace clang::driver;
 using namespace clang::driver::tools;
 using namespace llvm::opt;
+using llvm::dyn_cast;
 
 #if defined(_WIN32) || defined(_WIN64)
 #define NULL_FILE "nul"
@@ -36,6 +47,146 @@ static std::string normalizeForBundler(const llvm::Triple &T,
                      : T.normalize();
 }
 
+// Collect undefined __hip_fatbin* and __hip_gpubin_handle* symbols from all
+// input object or archive files.
+class HIPUndefinedFatBinSymbols {
+public:
+  HIPUndefinedFatBinSymbols(const Compilation &C)
+      : C(C), DiagID(C.getDriver().getDiags().getCustomDiagID(
+                  DiagnosticsEngine::Error,
+                  "Error collecting HIP undefined fatbin symbols: %0")),
+        Quiet(C.getArgs().hasArg(options::OPT__HASH_HASH_HASH)),
+        Verbose(C.getArgs().hasArg(options::OPT_v)){
+    populateSymbols();
+    if (Verbose) {
+      for (auto Name : FatBinSymbols)
+        llvm::errs() << "Found undefined HIP fatbin symbol: " << Name << "\n";
+      for (auto Name : GPUBinHandleSymbols)
+        llvm::errs() << "Found undefined HIP gpubin handle symbol: " << Name
+                     << "\n";
+    }
+  }
+
+  const std::set<std::string> &getFatBinSymbols() const {
+    return FatBinSymbols;
+  }
+
+  const std::set<std::string> &getGPUBinHandleSymbols() const {
+    return GPUBinHandleSymbols;
+  }
+
+private:
+  const Compilation &C;
+  unsigned DiagID;
+  bool Quiet;
+  bool Verbose;
+  std::set<std::string> FatBinSymbols;
+  std::set<std::string> GPUBinHandleSymbols;
+  const std::string FatBinPrefix = "__hip_fatbin";
+  const std::string GPUBinHandlePrefix = "__hip_gpubin_handle";
+
+  void populateSymbols() {
+    std::deque<const Action *> WorkList;
+    std::set<const Action *> Visited;
+
+    for (const auto &Action : C.getActions()) {
+      WorkList.push_back(Action);
+    }
+
+    while (!WorkList.empty()) {
+      const Action *CurrentAction = WorkList.front();
+      WorkList.pop_front();
+
+      if (!CurrentAction || !Visited.insert(CurrentAction).second)
+        continue;
+
+      if (const auto *IA = dyn_cast<InputAction>(CurrentAction)) {
+        std::string ID = IA->getId().str();
+        if (!ID.empty()) {
+          ID = llvm::utohexstr(llvm::MD5Hash(ID), /*LowerCase=*/true);
+          FatBinSymbols.insert(Twine(FatBinPrefix + "_" + ID).str());
+          GPUBinHandleSymbols.insert(
+              Twine(GPUBinHandlePrefix + "_" + ID).str());
+          continue;
+        }
+        const char *Filename = IA->getInputArg().getValue();
+        auto BufferOrErr = llvm::MemoryBuffer::getFile(Filename);
+        // Input action could be options to linker, therefore ignore it
+        // if cannot read it.
+        if (!BufferOrErr)
+          continue;
+
+        processInput(BufferOrErr.get()->getMemBufferRef());
+      } else
+        WorkList.insert(WorkList.end(), CurrentAction->getInputs().begin(),
+                        CurrentAction->getInputs().end());
+    }
+  }
+
+  void processInput(const llvm::MemoryBufferRef &Buffer) {
+    // Try processing as object file first.
+    auto ObjFileOrErr = llvm::object::ObjectFile::createObjectFile(Buffer);
+    if (ObjFileOrErr) {
+      processSymbols(**ObjFileOrErr);
+      return;
+    }
+
+    // Then try processing as archive files.
+    llvm::consumeError(ObjFileOrErr.takeError());
+    auto ArchiveOrErr = llvm::object::Archive::create(Buffer);
+    if (ArchiveOrErr) {
+      llvm::Error Err = llvm::Error::success();
+      llvm::object::Archive &Archive = *ArchiveOrErr.get();
+      for (auto &Child : Archive.children(Err)) {
+        auto ChildBufOrErr = Child.getMemoryBufferRef();
+        if (ChildBufOrErr)
+          processInput(*ChildBufOrErr);
+        else
+          errorHandler(ChildBufOrErr.takeError());
+      }
+
+      if (Err)
+        errorHandler(std::move(Err));
+      return;
+    }
+
+    // Ignore other files.
+    llvm::consumeError(ArchiveOrErr.takeError());
+  }
+  void processSymbols(const llvm::object::ObjectFile &Obj) {
+    for (const auto &Symbol : Obj.symbols()) {
+      auto FlagOrErr = Symbol.getFlags();
+      if (!FlagOrErr) {
+        errorHandler(FlagOrErr.takeError());
+        continue;
+      }
+
+      // Filter only undefined symbols
+      if (!(FlagOrErr.get() & llvm::object::SymbolRef::SF_Undefined)) {
+        continue;
+      }
+
+      auto NameOrErr = Symbol.getName();
+      if (!NameOrErr) {
+        errorHandler(NameOrErr.takeError());
+        continue;
+      }
+      llvm::StringRef Name = *NameOrErr;
+
+      if (Name.starts_with(FatBinPrefix))
+        FatBinSymbols.insert(Name.str());
+      else if (Name.starts_with(GPUBinHandlePrefix))
+        GPUBinHandleSymbols.insert(Name.str());
+    }
+  }
+
+  void errorHandler(llvm::Error Err) {
+    if (Quiet)
+      return;
+    C.getDriver().Diag(DiagID) << llvm::toString(std::move(Err));
+  }
+};
+
 // Construct a clang-offload-bundler command to bundle code objects for
 // different devices into a HIP fat binary.
 void HIP::constructHIPFatbinCommand(Compilation &C, const JobAction &JA,
@@ -130,26 +281,91 @@ void HIP::constructGenerateObjFileFromHIPFatBinary(
   auto HostTriple =
       C.getSingleOffloadToolChain<Action::OFK_Host>()->getTriple();
 
+  HIPUndefinedFatBinSymbols Symbols(C);
+
+  std::string PrimaryHipFatbinSymbol;
+  std::string
+      PrimaryGpuBinHandleSymbol;
+  bool FoundPrimaryHipFatbinSymbol =
+      false;
+  bool FoundPrimaryGpuBinHandleSymbol =
+      false;
+
+  std::vector<std::string>
+      AliasHipFatbinSymbols;
+  std::vector<std::string>
+      AliasGpuBinHandleSymbols;
+
+  // Iterate through symbols to find the primary ones and collect others for
+  // aliasing
+  for (const auto &Symbol : Symbols.getFatBinSymbols()) {
+    if (!FoundPrimaryHipFatbinSymbol) {
+      PrimaryHipFatbinSymbol = Symbol;
+      FoundPrimaryHipFatbinSymbol = true;
+    } else {
+      AliasHipFatbinSymbols.push_back(Symbol);
+    }
+  }
+
+  for (const auto &Symbol : Symbols.getGPUBinHandleSymbols()) {
+    if (!FoundPrimaryGpuBinHandleSymbol) {
+      PrimaryGpuBinHandleSymbol = Symbol;
+      FoundPrimaryGpuBinHandleSymbol = true;
+    } else {
+      AliasGpuBinHandleSymbols.push_back(Symbol);
+    }
+  }
+
   // Add MC directives to embed target binaries. We ensure that each
   // section and image is 16-byte aligned. This is not mandatory, but
   // increases the likelihood of data to be aligned with a cache block
   // in several main host machines.
   ObjStream << "#       HIP Object Generator\n";
   ObjStream << "# *** Automatically generated by Clang ***\n";
-  if (HostTriple.isWindowsMSVCEnvironment()) {
-    ObjStream << "  .section .hip_fatbin, \"dw\"\n";
-  } else {
-    ObjStream << "  .protected __hip_fatbin\n";
-    ObjStream << "  .type __hip_fatbin, at object\n";
-    ObjStream << "  .section .hip_fatbin,\"a\", at progbits\n";
+  if (FoundPrimaryGpuBinHandleSymbol) {
+    // Define the first gpubin handle symbol
+    if (HostTriple.isWindowsMSVCEnvironment()) {
+      ObjStream << "  .section .hip_gpubin_handle,\"dw\"\n";
+    } else {
+      ObjStream << "  .protected " << PrimaryGpuBinHandleSymbol << "\n";
+      ObjStream << "  .type " << PrimaryGpuBinHandleSymbol << ", at object\n";
+      ObjStream << "  .section .hip_gpubin_handle,\"aw\"\n";
+    }
+    ObjStream << "  .globl " << PrimaryGpuBinHandleSymbol << "\n";
+    ObjStream << "  .p2align 3\n"; // Align 8
+    ObjStream << PrimaryGpuBinHandleSymbol << ":\n";
+    ObjStream << "  .zero 8\n"; // Size 8
+
+    // Generate alias directives for other gpubin handle symbols
+    for (const auto &AliasSymbol : AliasGpuBinHandleSymbols) {
+      ObjStream << "  .globl " << AliasSymbol << "\n";
+      ObjStream << "  .set " << AliasSymbol << "," << PrimaryGpuBinHandleSymbol
+                << "\n";
+    }
+  }
+  if (FoundPrimaryHipFatbinSymbol) {
+    // Define the first fatbin symbol
+    if (HostTriple.isWindowsMSVCEnvironment()) {
+      ObjStream << "  .section .hip_fatbin,\"dw\"\n";
+    } else {
+      ObjStream << "  .protected " << PrimaryHipFatbinSymbol << "\n";
+      ObjStream << "  .type " << PrimaryHipFatbinSymbol << ", at object\n";
+      ObjStream << "  .section .hip_fatbin,\"a\", at progbits\n";
+    }
+    ObjStream << "  .globl " << PrimaryHipFatbinSymbol << "\n";
+    ObjStream << "  .p2align " << llvm::Log2(llvm::Align(HIPCodeObjectAlign))
+              << "\n";
+    // Generate alias directives for other fatbin symbols
+    for (const auto &AliasSymbol : AliasHipFatbinSymbols) {
+      ObjStream << "  .globl " << AliasSymbol << "\n";
+      ObjStream << "  .set " << AliasSymbol << "," << PrimaryHipFatbinSymbol
+                << "\n";
+    }
+    ObjStream << PrimaryHipFatbinSymbol << ":\n";
+    ObjStream << "  .incbin ";
+    llvm::sys::printArg(ObjStream, BundleFile, /*Quote=*/true);
+    ObjStream << "\n";
   }
-  ObjStream << "  .globl __hip_fatbin\n";
-  ObjStream << "  .p2align " << llvm::Log2(llvm::Align(HIPCodeObjectAlign))
-            << "\n";
-  ObjStream << "__hip_fatbin:\n";
-  ObjStream << "  .incbin ";
-  llvm::sys::printArg(ObjStream, BundleFile, /*Quote=*/true);
-  ObjStream << "\n";
   if (HostTriple.isOSLinux() && HostTriple.isOSBinFormatELF())
     ObjStream << "  .section .note.GNU-stack, \"\", @progbits\n";
   ObjStream.flush();
diff --git a/clang/test/CodeGenCUDA/device-stub.cu b/clang/test/CodeGenCUDA/device-stub.cu
index d7a7b1bb9fe956..60304647bd4c54 100644
--- a/clang/test/CodeGenCUDA/device-stub.cu
+++ b/clang/test/CodeGenCUDA/device-stub.cu
@@ -50,21 +50,19 @@
 // RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s \
 // RUN:     -fgpu-rdc -fcuda-include-gpubinary %t -o - -x hip \
 // RUN:   | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,LNX,RDC,HIP,HIPEF
-// RUN: %clang_cc1 -triple x86_64-linux-gnu -emit-llvm %s -o - -x hip\
+// RUN: %clang_cc1 -cuid=123 -triple x86_64-linux-gnu -emit-llvm %s -o - -x hip\
 // RUN:   | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=ALL,LNX,NORDC,HIP,HIPNEF
 
 // RUN: %clang_cc1 -triple x86_64-pc-windows-msvc -aux-triple amdgcn -emit-llvm %s \
 // RUN:     -fcuda-include-gpubinary %t -o - -x hip\
 // RUN:   | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,WIN
 
-// RUN: %clang_cc1 -triple x86_64-pc-windows-msvc -aux-triple amdgcn -emit-llvm %s \
+// RUN: %clang_cc1 -cuid=123 -triple x86_64-pc-windows-msvc -aux-triple amdgcn -emit-llvm %s \
 // RUN:     -o - -x hip\
 // RUN:   | FileCheck -allow-deprecated-dag-overlap %s --check-prefixes=ALL,WIN,HIP,HIPNEF
 
 #include "Inputs/cuda.h"
 
-// HIPNEF: $__hip_gpubin_handle = comdat any
-
 #ifndef NOGLOBALS
 // NORDC-DAG: @device_var = internal global i32
 // RDC-DAG: @device_var = global i32
@@ -161,7 +159,7 @@ __device__ void device_use() {
 // * constant unnamed string with GPU binary
 // CUDA: @[[FATBIN:.*]] = private constant{{.*}} c"GPU binary would be here.",
 // HIPEF: @[[FATBIN:.*]] = private constant{{.*}} c"GPU binary would be here.",{{.*}}align 4096
-// HIPNEF: @[[FATBIN:__hip_fatbin]] = external constant i8, section ".hip_fatbin"
+// HIPNEF: @[[FATBIN:__hip_fatbin_[0-9a-f]+]] = external constant i8, section ".hip_fatbin"
 // CUDANORDC-SAME: section ".nv_fatbin", align 8
 // CUDARDC-SAME: section "__nv_relfatbin", align 8
 // * constant struct that wraps GPU binary
@@ -177,7 +175,7 @@ __device__ void device_use() {
 // HIP-SAME: section ".hipFatBinSegment"
 // * variable to save GPU binary handle after initialization
 // CUDANORDC: @__[[PREFIX]]_gpubin_handle = internal global ptr null
-// HIPNEF: @__[[PREFIX]]_gpubin_handle = linkonce hidden global ptr null
+// HIPNEF: @__[[PREFIX]]_gpubin_handle_{{[0-9a-f]+}} = external hidden global ptr, align 8
 // * constant unnamed string with NVModuleID
 // CUDARDC: [[MODULE_ID_GLOBAL:@.*]] = private constant
 // CUDARDC-SAME: c"[[MODULE_ID:.+]]\00", section "__nv_module_id", align 32
diff --git a/clang/test/CodeGenCUDA/host-used-device-var.cu b/clang/test/CodeGenCUDA/host-used-device-var.cu
index 7cb31aff84264e..5328660c9dc9df 100644
--- a/clang/test/CodeGenCUDA/host-used-device-var.cu
+++ b/clang/test/CodeGenCUDA/host-used-device-var.cu
@@ -1,9 +1,9 @@
 // REQUIRES: amdgpu-registered-target
 // RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device -x hip %s \
 // RUN:   -std=c++17 -O3 -mllvm -amdgpu-internalize-symbols -emit-llvm -o - \
-// RUN:   | FileCheck -check-prefix=DEV %s
+// RUN:   -cuid=123 | FileCheck -check-prefix=DEV %s
 // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -x hip %s \
-// RUN:   -std=c++17 -O3 -emit-llvm -o - | FileCheck -check-prefix=HOST %s
+// RUN:   -std=c++17 -O3 -emit-llvm -o - -cuid=123 | FileCheck -check-prefix=HOST %s
 
 // Negative tests.
 
@@ -187,6 +187,7 @@ public:
 // DEV-SAME: {{^[^@]*}} @_ZL2u3
 // DEV-SAME: {{^[^@]*}} @_ZZ4fun1vE11static_var1
 // DEV-SAME: {{^[^@]*}} @_ZZZN21TestStaticVarInLambda3funEvENKUlPcE_clES0_E4var2
+// DEV-SAME: {{^[^@]*}} @__hip_cuid_{{[0-9a-f]+}}
 // DEV-SAME: {{^[^@]*}} @constexpr_var2b
 // DEV-SAME: {{^[^@]*}} @inline_var
 // DEV-SAME: {{^[^@]*}} @u1
diff --git a/clang/test/Driver/hip-toolchain-rdc.hip b/clang/test/Driver/hip-toolchain-rdc.hip
index 1827531f9cab7a..d19d8ccd6cb29e 100644
--- a/clang/test/Driver/hip-toolchain-rdc.hip
+++ b/clang/test/Driver/hip-toolchain-rdc.hip
@@ -1,7 +1,7 @@
 // REQUIRES: x86-registered-target
 // REQUIRES: amdgpu-registered-target
 
-// RUN: %clang -### --target=x86_64-linux-gnu \
+// RUN: %clang -### --target=x86_64-linux-gnu -v \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   --hip-device-lib=lib1.bc --hip-device-lib=lib2.bc \
 // RUN:   --hip-device-lib-path=%S/Inputs/hip_multiple_inputs/lib1 \
@@ -12,7 +12,7 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,LNX %s
 
-// RUN: %clang -### --target=x86_64-pc-windows-msvc \
+// RUN: %clang -### --target=x86_64-pc-windows-msvc -v \
 // RUN:   -x hip --cuda-gpu-arch=gfx803 --cuda-gpu-arch=gfx900 \
 // RUN:   --hip-device-lib=lib1.bc --hip-device-lib=lib2.bc \
 // RUN:   --hip-device-lib-path=%S/Inputs/hip_multiple_inputs/lib1 \
@@ -23,15 +23,31 @@
 // RUN:   %S/Inputs/hip_multiple_inputs/b.hip \
 // RUN: 2>&1 | FileCheck -check-prefixes=CHECK,MSVC %s
 
-// check code object alignment in dumped llvm-mc input
-// LNX: .protected __hip_fatbin
-// LNX: .type __hip_fatbin, at object
-// LNX: .section .hip_fatbin,"a", at progbits
-// MSVC: .section .hip_fatbin, "dw"
-// CHECK: .globl __hip_fatbin
-// CHECK: .p2align 12
-// CHECK: __hip_fatbin:
-// CHECK: .incbin "[[BUNDLE:.*hipfb]]"
+// check HIP fatbin and gpubin handle symbols and code object alignment in dumped llvm-mc input
+// CHECK: Found undefined HIP fatbin symbol: __hip_fatbin_[[ID1:[0-9a-f]+]]
+// CHECK: Found undefined HIP fatbin symbol: __hip_fatbin_[[ID2:[0-9a-f]+]]
+// CHECK: Found undefined HIP gpubin handle symbol: __hip_gpubin_handle_[[ID1]]
+// CHECK: Found undefined HIP gpubin handle symbol: __hip_gpubin_handle_[[ID2]]
+// LNX:  .protected __hip_gpubin_handle_[[ID1]]
+// LNX:  .type __hip_gpubin_handle_[[ID1]]
+// LNX-LABEL:  .section .hip_gpubin_handle,"aw"
+// MSVC-LABEL: .section .hip_gpubin_handle,"dw"
+// CHECK:  .globl __hip_gpubin_handle_[[ID1]]
+// CHECK-NEXT:  .p2align 3
+// CHECK-NEXT:__hip_gpubin_handle_[[ID1]]:
+// CHECK-NEXT:  .zero 8
+// CHECK-NEXT:  .globl __hip_gpubin_handle_[[ID2]]
+// CHECK-NEXT:  .set __hip_gpubin_handle_[[ID2]],__hip_gpubin_handle_[[ID1]]
+// LNX: .protected __hip_fatbin_[[ID1]]
+// LNX: .type __hip_fatbin_[[ID1]], at object
+// LNX-LABEL: .section .hip_fatbin,"a", at progbits
+// MSVC-LABEL: .section .hip_fatbin,"dw"
+// CHECK: .globl __hip_fatbin_[[ID1]]
+// CHECK-NEXT: .p2align 12
+// CHECK-NEXT:  .globl __hip_fatbin_[[ID2]]
+// CHECK-NEXT:  .set __hip_fatbin_[[ID2]],__hip_fatbin_[[ID1]]
+// CHECK-NEXT: __hip_fatbin_[[ID1]]:
+// CHECK-NEXT: .incbin "[[BUNDLE:.*hipfb]]"
 // LNX: .section .note.GNU-stack, "", @progbits
 // MSVC-NOT: .note.GNU-stack