[PATCH] D139730: [OpenMP][DeviceRTL][AMDGPU] Support code object version 5

Mon Aug 7 05:53:11 PDT 2023

saiislam added a comment.

In D139730#4561622 <https://reviews.llvm.org/D139730#4561622>, @arsenm wrote:

> In D139730#4561619 <https://reviews.llvm.org/D139730#4561619>, @arsenm wrote:
>
>> In D139730#4561575 <https://reviews.llvm.org/D139730#4561575>, @jhuber6 wrote:
>>
>>> In D139730#4561573 <https://reviews.llvm.org/D139730#4561573>, @arsenm wrote:
>>>
>>>> In D139730#4561540 <https://reviews.llvm.org/D139730#4561540>, @jhuber6 wrote:
>>>>
>>>>> Could you explain briefly what the approach here is? I'm confused as to what's actually changed and how we're handling this difference. I thought if this was just the definition of some builtin function we could just rely on the backend to figure it out. Why do we need to know the code object version inside the device RTL?
>>>>
>>>> The build is called in the device rtl, so the device RTL needs to contain both implementations. The "backend figuring it out" is dead code elimination
>>>
>>> Okay, do we expect to re-use this interface anywhere? If it's just for OpenMP then we should probably copy the approach taken for `__omp_rtl_debug_kind`, which is a global created on the GPU by `CGOpenMPRuntimeGPU`'s constructor and does more or less the same thing.
>>
>> device libs replicates the same scheme using its own copy of an equivalent variable. Trying to merge those two together
>
> Although I guess that doesn't really need the builtin changes?

This builtin was already aware about cov4 and cov5. All this patch is changing is making it aware about a possibility where both needs to be present.
It is already used by device-libs, deviceRTL, and libc-gpu.
Also, encapsulating ABI related changes in implementation of the builtin allows other runtime developers to be agnostic to these lower level changes.

================
Comment at: clang/lib/CodeGen/CGBuiltin.cpp:17187-17188
+        Address(Result, CGF.Int16Ty, CharUnits::fromQuantity(2)));
+  } else {
+    if (Cov == clang::TargetOptions::COV_5) {
+      // Indexing the implicit kernarg segment.
----------------
jhuber6 wrote:
> nit.
There are a couple of common lines after the inner if-else, in the outer else section.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:1752
+    // if (auto Err = preAllocateDeviceMemoryPool())
+    //   return Err;
+
----------------
jhuber6 wrote:
> Leftoever?
No, it is not a left over.
One of the fields in cov5 implicitikernarg is heap_v1 ptr. It should point to a 128KB zero-initialized block of coarse-grained memory on each device before launching the kernel. This code was working a while ago, but right now it is failing most likely due to some latest change in devicertl memory handling mechanism.
I need to debug it with this patch, otherwise it will cause all target region code calling device-malloc to fail.
I will try to fix it before the next revision.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:2542
+  /// Get the address of pointer to the preallocated device memory pool.
+  void **getPreAllocatedDeviceMemoryPool() {
+    return &PreAllocatedDeviceMemoryPool;
----------------
jhuber6 wrote:
> Why do we need this? The current method shouldn't need to change if all we're doing is allocating memory of greater size.
`PreAllocatedDeviceMemoryPool` is the pointer which stores the intermediate value before it is written to heap_v1_ptr field of cov5 implicitkernarg.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:3036

+  if (getImplicitArgsSize() < utils::COV5_SIZE) {
+    DP("Setting fields of ImplicitArgs for COV4\n");
----------------
jhuber6 wrote:
> So we're required to emit some new arguments? I don't have any idea what'schanged between this COV4 and COV5 stuff.
In cov5, we need to set certain fields of the implicit kernel arguments before launching the kernel.
Please see [[ https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-code-object-kernel-argument-metadata-map-table-v5 | AMDHSA Code Object V5 Kernel Argument Metadata Map Additions and Changes]] for more details.

Only NumBlocks, NumThreads(XYZ), GridDims, and Heap_V1_ptr are relevant for us, so I have simplified code further.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:3037
+  if (getImplicitArgsSize() < utils::COV5_SIZE) {
+    DP("Setting fields of ImplicitArgs for COV4\n");
+  } else {
----------------
arsenm wrote:
> This isn't doing anything?
Earlier we used to set hostcall_buffer here, but not anymore.
I have left the message in DP just for debug help.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/utils/UtilitiesRTL.h:36

-// The implicit arguments of AMDGPU kernels.
-struct AMDGPUImplicitArgsTy {
-  uint64_t OffsetX;
-  uint64_t OffsetY;
-  uint64_t OffsetZ;
-  uint64_t HostcallPtr;
-  uint64_t Unused0;
-  uint64_t Unused1;
-  uint64_t Unused2;
+enum IMPLICITARGS : uint32_t {
+  COV4_SIZE = 56,
----------------
jhuber6 wrote:
> arsenm wrote:
> > This is getting duplicated a few places, should it move to a support header?
> > 
> > I don't love the existing APIs for this, I think a struct definition makes more sense
> The other user here is my custom loader, @JonChesterfield has talked about wanting a common HSA helper header for awhile now.
> 
> I agree that the struct definition is much better. Being able to simply allocate this size and then zero fill it is much cleaner.
Defining a struct for whole 256 byte of implicitargs in cov5 was becoming a little difficult due to different sizes of various fields (2, 4, 6, 8, 48, 72 bytes) along with multiple reserved fields in between. It made sense for cov4 because it only had 7 fields of 8 bytes each, where we needed only 4th field in OpenMP runtime (for hostcall_buffer).

Offset based lookups like the following allows handling/exposing only required fields across generations of ABI.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139730/new/

https://reviews.llvm.org/D139730