[Openmp-commits] [PATCH] D138389: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior

Mon Dec 12 10:03:36 PST 2022

kevinsala marked 38 inline comments as done and an inline comment as not done.
kevinsala added inline comments.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:137
+/// If the \p TargetID does not contain a feature (default), do not map it.
+StringRef parseTargetID(StringRef TargetID, StringMap<bool> &FeatureMap) {
+  if (TargetID.empty())
----------------
kevinsala wrote:
> jdoerfert wrote:
> > jhuber6 wrote:
> > > arsenm wrote:
> > > > Do we not have this in a shared place already?
> > > I believe we have some support in `clang` somewhere, we could probably move that to somewhere in `llvm` and include it here instead.
> > I used the github search and the only place I found that looks for the sramecc+ string is the existing AMD plugin (same function basically):
> > https://github.com/llvm/llvm-project/blob/a35ad711d90497994701a99723a81badf3d4348e/openmp/libomptarget/plugins/amdgpu/src/rtl.cpp#L1855
> If you find it fine, I can move it to a new utils header in `plugins-nextgen/amdgpu/` and be included by both AMDGPU plugins.
Opened a separate patch for that: https://reviews.llvm.org/D139792

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:499
+    while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
+                                     UINT64_MAX, HSA_WAIT_STATE_ACTIVE) != 0)
+      ;
----------------
jdoerfert wrote:
> kevinsala wrote:
> > JonChesterfield wrote:
> > > comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad
> > True, that's something I was experimenting with. I'll change it back to blocked wait.
> I'm not sure which way to go. Add a TODO to try this out. I'm uncertain if the host thread can do something useful while it's not busy waiting.
Kept as BLOCKED waiting and added a TODO comment about it.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:536-537
+    // Kernel symbols have a ".kd" suffix.
+    std::string KernelName(getName());
+    KernelName += ".kd";
+
----------------
jhuber6 wrote:
> Small nit, can use LLVM's string, it's basically just a small vector of chars.
For some reason, that doesn't seem to work. It's printing strange characters after `".kd"`, even if I add a `\0`. I'm looking into it.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:725
+    // Lock the queue during the packet publishing process.
+    std::lock_guard<AMDGPUQueueTy> Lock(Queue);
+
----------------
jdoerfert wrote:
> JonChesterfield wrote:
> > Confused by this, why are we locking the queue?
> I think that's related to the answer above. Add a TODO to look into this.
Added a long comment about it on top of the Stream's `std::mutex` data member.

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:968
+      hsa_signal_t InputSignalRaw = InputSignal->get();
+      Status = hsa_amd_memory_async_copy(Inter, Queue.getAgent(), Src,
+                                         Queue.getAgent(), CopySize, 1,
----------------
JonChesterfield wrote:
> passing queue.getAgent() as two arguments here seems suspicious - shouldn't one of these be the GPU and one the host?
I'm looking into it. The documentation of `hsa_amd_memory_async_copy` in the header doesn't give much detail on what's the meaning of those two parameters and which are the consequences of passing GPU/CPU agents. I passed the GPU agent as both parameters as the original AMDGPU plugin does.

In case we pass the CPU agent as the src/dst agent, we will need to allow the CPU agent access to each of the device allocations. Otherwise, it will fail to comply with "//the agent must be able to directly access both the source and destination buffers in their current locations//" ([[ https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/master/src/inc/hsa_ext_amd.h#L1159 | header doc ]]).

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:1463
+        OMPX_StreamSize("LIBOMPTARGET_AMDGPU_STREAM_SIZE", 512),
+        OMPX_MaxAsyncCopySize("LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE",
+                              1 * 1024 * 1024), // 1MB
----------------
ye-luo wrote:
> kevinsala wrote:
> > jdoerfert wrote:
> > > ye-luo wrote:
> > > > LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES?
> > > We need documentation for those in openmp/docs, next to the other env vars.
> > I'll add the documentation for these envars. They control the following aspects:
> > 
> > 1) `LIBOMPTARGET_AMDGPU_QUEUE_SIZE`: The number of HSA packets that can be pushed into each HSA queue without waiting the driver to process them.
> > 2) `LIBOMPTARGET_AMDGPU_STREAM_SIZE`: Number of asynchronous operations (e.g., kernel launches, memory transfers) that can be pushed into each of our streams without waiting on their finalization. With the upcoming patch implementing dynamically sized streams, this envar will be renamed and become a hint for the initial stream size.
> > 3) `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_SIZE`: Up to this size, the memory copies (__tgt_rtl_submit_data, __tgt_rtl_retrieve_data) will be asynchronous operations appended to the corresponding stream. For larger transfer, it will become synchronous transfers. This can be seen `dataSubmitImpl` (1695).
> Thanks for adding docs.
> 1. What I found in hsa doc is "Number of packets the queue is expected to hold". This is more understandable than your line. I think your description is true but that is a second level interpretation. I think both lines can be useful in the doc. LIBOMPTARGET_AMDGPU_QUEUE_SIZE is an insufficient name. LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE is better IMO.
> 2. A STREAM may have many contents with different sizes. STREAM_SIZE is to vague. Better to have something like ASYNC_OP_DEPTH_PER_STREAM.
> 3. Still prefer COPY_BYTES.
Current names:
- `LIBOMPTARGET_AMDGPU_NUM_HSA_QUEUES` (default: 8)
- `LIBOMPTARGET_AMDGPU_HSA_QUEUE_SIZE` (default: 1024)
- `LIBOMPTARGET_AMDGPU_MAX_ASYNC_COPY_BYTES` (default: 1*1024*1024, 1MB)
- `LIBOMPTARGET_AMDGPU_NUM_INITIAL_SIGNALS` (default: 64), probably better to name `HSA_SIGNALS` instead of `SIGNALS`

The description of each one appears on top of the envar's declarations. I'll add documentation on openmp/docs in another patch.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138389/new/

https://reviews.llvm.org/D138389