[Openmp-commits] [PATCH] D138389: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior

Tue Nov 29 15:54:12 PST 2022

JonChesterfield added a comment.

A few comments above. General case, I don't see the benefit of getting constructs like:
`hsa_signal_t get() const { return Signal; }`
can we either not allow access to the private member, or just make it public and drop the getter?

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:181
+  /// Getter of the HSA memory pool.
+  hsa_amd_memory_pool_t get() const { return MemoryPool; }
+
----------------
this is pointless, just make the member public if we have to leak it. but again, existing code...

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:220
+                     const llvm::SmallVector<hsa_agent_t> &Agents) const {
+    // TODO: Ensure it is possible to enable the access. This can be retrieved
+    // through HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS. If it is not possible,
----------------
this is a scary comment, though presumably in existing code

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:292
+
+    *PtrStorage = MemoryManager->allocate(Size, nullptr);
+    if (*PtrStorage == nullptr)
----------------
jdoerfert wrote:
> jhuber6 wrote:
> > I'm not deeply familiar with the `MemoryManager`, but I thought that we didn't use it for AMD because the HSA plugin already handled its own caching in the memory pool? Maybe @JonChesterfield and @tianshilei1992 could comment further.
> I'm generally in favor of using it either way.
IIUC the HSA interface with 'memory pool' in the name doesn't implement a memory pool so caching it here is probably faster. I expect it'll interact badly with the asan development effort but that was rocm-internal last time I looked at it so can't influence decisions here

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:393
+    // the acquire memory order.
+    while (PacketId - hsa_queue_load_read_index_scacquire(Queue) >= Queue->size)
+      ;
----------------
does this behave sensibly on wrap around? the arithmetic used elsewhere is 

```
static uint64_t acquireAvailablePacketId(hsa_queue_t *Queue) {
  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
  bool Full = true;
  while (Full) {
    Full =
        PacketId >= (Queue->size + hsa_queue_load_read_index_scacquire(Queue));
  }
  return PacketId;
}
```

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:452
+
+  /// Lock and unlock the queue. Acquiring a packet, filling its fields and
+  /// publishing it should be performed under the same exclusive region without
----------------
This is strange. The whole point of the queue abstraction is that multiple threads can enqueue work onto it at the same time, so why would we be locking it?

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:481
+  /// Create an empty signal.
+  AMDGPUSignalTy() : Signal({0}) {}
+
----------------
Is {0} a meaningful value for a signal? e.g. can we pass it to signal_destroy or similar without it falling over?

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:499
+    while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
+                                     UINT64_MAX, HSA_WAIT_STATE_ACTIVE) != 0)
+      ;
----------------
comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad

================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:507
+
+  /// Signal decrementing by one.
+  void signal() {
----------------
This looks racy. If a few threads are calling decrement and one calls reset, it's going to break. 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138389/new/

https://reviews.llvm.org/D138389