[Openmp-commits] [PATCH] D138389: [OpenMP][libomptarget] Add AMDGPU NextGen plugin with asynchronous behavior

Kevin Sala Penadés via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Sat Dec 3 16:27:06 PST 2022


kevinsala added inline comments.


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:220
+                     const llvm::SmallVector<hsa_agent_t> &Agents) const {
+    // TODO: Ensure it is possible to enable the access. This can be retrieved
+    // through HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS. If it is not possible,
----------------
JonChesterfield wrote:
> this is a scary comment, though presumably in existing code
I can implement it, but I would perform this check only in debug mode.
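
For reference, a debug-mode check could look roughly like this. A minimal sketch against the public `hsa_ext_amd.h` API; the helper name is illustrative, not code from the patch:

```
#include <hsa/hsa.h>
#include <hsa/hsa_ext_amd.h>

/// Return true if the agent can access (or can be enabled to access) the pool.
static bool agentCanAccessPool(hsa_agent_t Agent, hsa_amd_memory_pool_t Pool) {
  hsa_amd_memory_pool_access_t Access;
  hsa_status_t Status = hsa_amd_agent_memory_pool_get_info(
      Agent, Pool, HSA_AMD_AGENT_MEMORY_POOL_INFO_ACCESS, &Access);
  // NEVER_ALLOWED means not even hsa_amd_agents_allow_access can enable it.
  return Status == HSA_STATUS_SUCCESS &&
         Access != HSA_AMD_MEMORY_POOL_ACCESS_NEVER_ALLOWED;
}
```

Wrapped in an assert, this would only run in debug builds.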


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:393
+    // the acquire memory order.
+    while (PacketId - hsa_queue_load_read_index_scacquire(Queue) >= Queue->size)
+      ;
----------------
JonChesterfield wrote:
> does this behave sensibly on wrap around? the arithmetic used elsewhere is 
> 
> ```
> static uint64_t acquireAvailablePacketId(hsa_queue_t *Queue) {
>   uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
>   bool Full = true;
>   while (Full) {
>     Full =
>         PacketId >= (Queue->size + hsa_queue_load_read_index_scacquire(Queue));
>   }
>   return PacketId;
> }
> ```
I obtained that method from the HSA standard document.
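
In any case, the subtraction form is the wrap-safe one: both indices are monotonic `uint64_t` counters, so `PacketId - ReadIndex` yields the number of in-flight packets modulo 2^64 even across wrap-around, whereas `Queue->size + ReadIndex` can itself overflow. A minimal standalone illustration (made-up values, not from the patch):

```
#include <cassert>
#include <cstdint>

int main() {
  const uint64_t QueueSize = 256;
  // 50 packets in flight, with the read index close to UINT64_MAX so that
  // QueueSize + ReadIndex wraps around while PacketId does not.
  uint64_t ReadIndex = UINT64_MAX - 99;
  uint64_t PacketId = ReadIndex + 50;

  // Subtraction form: the in-flight count is exact modulo 2^64. Not full.
  assert(PacketId - ReadIndex < QueueSize);

  // Comparison form: QueueSize + ReadIndex wraps to 156, so the (huge,
  // unwrapped) PacketId compares >= and the queue is wrongly seen as full.
  assert(PacketId >= QueueSize + ReadIndex);
  return 0;
}
```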


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:452
+
+  /// Lock and unlock the queue. Acquiring a packet, filling its fields and
+  /// publishing it should be performed under the same exclusive region without
----------------
JonChesterfield wrote:
> This is strange. The whole point of the queue abstraction is that multiple threads can enqueue work onto it at the same time, so why would we be locking it?
Multiple threads may enqueue work onto the queue concurrently thanks to the atomic operations that retrieve a packet slot and publish it (i.e., ring the doorbell). But, if I understood correctly, we shouldn't publish a packet (ring the doorbell with that packet id) while any previous packet in the queue is not yet ready to be processed. For instance, the following sequence would be invalid according to my understanding:

```
   Thread 1 (T1):                               Thread 2 (T2):
   ========                                     ========
1. Gets packet slot 0
2. Fills packet 0's fields                      Gets packet slot 1
3.                                              Fills packet 1's fields
4.                                              Publish packet writing doorbell with id 1
5. Publish packet writing doorbell with id 0
```

We could probably shrink the locked section, or remove the lock entirely, by preventing T2 (or, in the general case, any thread) from ringing the doorbell while a previous packet is not ready. But I don't consider that critical for now. If this lock turns out to be a bottleneck, we can try to optimize it.
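
A rough sketch of the exclusive region described above, assuming a `std::mutex` member and the standard HSA queue entry points (names are illustrative, not the patch's code):

```
#include <hsa/hsa.h>
#include <mutex>

uint64_t pushPacket(hsa_queue_t *Queue, std::mutex &QueueMutex,
                    const hsa_kernel_dispatch_packet_t &Prototype) {
  std::lock_guard<std::mutex> Lock(QueueMutex);

  // 1. Acquire a packet slot; spin while the queue is full. Note the spin
  //    happens while holding the lock, which is acceptable for a sketch.
  uint64_t PacketId = hsa_queue_add_write_index_relaxed(Queue, 1);
  while (PacketId - hsa_queue_load_read_index_scacquire(Queue) >= Queue->size)
    ;
  // HSA queue sizes are powers of two, so masking wraps the slot index.
  auto *Packet = (hsa_kernel_dispatch_packet_t *)Queue->base_address +
                 (PacketId & (Queue->size - 1));

  // 2. Fill the packet fields. The prototype's header must still be
  //    HSA_PACKET_TYPE_INVALID so the packet processor skips it.
  *Packet = Prototype;

  // 3. Publish: release-store the real header, then ring the doorbell.
  //    Because steps 1-3 happen under the lock, no thread can ring the
  //    doorbell past a packet that another thread is still filling.
  uint16_t Header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
  __atomic_store_n(&Packet->header, Header, __ATOMIC_RELEASE);
  hsa_signal_store_screlease(Queue->doorbell_signal, PacketId);
  return PacketId;
}
```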


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:481
+  /// Create an empty signal.
+  AMDGPUSignalTy() : Signal({0}) {}
+
----------------
JonChesterfield wrote:
> Is {0} a meaningful value for a signal? e.g. can we pass it to signal_destroy or similar without it falling over?
We shouldn't call destroy with the {0} handle. I can add asserts.
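
Something along these lines, as a sketch (assuming the member is a `hsa_signal_t` initialized to the `{0}` placeholder; the assert messages are illustrative):

```
#include <hsa/hsa.h>
#include <cassert>

struct AMDGPUSignalTy {
  AMDGPUSignalTy() : Signal({0}) {}

  hsa_status_t init(hsa_signal_value_t InitialValue = 1) {
    assert(Signal.handle == 0 && "signal already initialized");
    return hsa_signal_create(InitialValue, /*num_consumers=*/0,
                             /*consumers=*/nullptr, &Signal);
  }

  hsa_status_t deinit() {
    // Never pass the {0} placeholder handle to hsa_signal_destroy.
    assert(Signal.handle != 0 && "destroying an uninitialized signal");
    hsa_status_t Status = hsa_signal_destroy(Signal);
    Signal = {0};
    return Status;
  }

  hsa_signal_t Signal;
};
```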


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:499
+    while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
+                                     UINT64_MAX, HSA_WAIT_STATE_ACTIVE) != 0)
+      ;
----------------
JonChesterfield wrote:
> comment for why active instead of blocked? wait in particular sounds like we expect the signal to be unavailable when the function is called so busy-spin in the host thread seems bad
True, that's something I was experimenting with. I'll change it back to blocked wait.
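
I.e., swapping the wait-state hint, same call otherwise:

```
// HSA_WAIT_STATE_BLOCKED lets the runtime put the host thread to sleep
// instead of busy-spinning while the signal is still non-zero.
while (hsa_signal_wait_scacquire(Signal, HSA_SIGNAL_CONDITION_EQ, 0,
                                 UINT64_MAX, HSA_WAIT_STATE_BLOCKED) != 0)
  ;
```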


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:507
+
+  /// Signal decrementing by one.
+  void signal() {
----------------
JonChesterfield wrote:
> This looks racy. If a few threads are calling decrement and one calls reset, it's going to break. 
Reset is expected to be called when the signal is no longer in use by any other thread or by the HSA runtime. It should be called just before the signal is reused.
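
To illustrate the intended lifecycle, as member functions of the signal wrapper (a sketch; the exact calls in the patch may differ):

```
/// Safe to call concurrently from several threads.
void signal() { hsa_signal_subtract_screlease(Signal, 1); }

/// Precondition: no host thread and no HSA queue still references the
/// signal, e.g. the previous operation completed and was synchronized.
void reset(hsa_signal_value_t InitialValue = 1) {
  hsa_signal_store_relaxed(Signal, InitialValue);
}
```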


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:1931
+    bool Found = false;
+    HostAllocationsMutex.lock_shared();
+    if (!HostAllocations.empty()) {
----------------
JonChesterfield wrote:
> I can't work out what's going on here. The corresponding logic for erase looks up the pointer directly, should found not be the same? Also can't tell why we're recording the size of the allocation next to the pointer, as opposed to a DenseSet<void*>
The `__tgt_rtl_data_delete` operation should receive the same pointer returned by `__tgt_rtl_data_alloc`. As far as I know, partially deleting an allocated buffer is not valid. But in the case of `__tgt_rtl_data_submit/retrieve`, the pointer may come with an offset applied. Thus, we should check whether the provided pointer falls inside any host allocation, taking the allocation sizes into account.
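
The lookup is a `lower_bound`/`upper_bound`-style range query; roughly (a sketch, with an illustrative helper name):

```
#include <cstddef>
#include <map>

static bool isHostAllocation(const std::map<const void *, size_t> &Allocs,
                             const void *Ptr) {
  // First allocation whose base address is strictly greater than Ptr.
  auto It = Allocs.upper_bound(Ptr);
  if (It == Allocs.begin())
    return false;
  --It; // Allocation with the greatest base address <= Ptr.
  const char *Base = static_cast<const char *>(It->first);
  return static_cast<const char *>(Ptr) < Base + It->second;
}
```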


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:1973
+  /// with an intermediate pinned buffer.
+  std::map<const void *, size_t> HostAllocations;
+  mutable std::shared_mutex HostAllocationsMutex;
----------------
JonChesterfield wrote:
> this is implemented as a tree - why std::map?
To perform `lower_bound` operations with logarithmic complexity.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138389/new/

https://reviews.llvm.org/D138389


