[Openmp-commits] [PATCH] D156996: [OpenMP][AMDGPU] Add Envar for controlling HSA busy queue tracking

Michael Halkenhäuser via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Aug 3 12:11:47 PDT 2023


mhalk added inline comments.


================
Comment at: openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp:1536
     return Plugin::success();
   }
 
----------------
kevinsala wrote:
> mhalk wrote:
> > kevinsala wrote:
> > > mhalk wrote:
> > > > kevinsala wrote:
> > > > > I rewrote the function trying to make it a bit simpler. Conceptually, I think the only difference is that `NextQueue` is always increased. Would that make sense?
> > > > > 
> > > > > ```
> > > > > inline Error assignNextQueue(AMDGPUStreamTy *Stream) {
> > > > >   uint32_t StartIndex = NextQueue % MaxNumQueues;
> > > > > 
> > > > >   if (OMPX_QueueTracking) {
> > > > >     // Find the first idle queue.
> > > > >     for (uint32_t I = 0; I < MaxNumQueues; ++I) {
> > > > >       if (!Queues[StartIndex].isBusy())
> > > > >         break;
> > > > > 
> > > > >       StartIndex = (StartIndex + 1) % MaxNumQueues;
> > > > >     }
> > > > >   }
> > > > > 
> > > > >   // Try to initialize queue and increase its users.
> > > > >   if (auto Err = Queues[StartIndex].init(Agent, QueueSize))
> > > > >     return Err;
> > > > >   Queues[StartIndex].addUser();
> > > > > 
> > > > >   // Assign the queue.
> > > > >   Stream->Queue = &Queues[StartIndex];
> > > > > 
> > > > >   // Always move to the next queue.
> > > > >   ++NextQueue;
> > > > > 
> > > > >   return Plugin::success();
> > > > > }
> > > > > ```
> > > > > 
> > > > > Also, in case `OMPX_QueueTracking` is enabled and no queues are idle, we could fall back to the queue with the fewest users. The minimum can be computed while iterating over the queues in that loop.
> > > > > 
> > > > Really like the reduced complexity.
> > > > 
> > > > Now that `NextQueue` is always increased, wouldn't that defeat the purpose of handling busy queues? -- at least for the first `MaxNumQueues` Stream requests.
> > > > Because I think we will now always look at a queue that has not been initialized and is therefore considered "idle"/not busy.
> > > > I guess keeping `NextQueue` at zero until the maximum number of queues is actually initialized is important, unless we find something equivalent.
> > > > I'll think about it a bit more.
> > > I don't think updating `NextQueue` will have an impact on the tracking mechanism. For the tracking, `NextQueue` just indicates which queue to start looking at. But we will traverse the whole array of queues if we don't find an idle one.
> > > 
> > > So, for the first `MaxNumQueues` Stream requests, the mechanism will do the following:
> > > 
> > > ```
> > > 1st request: NextQueue is 0 -> check queue 0 (idle) -> break
> > > 2nd request: NextQueue is 1 -> check queue 1 (idle) -> break
> > > 3rd request: NextQueue is 2 -> check queue 2 (idle) -> break
> > > ...
> > > ```
> > > 
> > > In these first `MaxNumQueues` requests, we will always break at the first loop iteration because no one will be using the queue at the `NextQueue` position yet. Then, after those first requests, we will probably start having to iterate over the queues to find whether any is idle.
> > Yes, but what if, upon the 2nd request, queue_0 is idle again?
> > Won't we initialize queue_1 even though an already initialized queue is ready?
> > 
> > Or do we not care about this? If so, and should you have time, please share some insight into this.
> True, it would make sense to re-use the already initialized one. That can easily be changed in the simplified function.
> 
> I have some other doubts regarding falling back to round robin when there is no idle queue. Since the number of queues is very low, I think improving the fallback (i.e., the path taken when all queues are busy) is important. The default number of queues is four, so we will already move to the fallback mechanism (simple round robin) as soon as 4 streams are working concurrently.
> 
> For the same cost as now, we can choose the queue with the minimum number of users instead. I believe this could keep the users (more or less) well balanced among the queues. Does that make sense?
Yes, that does make sense to me.

When you say "for the same cost", what do you have in mind?

I guess I'll check whether there are measurable drawbacks to round robin vs. "least contention", performance-wise.
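For reference, the fallback being discussed -- scan once for an idle queue, and track the least-loaded busy queue as the fallback in the same loop -- could be sketched roughly as below. This is only a standalone approximation: `FakeQueue`, `assignNextQueueIndex`, and the global counters are hypothetical stand-ins for the plugin's real queue bookkeeping in rtl.cpp, not the actual implementation.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for the plugin's queue type: tracks a user count
// and a busy flag.
struct FakeQueue {
  uint32_t NumUsers = 0;
  bool Busy = false;
  bool isBusy() const { return Busy; }
  void addUser() { ++NumUsers; }
};

constexpr uint32_t MaxNumQueues = 4;
std::array<FakeQueue, MaxNumQueues> Queues;
uint32_t NextQueue = 0;

// Pick an idle queue starting at NextQueue; if all queues are busy, fall
// back to the least-loaded one. The minimum is computed inside the same
// scan, so the cost matches the plain round-robin traversal.
uint32_t assignNextQueueIndex() {
  uint32_t Start = NextQueue % MaxNumQueues;
  uint32_t MinIndex = Start;

  for (uint32_t I = 0; I < MaxNumQueues; ++I) {
    uint32_t Candidate = (Start + I) % MaxNumQueues;
    if (!Queues[Candidate].isBusy()) {
      // First idle queue wins.
      MinIndex = Candidate;
      break;
    }
    // Otherwise remember the busy queue with the fewest users.
    if (Queues[Candidate].NumUsers < Queues[MinIndex].NumUsers)
      MinIndex = Candidate;
  }

  Queues[MinIndex].addUser();
  ++NextQueue; // Always advance, as in the simplified version above.
  return MinIndex;
}
```

While every queue still has an idle slot this behaves like the simplified round robin; only once all queues are busy does the user count start steering requests toward the least-contended queue.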


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D156996/new/

https://reviews.llvm.org/D156996
