[Openmp-commits] [PATCH] D104418: [PoC][WIP][OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement

Shilei Tian via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Jun 17 08:04:46 PDT 2021


tianshilei1992 added a comment.

In D104418#2823848 <https://reviews.llvm.org/D104418#2823848>, @JonChesterfield wrote:

> So is the race this is trying to fix between the host and the target, or between regions running on the target?
>
> Passing data back to the host suggests the former, but that doesn't make sense, because the async operations are queued up within an async object that is not shared between host threads (and has a lexically scoped lifetime, iirc).
>
> If it's a race between different target regions on the device, we should be able to resolve that with no host involvement at all.

It's a data race between the host and the plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has a detailed analysis. Long story short, let's say we have two concurrent target regions `T1` and `T2`, both reading memory region `M`. Assume `T1` is scheduled to execute first. It then maps `M` to the device. The mapping consists of several steps:

0. Acquire the lock of the mapping table.
1. Look up `M` in the mapping table to see whether it is already mapped. - No.
2. Allocate device memory for `M` and create a mapping entry for `M`.
3. Issue the data transfer.
4. Release the lock.
5. Do the remaining work, such as more data transfers, or directly launch the kernel.

Everything seems fine. Now `T2` starts to execute. It also maps `M` to the device, but the steps are slightly different (see the sketch after this list):

0. Acquire the lock of the mapping table.
1. Look up `M` in the mapping table to see whether it is already mapped. - Yes.
2. Since the mapping is already there, `T2` will NOT issue a data transfer. This improves performance; it makes no sense to transfer the same data multiple times.
3. Do the remaining work, like Step 5 above.
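
Roughly, the lookup protocol above looks like the following sketch. All names (`MappingTable`, `lookupOrAllocate`) are hypothetical, and the raw CUDA runtime calls only stand in for whatever the plugin actually does; this is not libomptarget's real interface.

```
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <mutex>

struct MappingTable {
  std::mutex Lock;
  std::map<void *, void *> HstToTgt; // host ptr -> device ptr
};

// Returns the device copy of the host region starting at HstPtr.
void *lookupOrAllocate(MappingTable &MT, void *HstPtr, size_t Size,
                       cudaStream_t Stream) {
  std::lock_guard<std::mutex> G(MT.Lock);    // Step 0: acquire the lock
  auto It = MT.HstToTgt.find(HstPtr);        // Step 1: look up M
  if (It != MT.HstToTgt.end())
    return It->second;                       // T2's path: entry found, so
                                             // NO transfer is issued
  void *TgtPtr = nullptr;
  cudaMalloc(&TgtPtr, Size);                 // Step 2: allocate and map
  MT.HstToTgt[HstPtr] = TgtPtr;
  cudaMemcpyAsync(TgtPtr, HstPtr, Size,      // Step 3: the transfer is only
                  cudaMemcpyHostToDevice,    // *issued* here, not completed
                  Stream);
  return TgtPtr;                             // Step 4: lock released on return
}
```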

This logic also looks fine. However, there is a data race. Since we now have async data movement, Step 3 for `T1` only guarantees that the movement of `M` has been issued; it cannot guarantee that after Step 3 the data is already on the device. `T2`'s logic is: if the mapping is already in the table, it assumes the data is already on the device and therefore issues no data transfer. And then, bang! If the device scheduler happens to execute `T2` ahead of the data movement of `M`, the memory region on the device contains nothing from the host; its contents are garbage.
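
Continuing the sketch above, the racy interleaving looks like this (`kernelT1`/`kernelT2`, `Grid`, and `Block` are hypothetical stand-ins for the two target regions):

```
cudaStream_t Stream1, Stream2;
cudaStreamCreate(&Stream1);
cudaStreamCreate(&Stream2);

// T1 (miss): issues the copy, then launches on the same stream, so
// kernelT1 is correctly ordered after the copy by stream semantics.
void *TgtPtr = lookupOrAllocate(MT, HstPtr, Size, Stream1);
kernelT1<<<Grid, Block, 0, Stream1>>>(TgtPtr);

// T2 (hit): gets the device pointer back without issuing any copy, then
// launches on a *different* stream. Nothing orders Stream2 after the
// copy that may still be in flight on Stream1.
void *TgtPtr2 = lookupOrAllocate(MT, HstPtr, Size, Stream2);
kernelT2<<<Grid, Block, 0, Stream2>>>(TgtPtr2); // RACE: may read garbage
```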

So, in one sentence, the root cause is the mismatch between the assumption made by the mapping-lookup logic (entry present implies data present) and the non-atomic, asynchronous data movement. It is not a problem for targets that don't support async data movement, because there everything is guarded by the lock. We don't want to fall back to synchronous data movement here either, for the obvious reason: performance. Data movement is extremely heavy, and multiple data movements can potentially run in parallel, even within a single target region.

What I'm trying to do in this patch is to establish the dependency between the data movement and all other tasks using that data, utilizing device-side synchronization via events. The interesting property of an event is that, from the host's perspective, "wait for an event" just means inserting the wait into the queue, and that's it. It is not like a barrier, because it does not block the host. But it is indeed a barrier for the device scheduler: all enqueued operations following the event can only start executing once the event is fulfilled.
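
In CUDA terms the idea looks like the sketch below (the patch itself works through the libomptarget plugin interface, so these raw runtime calls are only illustrative):

```
// Producer (T1), immediately after issuing the copy on Stream1:
cudaEvent_t CopyDone;
cudaEventCreateWithFlags(&CopyDone, cudaEventDisableTiming);
cudaEventRecord(CopyDone, Stream1); // fulfilled once the copy completes

// Consumer (T2), before any use of the mapped data on Stream2.
// This call returns immediately on the host; it only tells the device
// scheduler that later work on Stream2 must wait for CopyDone.
cudaStreamWaitEvent(Stream2, CopyDone, 0);
kernelT2<<<Grid, Block, 0, Stream2>>>(TgtPtr2); // now ordered after the copy
```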


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104418/new/

https://reviews.llvm.org/D104418


