[Openmp-commits] [PATCH] D104418: [PoC][WIP][OpenMP][Offloading] Fixed data race in libomptarget caused by async data movement

Guilherme Valarini via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Jun 17 09:33:33 PDT 2021


gValarini added a comment.

In D104418#2824738 <https://reviews.llvm.org/D104418#2824738>, @tianshilei1992 wrote:

> So in one sentence, the root cause is the mismatch between the assumptions of the mapping lookup logic and the non-atomic data movement operations. It is not a problem for targets that do not support async data movement, because everything can be guarded by the lock. For targets that do support async data movement, we don't want to fall back to synchronous data movement either, for an obvious reason: performance. Data movement is extremely expensive, and multiple data movements can potentially run in parallel, even within a single target region.
>
> What I'm trying to do in this patch is to establish a dependency between the data movement and all other tasks using that data, utilizing device-side synchronization via events. The interesting thing about an event is that, from the host's perspective, "waiting for an event" just means inserting the event into the queue, and that's it. It is not a barrier for the host, because it does not block it. It is, however, a barrier for the device scheduler: all operations enqueued after the event can only start executing once the event is fulfilled.
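
If I understand the mechanism correctly, on the CUDA side this corresponds to something like the sketch below (minimal and simplified; the helper functions and names are mine, only `cuMemcpyHtoDAsync`, `cuEventRecord`, and `cuStreamWaitEvent` are real driver API calls):

  #include <cuda.h>

  // Producer side: issue the asynchronous H2D copy and record the event
  // right behind it on the same stream.
  void enqueueCopyAndSignal(CUdeviceptr DevPtr, void *HstPtr, size_t Size,
                            CUstream DataStream, CUevent Event) {
    cuMemcpyHtoDAsync(DevPtr, HstPtr, Size, DataStream);
    cuEventRecord(Event, DataStream);
  }

  // Consumer side: "waiting" only enqueues a dependency marker; the host
  // thread returns immediately. The device scheduler will not start any
  // later work on KernelStream until the recorded event is fulfilled.
  void enqueueConsumer(CUstream KernelStream, CUevent Event) {
    cuStreamWaitEvent(KernelStream, Event, 0);
    // ... enqueue here the kernel launch that reads DevPtr ...
  }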

I believe your root-cause description is correct, but I don't understand how the locks currently present in the mapping logic can fix the case where a target device only implements the synchronous API (the first problem described in 49940 <https://bugs.llvm.org/show_bug.cgi?id=49940>). Let me explain what I mean.

In D104418#2824738 <https://reviews.llvm.org/D104418#2824738>, @tianshilei1992 wrote:

> It's a data race between host and plugin. https://bugs.llvm.org/show_bug.cgi?id=49940 has a detailed analysis. Long story short, let's say we have two concurrent target regions `T1` and `T2`, both reading memory region `M`. Assume `T1` is scheduled to execute first. It then maps `M` to the device. The mapping consists of several steps:
> 0. Get the lock of the mapping table.
> 1. Look up the mapping table to see if `M` is already mapped. - No.
> 2. Allocate device memory for `M` and create a mapping entry for `M`.
> 3. Issue the data transfer.
> 4. Release the lock.
> 5. Do the remaining things, such as more data transfers, or directly launch the kernel.
>
> Everything seems fine. Now `T2` starts to execute. It also maps `M` to the device. However, its steps are slightly different.
> 0. Get the lock of the mapping table.
> 1. Look up the mapping table to see if `M` is already mapped. - Yes.
> 2. Since the mapping is already there, `T2` will NOT issue a data transfer. This improves performance; it makes no sense to transfer the same data several times.
> 3. Do the remaining things, like Step 5 above.
>
> This logic also looks fine. However, there is a data race. Since we now have async data movement, Step 3 for `T1` only guarantees that the movement of `M` has been issued; it cannot guarantee that the data is already on the device after Step 3. `T2`'s logic is: if the mapping is already in the table, it "assumes" the data is already on the device and therefore does not issue a data transfer. And then, bang! If `T2` is scheduled for execution by the GPU scheduler ahead of the data movement of `M`, the memory region on the device contains nothing from the host; its contents are garbage.

To the best of my knowledge, the step-by-step above has a small problem: the mapping table lock is not released in step 4; it is actually released between steps 2 and 3.

Since the proposed patch stores the event in the `AsyncInfoTy`, we can still hit the same data corruption problem when interacting with a synchronous-only plugin: `T1` may be preempted right after the mapping table lock is released (between steps 2 and 3), and `T2` will then go through its whole target region execution without issuing the appropriate data transfer.
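
To make the window concrete, here is roughly how I read the current mapping path (a simplified sketch, not the actual libomptarget code; the helper names are placeholders):

  #include <cstdint>
  #include <mutex>

  struct MapEntry { void *TgtPtr; /* ... */ };      // stand-in for the real entry type

  // Hypothetical helpers standing in for the real mapping-table operations.
  MapEntry *lookupMapping(void *HstPtr, int64_t Size);
  MapEntry *allocateAndInsertEntry(void *HstPtr, int64_t Size);
  void submitData(void *TgtPtr, void *HstPtr, int64_t Size);

  std::mutex DataMapMtx;                             // guards the mapping table

  void *getOrAllocTgtPtr(void *HstPtr, int64_t Size) {
    DataMapMtx.lock();
    MapEntry *Entry = lookupMapping(HstPtr, Size);   // Step 1
    bool IsNew = (Entry == nullptr);
    if (IsNew)
      Entry = allocateAndInsertEntry(HstPtr, Size);  // Step 2
    DataMapMtx.unlock();    // the table lock is dropped *before* the transfer

    if (IsNew)
      submitData(Entry->TgtPtr, HstPtr, Size);       // Step 3
    // Even if submitData() is fully synchronous, T1 can be preempted between
    // unlock() and submitData(); in that window T2 finds the entry, skips its
    // own transfer, and can reach the kernel launch before the data is there.
    return Entry->TgtPtr;
  }

So the race is on the mapping entry itself, not on the transfer mechanism.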

If what I described above is correct, we can fix this synchronous case with two simple changes on top of the https://reviews.llvm.org/D104382 patch:

1. Add a mutex to `HostDataToTargetTy` that synchronizes concurrent external accesses (meaning accesses from outside the `DeviceTy` code);
2. Make `TargetPointerResultTy` lock this mutex on construction and unlock it on destruction (a simple `std::unique_lock` should suffice; see the sketch below).

This way we ensure that multiple threads dealing with the same mapping table entry are synchronized, fixing the first problem in 49940 <https://bugs.llvm.org/show_bug.cgi?id=49940>. What do you think?
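
To make the idea concrete, a minimal sketch of the two changes (the member lists are simplified, and the names `UpdateMtx`/`EntryLock` are just placeholders, not existing code):

  #include <mutex>

  struct HostDataToTargetTy {
    // ... existing mapping-entry members (HstPtrBegin, TgtPtrBegin, ...) ...
    std::mutex UpdateMtx; // (1) serializes external accesses to this entry
  };

  struct TargetPointerResultTy {
    void *TargetPointer = nullptr;
    bool IsNewEntry = false;
    // (2) acquired on construction, released on destruction: the thread that
    // created the entry keeps it locked until its data transfer is done,
    // while any other thread mapping the same entry blocks on the lookup.
    std::unique_lock<std::mutex> EntryLock;

    TargetPointerResultTy(HostDataToTargetTy &Entry, void *TgtPtr, bool IsNew)
        : TargetPointer(TgtPtr), IsNewEntry(IsNew),
          EntryLock(Entry.UpdateMtx) {}
  };

Since `std::unique_lock` is movable, the result could still be returned by value from the lookup and kept alive by the caller until the transfer for a new entry has actually been issued (or completed, for the synchronous plugins).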


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104418/new/

https://reviews.llvm.org/D104418


