[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.

Alexey Bataev via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Mon Jun 10 18:31:25 PDT 2019

ABataev added a comment.

In D62393#1537269 <https://reviews.llvm.org/D62393#1537269>, @jdoerfert wrote:

> In D62393#1533461 <https://reviews.llvm.org/D62393#1533461>, @ABataev wrote:
> > In D62393#1533454 <https://reviews.llvm.org/D62393#1533454>, @jdoerfert wrote:
> >
> > > >> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
> > > >  >
> > > > 
> > > > They all are tracked on this array.
> > >
> > > Where is that described?
> >
> >
> > Sorry, read the code. I'm not a debugger.
> First of all, this response is simply rude.
>  Second, you wrote the code so my question is well justified.
>  Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
>  Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?

First of all, you started this when expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.
Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime, it will take a lot of time. Try to understand this at first and then ask the questions about the particular details, but not the whole scheme. Or you want me ask you to explain the principles of Attributor in full? Or loop vectorizer? Or something else? Your question is not productive.
Third, the single array is used to handle up to 128 (not 256) inner parallel region.  It is fine from point of view of the standard. The number of supported inner parallel levels (both, active and inactive) is implementation defined.

>>>>   We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode eith atomic ops) the ptx compiler generates unsafe code for parallelLevel array without being marked as volatile (i.e. simple atomic-like construct in CUDa ).
>>> Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.
>>> You cannot just say that the PTX compiler generates unsafe code. That is not an argument that is a guess.
>> Volatile provides required level of atomicity per PTX ISA.
> Arguably, atomic accesses provide atomicity. You also did never respond to the question which other accesses actually race with the ones to `parallelLevel` or why you do not synchronize the accesses to `parallelLevel` across the warp explicitly.

I answered all the questions. We can use the explicit synchronizatiin, yes, but it will slow down the whole program execution.  Instead, it is more effective to synchronize the access to the single variable. Volatile modifier allows to do this without explicit atomic operations. In Cuda volatile is slightly different form that in C/C++ and can be used to implement relaxed atomic operations.
In this array,the memory is written by one thread but can be read by other threads in the same warp. Without volatile modifier or atomic ops the ptxas tool reorders operations and it leads to the undefined behaviour when the runtime is inlined.

  rOMP OpenMP



More information about the Openmp-commits mailing list