[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.

Tue Jun 11 08:50:58 PDT 2019

hfinkel added subscribers: jlebar, Hahnfeld, tra, reames, chandlerc.
hfinkel added a comment.

In D62393#1537313 <https://reviews.llvm.org/D62393#1537313>, @ABataev wrote:

> In D62393#1537269 <https://reviews.llvm.org/D62393#1537269>, @jdoerfert wrote:
>
> > In D62393#1533461 <https://reviews.llvm.org/D62393#1533461>, @ABataev wrote:
> >
> > > In D62393#1533454 <https://reviews.llvm.org/D62393#1533454>, @jdoerfert wrote:
> > >
> > > > >> The standard has "levels" and "active-levels". Both need to be tracked but it seems we only have a single array?
> > > > >  >
> > > > > 
> > > > > They all are tracked on this array.
> > > >
> > > > Where is that described?
> > >
> > >
> > > Sorry, read the code. I'm not a debugger.
> >
> >
> > First of all, this response is simply rude.
> >  Second, you wrote the code so my question is well justified.
> >  Third, does that mean you encode multiple values in a single int array but did not document this at all? The code is not documentation.
> >  Finally, this will break once we have a parallel region in a recursive loop with more than 254 instantiations, correct?
>
>
> First of all, you started this when expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.

I think it is important to note that, to the best of my knowledge, no one was implying any bad faith on the part of anyone. To name one specific factor: We have had problems in the past where groups of collaborators/coworkers have done off-list reviews, followed only by a perfunctory/superficial on-list review, resulting in a lack of public discussion around the relevant design and implementation considerations. The review history highlighted by Johannes can give that impression, and we all need the community to watch itself in this regard, because of many potentially-relevant factors, to ensure the quality of our public code reviews is high. I see no reason to believe that everyone here wants to create the best possible code base. Sometimes that means one member of the community pointing out that, in his or her opinion, past reviews might have been insufficiently considered, and we should welcome that, and all work together to address these kinds of concerns going forward.

> Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime, it will take a lot of time. Try to understand this at first and then ask the questions about the particular details, but not the whole scheme.

Asking for additional high-level documentation on how some subsystem functions is not, on its face, unreasonable. This happens during code review frequently, and from my perspective, should probably happen even more frequently than it already does. If you feel that the relevant documentation already exists, then it's perfectly appropriate to point this out.

> Or you want me ask you to explain the principles of Attributor in full? Or loop vectorizer? Or something else? Your question is not productive.
>  Third, the single array is used to handle up to 128 (not 256) inner parallel region.  It is fine from point of view of the standard. The number of supported inner parallel levels (both, active and inactive) is implementation defined.

This certainly seems like something that should be documented. The implementation can certainly have limits, but we should certainly know what they are explicitly whenever possible.

> 
> 
>>>>>   We don't need to synchronize the whole program, we just need to synchronize accesses to the parallelLevel array. In SPMD mode with full runtime (or SPMD mode eith atomic ops) the ptx compiler generates unsafe code for parallelLevel array without being marked as volatile (i.e. simple atomic-like construct in CUDa ).
>>>> 
>>>> Nobody argues we should synchronize the whole program. If you want to have atomic accesses to the parallelLevel array, make the accesses to the parallelLevel array atomic.
>>>> 
>>>> You cannot just say that the PTX compiler generates unsafe code. That is not an argument that is a guess.
>>> 
>>> Volatile provides required level of atomicity per PTX ISA.
>> 
>> Arguably, atomic accesses provide atomicity. You also did never respond to the question which other accesses actually race with the ones to `parallelLevel` or why you do not synchronize the accesses to `parallelLevel` across the warp explicitly.
> 
> I answered all the questions. We can use the explicit synchronizatiin, yes, but it will slow down the whole program execution.  Instead, it is more effective to synchronize the access to the single variable. Volatile modifier allows to do this without explicit atomic operations. In Cuda volatile is slightly different form that in C/C++ and can be used to implement relaxed atomic operations.
>  In this array,the memory is written by one thread but can be read by other threads in the same warp. Without volatile modifier or atomic ops the ptxas tool reorders operations and it leads to the undefined behaviour when the runtime is inlined.

This is a very interesting situation. First, I can certainly imagine that older versions of LLVM might have implicitly treated volatile in a way consistent with these semantics, but I have no idea whether Clang's CUDA implementation now supports these semantics explicitly. I'm not saying that we shouldn't use this feature of CUDA as a result, but this is important to understand. I certainly did not realize that CUDA provided special volatile semantics that include atomic properties, and if that's something that LLVM needs to support, we certainly need to have this explicitly documented.

@reames , @tra , @Hahnfeld , @jlebar , @chandlerc,  I see that this was discussed in D50391 <https://reviews.llvm.org/D50391> (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.

Regardless, please don't take offence to skepticism around using volatile to accomplish synchronization. I doubt that most people know about CUDA's particular properties in this regard, and as you know, in C/C++ the properties are much weaker. This is definitely something that requires some comments so that readers of the code understand what's going on.

Repository:
  rOMP OpenMP

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62393/new/

https://reviews.llvm.org/D62393