[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.
Alexey Bataev via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Thu Jun 13 11:02:25 PDT 2019
ABataev added a comment.
In D62393#1542145 <https://reviews.llvm.org/D62393#1542145>, @hfinkel wrote:
> In D62393#1539630 <https://reviews.llvm.org/D62393#1539630>, @ABataev wrote:
>
> > In D62393#1539086 <https://reviews.llvm.org/D62393#1539086>, @jdoerfert wrote:
> >
> > > **This is a clarification of some older comments.**
> > >
> > > In D62393#1537313 <https://reviews.llvm.org/D62393#1537313>, @ABataev wrote:
> > >
> > > > First of all, you started this when you expressed the thought that the patches are accepted not because they are correct, but because we all work at the same company. This was rude! Besides, we don't work at the same company anymore.
> > >
> > >
> > > I do not remember saying this, especially since I know you don't work at the same company.
> > >
> > > > Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime; it would take a lot of time. Try to understand it first and then ask questions about the particular details, not about the whole scheme. Or do you want me to ask you to explain the principles of the Attributor in full? Or the loop vectorizer? Or something else? Your question is not productive.
> > >
> > > That is exactly what happened (wrt. the Attributor): people looked at the code, asked about principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have just ignored my requests for clarifications and justifications, which is what this whole code review actually started with.
> > >
> > > > Third, the single array is used to handle up to 128 (not 256) inner parallel regions. It is fine from the point of view of the standard. The number of supported inner parallel levels (both active and inactive) is implementation-defined.
> > >
> > > I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt. the number of levels we can handle.
> > >
> > > In D62393#1538348 <https://reviews.llvm.org/D62393#1538348>, @ABataev wrote:
> > >
> > > > Hal, I agree that it was a very dangerous situation, but I think you need to prove something before trying to escalate the situation and blame somebody for doing incorrect things. Johannes did this without any proof; he directly blamed me and others for doing improper things, though he has no idea at all how things work in Cuda.
> > >
> > >
> > > I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.
> > >
> > > For the record, this code review was (from my side) never about accusations or improper behavior but about:
> > >
> > > 1. The right solution to a problem I don't think we specified(*) properly, and
> > > 2. The general lack of documentation that led to various questions on how to interpret the changes.
> > >
> > > (*) My questions about which accesses are racy have been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value across all threads in the warp.
> > >
> > > I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.
> > >
> > >
> > >
> > > > Yes, but you need to have at least some basic knowledge about OpenMP itself, Cuda, and the OpenMP runtime. I have no time to write lectures explaining basic things to Johannes.
> > >
> > > I, and probably other people, might argue that I have some OpenMP/Clang experience.
> > >
> > > > Yes, you're right. The design decisions must be described somewhere. But I don't think that this must prevent work on improving the runtime.
> > >
> > > This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.
> > >
> > > >>>> Arguably, atomic accesses provide atomicity. You also never responded to the question of which other accesses actually race with the ones to `parallelLevel`, or why you do not synchronize the accesses to `parallelLevel` across the warp explicitly.
> > > >>>
> > > >>> I answered all the questions. We can use explicit synchronization, yes, but it would slow down the whole program execution. Instead, it is more effective to synchronize the accesses to the single variable. The volatile modifier allows us to do this without explicit atomic operations. In Cuda, volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
> > > >>> In this array, the memory is written by one thread but can be read by other threads in the same warp. Without the volatile modifier or atomic ops, the ptxas tool reorders operations, and that leads to undefined behaviour when the runtime is inlined.
> > >
> > > According to D50391 <https://reviews.llvm.org/D50391>, explicit relaxed accesses are lowered into volatile PTX accesses. Why did you see a slowdown, or did you simply expect one?
> > >
> > > > If somebody tries to review the code, the review must be productive. If the review turns into a set of lectures about language details, it is not productive; those are school lectures. Should I explain the Cuda language in the review? No. To review the code you need to have at least some minimal level of expertise.
> > >
> > > I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and therefore ignoring my concerns is not helping anyone.
> >
> >
> > Hal, @hfinkel, do you really want me to continue with all this stuff?
>
>
> Yes, there are outstanding technical questions; from @jdoerfert:
>
> > I'm also still not following what race condition you try to prevent with volatile in the first place; could you elaborate on which accesses are racing without it?
Several threads in the same warp access the same `parallelLevel` array element. The compiler optimizes out some of the accesses (because it is not aware of the accesses from the other threads), and that leads to undefined behaviour.
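For illustration only, here is a minimal sketch of the pattern (the names are made up; this is not the actual runtime source): one lane of a warp updates a per-warp counter, and the other lanes of the same warp read it back. Marking the array volatile forces ptxas to emit a real load for every access instead of reusing a value cached in a register.

    // Hypothetical sketch of the racy pattern; all names are illustrative.
    #define WARPS_PER_TEAM (1024 / 32)

    __device__ static volatile char ParallelLevel[WARPS_PER_TEAM];

    __global__ void enter_parallel(char *Out) {
      const unsigned WarpId = threadIdx.x / warpSize;
      if (threadIdx.x % warpSize == 0)
        ++ParallelLevel[WarpId]; // written by a single lane of the warp...
      __syncwarp();              // ...which the other lanes must wait for...
      // ...and read by every lane; volatile guarantees a fresh load here
      // rather than a value the compiler loaded before the increment.
      Out[threadIdx.x] = ParallelLevel[WarpId];
    }

Without the volatile qualifier, nothing stops the compiler from hoisting the load above the increment or keeping the element in a register, because it assumes no other thread touches it.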
> and a related question regarding alternate synchronization schemes (my suspicion is that it's more useful to answer the question about which accesses are racing first, so that the reviewers can assess what ordering semantics are required, and then address the benefits or drawbacks of alternate synchronization schemes).
>
> @jdoerfert also asked:
>
>> I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but previously not on my radar, is the fact that there is an apparently undocumented implementation detail wrt. the number of levels we can handle.
>
> And either the requested documentation should be added, or if it already exists, you should point that out.
I agree that documentation should be added; I already said this. But it is not directly related to this patch.
> Your reply to @Hahnfeld:
>
>> ... 1 is returned if at least 1 parallel region is active. And we count both active and inactive regions. We may have only 1 active region, the very first one. If the first region is active, the MSB is set to 1 to mark this situation.
>
> this should also appear in a comment in the code.
> In general, please treat all reviewer comments as comments from experts that should be addressed (and, generally, code reviewers do have relatively-rare expertise in particular areas). Experts are also busy, however, so we all try to be respectful of each other's time. Experts also make mistakes and suboptimal decisions, of course, which is why we have code review. Thus, if someone asks for an explanation of how something works in the current or updated code, then provide an explanation. If someone requests additional documentation, then add it. Yes, the expert can read the code, but it's more efficient for you to provide the explanation (as the author of the patch, you have presumably read the code most recently) or documentation.
Yes, I agree, and I can answer the technical questions if they are productive and lead to better code quality.
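To make the encoding concrete, here is a sketch of the kind of comment that could be added (the names and helpers below are illustrative, not the actual runtime identifiers): the MSB of a counter marks the single possible active region, and the low bits count the nesting depth, active and inactive regions together, which is where the limit of 128 levels mentioned earlier in the thread comes from.

    // Hypothetical sketch of the level encoding; names are made up.
    // Each counter packs two facts into one byte:
    //   bit 7     - set iff the (at most one) active parallel region exists;
    //   bits 0..6 - number of parallel regions, active and inactive together,
    //               which bounds the supported nesting depth.
    #define ACTIVE_LEVEL_BIT 0x80
    #define LEVEL_MASK 0x7f

    __device__ static unsigned char IncParallelLevel(unsigned char C,
                                                     bool IsActive) {
      // Entering a region bumps the depth; the first active region also
      // sets the MSB.
      return (unsigned char)(((C & LEVEL_MASK) + 1) | (C & ACTIVE_LEVEL_BIT) |
                             (IsActive ? ACTIVE_LEVEL_BIT : 0));
    }

    __device__ static int InParallel(unsigned char C) {
      // "1 is returned if at least 1 parallel region is active."
      return (C & ACTIVE_LEVEL_BIT) ? 1 : 0;
    }

With such a layout a single byte answers both questions: whether any region is active (the MSB) and how deep the nesting is (the low bits).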
> Again, I see no reason to believe that everyone here isn't acting in good faith and working to create software of the highest possible quality. Thanks!
Sorry, but Johannes did this. "The more patches go through with this kind of "review" the more "suspicious" this looks to me." Those are his words. It seems to me he does not think that I or my colleagues are acting in good faith.
Repository:
rOMP OpenMP
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D62393/new/
https://reviews.llvm.org/D62393