[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.
Johannes Doerfert via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Tue Jun 11 16:25:30 PDT 2019
jdoerfert added a comment.
**This is a clarification of some older comments.**
In D62393#1537313 <https://reviews.llvm.org/D62393#1537313>, @ABataev wrote:
> First of all, you started this when expressed the thought that the patches are accepted not because they are correct, but because we all work in the same company. This was rude! Besides, we don't work at the same company anymore.
I do not remember saying this, especially since I know we do not work at the same company.
> Second, you need to try to understand the main idea yourself. I can't explain the whole protocol between the compiler and the runtime, it will take a lot of time. Try to understand this at first and then ask the questions about the particular details, but not the whole scheme. Or you want me ask you to explain the principles of Attributor in full? Or loop vectorizer? Or something else? Your question is not productive.
That is exactly what happened with the Attributor: people looked at the code, asked about its principles, requested documentation, etc. If you look at the code now, it is all the better for it. So far you have simply ignored my requests for clarifications and justifications, which is what this whole code review actually started with.
> Third, the single array is used to handle up to 128 (not 256) inner parallel region. It is fine from point of view of the standard. The number of supported inner parallel levels (both, active and inactive) is implementation-defined.
I asked where we document that a single array encodes both the number of active and the number of inactive parallel regions at the same time. The code is not sufficient documentation for such a low-level implementation detail. Similar, but previously not on my radar, is the fact that there is an apparently undocumented implementation limit on the number of levels we can handle.
In D62393#1538348 <https://reviews.llvm.org/D62393#1538348>, @ABataev wrote:
> Hal, I agree that it was very dangerous situations, but I think, you need to prove something before trying to escalate the situation and blame somebody in doing the incorrect things. Johannes did this without any proves, he directly blamed me and others in doing improper things. Though he has no idea at all how things work in Cuda.
I did already provide plenty of arguments in https://reviews.llvm.org/D62199#1515182 and https://reviews.llvm.org/D62199#1517073.
For the record, this code review was (from my side) never about accusations or improper behavior but about:
1. The right solution to a problem I don't think we specified(*) properly, and
2. The general lack of documentation that led to various questions on how to interpret the changes.
(*) My question about which accesses are racy has been ignored, as have my inquiries about an alternative explicit synchronization to communicate the value to all threads in the warp.
I don't think I have "no idea at all how things work in Cuda" and I don't appreciate the accusation.
> Yes, but you need to have at least some basic knowledge about OpenMP itself, Cuda and the OpenMP runtime. I have no time to write the lectures trying to explain some basic things to Johannes.
I, and probably other people, would argue that I have some OpenMP/Clang experience.
> Yes, you're right. The design decisions must described somewhere. But I don't think that this must prevent the work on the runtime improvement.
This seems to me like a core issue. Improvements without documentation will inevitably cause problems down the road, and rushing into a solution is also not a sustainable practice.
>>>> Arguably, atomic accesses provide atomicity. You also did never respond to the question which other accesses actually race with the ones to `parallelLevel` or why you do not synchronize the accesses to `parallelLevel` across the warp explicitly.
>>> I answered all the questions. We can use the explicit synchronization, yes, but it will slow down the whole program execution. Instead, it is more effective to synchronize the access to the single variable. Volatile modifier allows to do this without explicit atomic operations. In Cuda volatile is slightly different from that in C/C++ and can be used to implement relaxed atomic operations.
>>> In this array, the memory is written by one thread but can be read by other threads in the same warp. Without volatile modifier or atomic ops the ptxas tool reorders operations and it leads to the undefined behaviour when the runtime is inlined.
According to D62393 <https://reviews.llvm.org/D50391>, explicit relaxed accesses are lowered into volatile PTX accesses anyway. Given that, did you actually measure a slowdown, or did you simply expect one?
> If somebody tries to review the code, this review must be productive. If the review leads to the set of lectures about language details, it is not productive. It is school lectures. Should I explain Cuda language in the review? No. To review the code you need to have at least some minimal level of the expertise.
I never asked for a school lecture, nor is it appropriate that you think you have to give me one instead of actually answering my questions and commenting on my concerns. Simply questioning my expertise and ignoring my concerns on that basis is not helping anyone.