[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.

Thu Jun 13 09:50:43 PDT 2019

__simt__ added a comment.

In D62393#1541976 <https://reviews.llvm.org/D62393#1541976>, @ABataev wrote:

> In D62393#1541969 <https://reviews.llvm.org/D62393#1541969>, @tra wrote:
>
> > > @reames , @tra , @Hahnfeld , @jlebar , @chandlerc,  I see that this was discussed in D50391 <https://reviews.llvm.org/D50391> (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.
> >
> > Do I understand it correctly that the argument is about using C++ `volatile` with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's `atomic monotonic` operations?
> >
> > If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.
> >
> > In practice, the standard C++ library support in cuda is somewhat limited. I don't think atomic<> works, so __volatile__ might be a 'happens to work' short-term workaround. If that's the case, then there should a comment describing what's going on here and TODO to fix it when better support for atomics on GPU is available.
> >
> > Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.
> >
> > Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.
>
>
> Artem, thanks for your comment. But we need not only atomicity, bht also we need to command the ptxas compiler to not optimize accesses to parallelLevel. According to several comments, (like this one https://stackoverflow.com/a/1533115) it is recommended to use volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
>  Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.

The quoted stack overflow comment doesn't seem like the one you intended. Seems like a mispaste.

When you say "we need to command the ptxas compiler to not optimize accesses to parallelLevel", this is not the best way to describe what you actually need here. I think you're channeling the belief that compilers don't optimize accesses to `std::atomic<T>`, and explaining what you need in terms of that belief. It seems that any optimization that could be performed (and some are indeed performed!) on `std::atomic<T>` with `memory_order_relaxed` is valid to perform on the accesses to `parallelLevel` as well, because that is really what it should be.

(If that is not the case, then now would be a good time to say it.)

The next conclusion is completely correct: you get that very semantic from the ptxas compiler by using either `*.relaxed.sys` or `*.volatile` accesses. Both have the same meaning as accesses `std::atomic<T>` with `memory_order_relaxed` in modern PTX, whilst nothing in modern PTX carries a true "no optimizations" meaning.

Lastly, I can confirm that Clang definitely supports inline PTX asm, I've used it, with these instructions specifically.

Repository:
  rOMP OpenMP

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62393/new/

https://reviews.llvm.org/D62393