[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.
Artem Belevich via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Thu Jun 13 10:11:34 PDT 2019
tra added a comment.
In D62393#1541976 <https://reviews.llvm.org/D62393#1541976>, @ABataev wrote:
> In D62393#1541969 <https://reviews.llvm.org/D62393#1541969>, @tra wrote:
> > > @reames , @tra , @Hahnfeld , @jlebar , @chandlerc, I see that this was discussed in D50391 <https://reviews.llvm.org/D50391> (in terms of using PTX's volatile to provide atomic lowering), but it's not clear to me that using volatile with atomic semantics in LLVM is something we guarantee will work (even in CUDA mode). I imagine that we'd need to lower these in terms of relaxed atomics in Clang for this to really be guaranteed to work correctly.
> > Do I understand it correctly that the argument is about using C++ `volatile` with the assumption that those will map to ld.volatile/st.volatile, which happen to provide sufficiently strong guarantees to be equivalent of LLVM's `atomic monotonic` operations?
> > If that's the case, then I would agree that it's an implementation detail one should not be relying on. If atomicity is needed, we should figure out a way to express that in C++.
> > In practice, the standard C++ library support in CUDA is somewhat limited. I don't think atomic<> works, so __volatile__ might be a 'happens to work' short-term workaround. If that's the case, then there should be a comment describing what's going on here and a TODO to fix it when better support for atomics on GPU is available.
> > Another option would be to use atomic*() functions provided by CUDA, but those do have limited functionality on older GPUs.
> > Yet another alternative is to explicitly use ld.volatile/st.volatile from inline assembly. I don't think we have it plumbed as clang builtin.
> Artem, thanks for your comment. But we need not only atomicity; we also need to tell the ptxas compiler not to optimize accesses to parallelLevel.
So you need ld.volatile/st.volatile to make sure the compiler does not reorder or optimize away accesses to shared memory.
C++ volatile will give you that. But you also rely on atomicity, and C++ volatile does not guarantee that, even if NVPTX happens to provide it. It's a mere coincidence. What if the next PTX revision provides a better way to implement C++ volatile without providing atomicity? Then your code will all of a sudden break in new and exciting ways.
I guess the conservative approach here would be to declare the variable volatile, and then use inline asm ld.volatile/st.volatile to increment it. This will guarantee that the code does exactly what you want without assuming that the LLVM back-end always lowers volatile to ld/st.volatile.
> According to several comments, (like this one https://stackoverflow.com/a/1533115) it is recommended to use volatile modifier in Cuda in such cases. If clang does not provide the required level of support for Cuda volatile, I think this is an incompatibility with nvcc.
Wrong SO link? The question does not seem to be CUDA-related.
As for the advice to use __volatile__ with nvcc, in my experience it tends to be overused a lot and it's quite often wrong. We used to have *tons* of it in CUDA code in TensorFlow, and most of those cases turned out to be a big crowbar shoved in the way of compiler optimizations in order to cover up a real issue somewhere else.
> Also, I already thought about using PTX instructions directly. Probably, you're right that it would be better to use them if you're sure that there is a difference between nvcc and clang.
AFAICT, clang will produce PTX .volatile for C++ volatile variables. That's the closest we can get to implementing C++ semantics on GPU. The fact that it may end up providing stronger guarantees should not be relied on, IMO. It will not be obvious to whoever reads the code later, and the LLVM back-end is not guaranteed to always lower C++ volatile as PTX .volatile.