[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.
Alexey Bataev via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Tue Jun 11 15:00:30 PDT 2019
ABataev added a comment.
In D62393#1538931 <https://reviews.llvm.org/D62393#1538931>, @hfinkel wrote:
> In D62393#1538879 <https://reviews.llvm.org/D62393#1538879>, @ABataev wrote:
>
> > In D62393#1538842 <https://reviews.llvm.org/D62393#1538842>, @__simt__ wrote:
> >
> > > I know some things about CUDA, volatile and C++. Let's see if I can understand the part of the proposed change that involves these. I don't understand the part about the test but I don't need to, I'll ignore that.
> > >
> > > The gist of the issue is that parallelLevel table should really be atomic<> because of data-races on it (that was a bug prior to this, maybe there are more such bugs lying around), except there's a problem with the obvious fix: atomic<> is not available in CUDA (yet) so it's not an option to fix this issue. The next best thing we have instead is volatile.
> > >
> > > Now, volatile in PTX (e.g. asm("ld.volatile...[x];")) and volatile (e.g. "volatile ...x;") in C++ source like this, are not exactly the same thing. When CUDA says that volatile is equivalent memory_order_relaxed, it's saying something clear (I think) about the PTX level code but it's still being pretty vague about CUDA C++ level code. OTOH it's entirely possible for Clang to do something with either atomic<> or volatile that isn't valid for the other -- and that's a different can of worms than, say, NVCC which does {whatever NVCC does, it's not the compiler you're using}.
> > >
> > > However, since people do use CUDA C++ volatile this way a lot already, you can't really have a good CUDA toolchain unless it is the case (including, by accident) that it works for this purpose. In other words, it's probably reasonable to assume this in your code because every other code would be on fire otherwise, and it's not on fire, so far as we can tell.
> > >
> > > It might be worth it to prepare your code for the eventual arrival of atomic<> on CUDA. That is, maybe create a template alias on T with some helper functions, just enough for your use. It might make this code more self-documenting and make it easy to make it 100% legit later on.
> >
> >
> > Hi Olivier, thanks for your comments.
>
>
> +1
>
> > You're absolutely right. Actually, we're using both compilers, nvcc and clang (under different conditions, though). Marking the variable volatile does not break it in the LLVM level. Maybe, it is by accident, but I rather doubt in this.
>
> I'm pretty sure that it's not accidental, but I'm very concerned about relying on it working without documenting the required semantics. Two things:
>
> 1. First, we currently provide volatile accesses with some synchronization semantics. For example, for LoadInst, we have this:
>
>
>
> bool isUnordered() const {
> return (getOrdering() == AtomicOrdering::NotAtomic ||
> getOrdering() == AtomicOrdering::Unordered) &&
> !isVolatile();
> }
>
>
>
> while, at the same time, documenting that volatile has no such semantics at the IR level. The LangRef currently says, " The optimizers may change the order of volatile operations relative to non-volatile operations. This is not Java’s “volatile” and has no cross-thread synchronization behavior."
>
> And, thus, my concern. It works, and there is some explicit code to support this functonality, but the semantics are not documented (and, perhaps, are documented *not* to work). This I think that we should correct to ensure correct functioning going forward. Alternatively, we might change how Clang lowers volatile access in CUDA mode to make them relaxed atomics? Any thoughts on this?
Yes, probably it is a good idea to fix possible incompatibility with nvcc. But this problem may be nit very significant for Cuda if some possibly dangerous optimization passes are disabled for CUDA toolchain. But anyway, better to translate it directly to relaxed atomic operations just in case.
>
>
>> Do you suggest to create a template function that will provide the access to the parallelLevel variable? Amd when the atomic<> is supported by Cuda change the type of this variable to atomic<> so the compiler could automatically instantiate this template function with the proper type, right? Or you have something different in mind? If so, could provide a small example of your idea?
Repository:
rOMP OpenMP
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D62393/new/
https://reviews.llvm.org/D62393
More information about the Openmp-commits
mailing list