[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.

Tue Oct 8 06:42:26 PDT 2019

ABataev added a comment.

In D62393#1698778 <https://reviews.llvm.org/D62393#1698778>, @JonChesterfield wrote:

> >> The byte array should be volatile qualified, as it was in your initial patch. And you should use inline asm or equivalent to prevent memory reordering for ptx. And the comment should probably reflect that the code requires both ptx volatile and the constraint on memory order.
> > 
> > Not again! We don't need to make it atomic,
>
> I didn't say it should be atomic. I said the data should be volatile and the comment updated.
>
> This patch prevents llvm from moving memory accesses across other memory operations by using a memory clobber in the inline assembly. That the asm also contains the word volatile is distracting but irrelevant to the transforms in llvm.

No, not in LLVM. LLVM behaves correctly, it is ptxas from CUDA8 does some wrong optimizations.

> The PTX compiler presumably looks at the memory clobber or parses the inline asm - either would suffice there.
> 
>> we have memory and threads barriers. But cuda8 sometimes just misses them and moves accesses to the variables across the barriers though it should not.
> 
> This discussion is complicated by two notions of volatile (llvm, ptx) and four compilers (clang, llvm, ptx, nvcc). Consequently it's ambiguous which barriers you're referring to but I don't think that matters here. I suspect 'should not' is contentious as it requires llvm to treat volatile differently when compiling for nvptx.
> 
>> The simplest and less intrusive way to prevent it is to mark those mem accesses as volatile (or atomic, whoch is too expensive). And this is what we do in this patch.
> 
> This patch doesn't mark memory accesses as volatile in llvm. It might do in ptx, if that compiler parses inline asm. This patch does introduce a memory clobber which prevents reordering operations.

We don't need to fix anything in LLVM, we just need to prevent some optimizations in ptxas. PTX inline code does exactly what we need here.

Repository:
  rOMP OpenMP

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62393/new/

https://reviews.llvm.org/D62393