[Openmp-commits] [PATCH] D62393: [OPENMP][NVPTX]Mark parallel level counter as volatile.

Alexey Bataev via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Thu Jun 20 11:26:04 PDT 2019


ABataev added a comment.

In D62393#1549779 <https://reviews.llvm.org/D62393#1549779>, @ABataev wrote:

> In D62393#1549761 <https://reviews.llvm.org/D62393#1549761>, @jdoerfert wrote:
>
> > In D62393#1548388 <https://reviews.llvm.org/D62393#1548388>, @ABataev wrote:
> >
> > > In D62393#1543858 <https://reviews.llvm.org/D62393#1543858>, @jdoerfert wrote:
> > >
> > > > In D62393#1542731 <https://reviews.llvm.org/D62393#1542731>, @ABataev wrote:
> > > >
> > > > > In D62393#1542638 <https://reviews.llvm.org/D62393#1542638>, @jdoerfert wrote:
> > > > >
> > > > > > In D62393#1542513 <https://reviews.llvm.org/D62393#1542513>, @ABataev wrote:
> > > > > >
> > > > > > > In D62393#1542471 <https://reviews.llvm.org/D62393#1542471>, @jdoerfert wrote:
> > > > > > >
> > > > > > > > I want to investigate the racy accesses further and make sure it is not a miscompile inside LLVM.
> > > > > > >
> > > > > > >
> > > > > > > This is not a problem inside LLVM. The problem appears after optimizations performed by the ptxas tool (when it compiles PTX to SASS) at O3 with the inlined runtime.
> > > > > > >
> > > > > > > > I extracted the test case (see below) but I was not seeing the `ERROR`. How did you run the test case to see a different value for `Count`?
> > > > > > >
> > > > > > > You need to compile it with the inlined runtime at O2 or O3.
> > > > > >
> > > > > >
> > > > > > When I run 
> > > > > >  `./bin/clang -fopenmp-targets=nvptx64-nvidia-cuda -O3 -fopenmp --cuda-path=/soft/compilers/cuda/cuda-9.1.85  -Xopenmp-target -march=sm_70  -fopenmp=libomp  test.c -o test.ll -emit-llvm -S`
> > > > > >  I get
> > > > > >
> > > > > >   https://gist.github.com/jdoerfert/4376a251d98171326d625f2fb67b5259
> > > > > >
> > > > > > which shows the inlined and optimized libomptarget.
> > > > > >
> > > > > > > And you need the latest version of the libomptarget
> > > > > >
> > > > > > My version is from today Jun 13 15:24:11 2019, git: 3bc6e2a7aa3853b06045c42e81af094647c48676 <https://reviews.llvm.org/rG3bc6e2a7aa3853b06045c42e81af094647c48676>
> > > > >
> > > > >
> > > > > We have problems with Cuda 8, at least for arch sm_35.
> > > >
> > > >
> > > > I couldn't get that version to run properly so I asked someone who had a system set up. 
> > > >  Unfortunately, the test.c [1] did not trigger the problem. In test.c we run the new test part in `spmd_parallel_regions.cpp` 1000 times and check the result each time.
> > > >  It was run with Cuda 8.0 for sm_35, sm_37, and sm_70.
> > > >
> > > > Could you share more information on how the system has to look to trigger the problem?
> > > >  Could you take a look at the test case we run and make sure it triggers the problem on your end?
> > > >
> > > > [1] https://gist.github.com/jdoerfert/d2b18ca8bb5c3443cc1d26b23236866f
> > >
> > >
> > > You need to apply the patch D62318 <https://reviews.llvm.org/D62318> to reproduce the problem for sure.
> >
> >
> > This means the problem, as of right now, does not exist, correct?
>
>
> No, it still might appear, but it is much harder to run into the problem with the current version of the runtime.
>
> > If so, what part of the D62318 <https://reviews.llvm.org/D62318> patch is causing the problem?
>
> I significantly reduced the size of the runtime class, and that triggers some of the optimizations more often. The access to the parallelLevel variable, when we check the current parallel level to get the correct thread ID, is what triggers those optimizations.
>
> > Does the `test.c` that I floated earlier expose the problem then or do I need a different test case?
> >  What configuration are you running? Is it reproducible with Cuda 9/10 and sm_70?
>
> Yes, it exposes the problem, but only with D62318 <https://reviews.llvm.org/D62318> applied. Not sure about Cuda 9; I will try to check this later today.


Checked with Cuda 9, and it works. Most probably the problem is specific to Cuda 8: some of the offending optimizations in the ptxas tool were likely fixed in Cuda 9. Clang defines the macro CUDA_VERSION and sets it to 8000 for Cuda 8, so I can check this macro and use the volatile modifier only for Cuda 8.
To reproduce the problem, you need to build the runtime without debug info and build the test at O3.
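
For reference, a minimal sketch of the Cuda-8-only approach described above. This is not the actual libomptarget source: the counter name, the array sizing, and the PARALLEL_LEVEL_QUAL macro are illustrative, and the version check assumes (as noted above) that clang sets CUDA_VERSION to 8000 for Cuda 8.

  #include <stdint.h>

  // Illustrative constants approximating the NVPTX device runtime layout.
  #define SKETCH_MAX_THREADS_PER_TEAM 1024
  #define SKETCH_WARPSIZE 32

  // Only Cuda 8 needs the workaround: its ptxas appears to mis-optimize the
  // racy reads of the parallel level counter at O3 once the runtime is inlined.
  #if defined(CUDA_VERSION) && CUDA_VERSION < 9000
  #define PARALLEL_LEVEL_QUAL volatile
  #else
  #define PARALLEL_LEVEL_QUAL
  #endif

  // One counter per warp; it is read when checking the current parallel
  // level to pick the correct thread ID, the access mentioned above.
  static __device__ PARALLEL_LEVEL_QUAL uint8_t
      parallelLevel[SKETCH_MAX_THREADS_PER_TEAM / SKETCH_WARPSIZE];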


Repository:
  rOMP OpenMP

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D62393/new/

https://reviews.llvm.org/D62393




