[Openmp-dev] Declare variant + Nvidia Device Offload

Lukas Sommer via Openmp-dev openmp-dev at lists.llvm.org
Tue May 19 08:50:23 PDT 2020


Hi Johannes,

> Yes, we need to replace the dot with either of the symbols you
> mentioned. If we can use the same symbols on the host, I'm fine with
> changing it unconditionally.
I've prototyped this change locally, replacing the dot with a
dollar-sign in the places you mentioned. With this change, declaring
variants for offloading to CUDA devices works for my testcase in my
setup (ARM + Nvidia SM 7.2).

If you want, I could create a patch for that, but do you (or anyone else
on the list) know whether the other offloading targets (AMDGPU, NEC VE,
...) are all able to handle the dollar sign in the mangled name, so we
could make this change unconditionally? I do not have a setup for each
of these to test.

I was also wondering if legalizing the mangled name shouldn't be handled
by the backend (NVPTX in this case) instead of the OpenMP-specific parts.

> I see. This is a performance bug that needs to be addressed. I'm just
> not sure where exactly.
Now that I was able to use atomicAdd with declare variant, I can confirm
that the performance improves significantly with atomicAdd. I will try
to get some more detailed numbers from nvprof.

Best,

Lukas

Lukas Sommer, M.Sc.
TU Darmstadt
Embedded Systems and Applications Group (ESA)
Hochschulstr. 10, 64289 Darmstadt, Germany
Phone: +49 6151 1622429
www.esa.informatik.tu-darmstadt.de

On 18.05.20 20:03, Johannes Doerfert wrote:
>
> On 5/18/20 12:31 PM, Lukas Sommer wrote:
> > Hi Johannes,
> >
> > thanks for your quick reply!
> >
> >> The math stuff works because all declare variant functions are static.
> >>
> >> I think we need to replace the `.` with a symbol that the user cannot
> >> use but the ptxas assembler is not upset about. We should also move
> >> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
> >> `llvm/lib/Frontends/OpenMP/OMPContext.h`; I forgot why I didn't.
> >>
> > The `.` also seems to be part of the mangled context. Where does that
> > mangling take place?
>
> OMPTraitInfo::getMangledName() in OpenMPClause.{h,cpp} (clang)
>
> I guess it doesn't live in OMPContext because it uses the user defined
> structured trait set to create the mangling. Right now I don't see a
> reason not to use the VariantMatchInfo and move everything into
> OMPContext. Though, no need to do this now.
>
> > According to the PTX documentation [0], identifiers cannot contain
> > dots, but `$` and `%` are allowed in user-defined names (apart from a
> > few predefined identifiers).
> >
> > Should we replace the dot only for Nvidia devices or in general? Do any
> > other parts of the code rely on the mangling of the variants with dots?
>
> Yes, we need to replace the dot with either of the symbols you
> mentioned. If we can use the same symbols on the host, I'm fine with
> changing it unconditionally.
>
> Except for the test, I don't think we need to adapt anything else.
>
> FWIW, OpenMP 6.0 will actually define the mangling.
>
>
> >> You should also be able to use the clang builtin atomics
> > You were referring to
> > https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
> > weren't you? As far as I can see, those only work on atomic types.
>
> I meant:
> http://llvm.org/docs/Atomics.html#libcalls-atomic
>
>
> >> `omp atomic` should eventually resolve to the same thing (I hope).
> > From what I can see in the generated LLVM IR, this does not seem to be
> > the case. Maybe that's due to the fact that I'm using update or structs
> > (for more context, see [1]):
> >
> >> #pragma omp atomic update
> >> target_cells_[voxelIndex].mean[0] += (double) target_[id].data[0];
> >> #pragma omp atomic update
> >> target_cells_[voxelIndex].mean[1] += (double) target_[id].data[1];
> >> #pragma omp atomic update
> >> target_cells_[voxelIndex].mean[2] += (double) target_[id].data[2];
> >> #pragma omp atomic update
> >> target_cells_[voxelIndex].numberPoints += 1;
> > In the generated LLVM IR, there are a number of atomic loads and an
> > atomicrmw in the end, but no calls to CUDA builtins.
> >
> > The CUDA equivalent of this target region uses calls to atomicAdd and,
> > according to nvprof, this is ~10x faster than the code generated for
> > the target region by Clang (although I'm not entirely sure the atomics
> > are the only problem here).
>
> I see. This is a performance bug that needs to be addressed. I'm just
> not sure where exactly.
>
>
> > Best,
> >
> > Lukas
> >
> > [0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers
> >
> > [1] https://github.com/esa-tu-darmstadt/daphne-benchmark/blob/054bbd723dfdf65926ef3678138c6423d581b6e1/src/OpenMP-offload/ndt_mapping/kernel.cpp#L1361
> >
> > Lukas Sommer, M.Sc.
> > TU Darmstadt
> > Embedded Systems and Applications Group (ESA)
> > Hochschulstr. 10, 64289 Darmstadt, Germany
> > Phone: +49 6151 1622429
> > www.esa.informatik.tu-darmstadt.de
> >
> > On 18.05.20 18:18, Johannes Doerfert wrote:
> >>
> >> Oh, I forgot about this one.
> >>
> >>
> >> The math stuff works because all declare variant functions are static.
> >>
> >> I think we need to replace the `.` with a symbol that the user cannot
> >> use but the ptxas assembler is not upset about. We should also move
> >> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
> >> `llvm/lib/Frontends/OpenMP/OMPContext.h`; I forgot why I didn't.
> >>
> >>
> >>
> >> You should also be able to use the clang builtin atomics and even the
> >> `omp atomic` should eventually resolve to the same thing (I hope).
> >>
> >>
> >> Let me know if that helps,
> >>
> >>   Johannes
> >>
> >>
> >>
> >> On 5/18/20 10:33 AM, Lukas Sommer via Openmp-dev wrote:
> >>> Hi all,
> >>>
> >>> what's the current status of declare variant when compiling for
> >>> Nvidia GPUs?
> >>>
> >>> In my code, I have declared a variant of a function that uses CUDA's
> >>> built-in atomicAdd (using the syntax from OpenMP TR8):
> >>>
> >>>> #pragma omp begin declare variant match(device={kind(nohost)})
> >>>>
> >>>> void atom_add(double* address, double val){
> >>>>   atomicAdd(address, val);
> >>>> }
> >>>>
> >>>> #pragma omp end declare variant
> >>> When compiling with Clang from master, ptxas fails:
> >>>
> >>>> clang++ -fopenmp   -O3 -std=c++11 -fopenmp
> >>>> -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_72 -v
> >>>> [...]
> >>>> ptxas kernel-openmp-nvptx64-nvidia-cuda.s, line 322; fatal   : Parsing
> >>>> error near '.ompvariant': syntax error
> >>>> ptxas fatal   : Ptx assembly aborted due to errors
> >>>> [...]
> >>>> clang-11: error: ptxas command failed with exit code 255 (use -v to
> >>>> see invocation)
> >>> The line mentioned in the ptxas error looks like this:
> >>>
> >>>>         // .globl _Z33atom_add.ompvariant.S2.s6.PnohostPdd
> >>>> .visible .func _Z33atom_add.ompvariant.S2.s6.PnohostPdd(
> >>>>         .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_0,
> >>>>         .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_1
> >>>> )
> >>>> {
> >>> My guess was that ptxas stumbles across the ".ompvariant"-part of the
> >>> mangled function name.
> >>>
> >>> Is declare variant currently supported when compiling for Nvidia
> >>> GPUs?
> >>> If not, is there a workaround (macro defined only for device
> >>> compilation, access to the atomic CUDA functions, ...)?
> >>>
> >>> Thanks in advance,
> >>>
> >>> Best
> >>>
> >>> Lukas
> >>>
> >
>


