[Openmp-dev] Declare variant + Nvidia Device Offload

Mon May 18 11:03:26 PDT 2020

On 5/18/20 12:31 PM, Lukas Sommer wrote:
 > Hi Johannes,
 >
 > thanks for your quick reply!
 >
 >> The math stuff works because all declare variant functions are static.
 >>
 >> I think we need to replace the `.` with a symbol that the user cannot
 >> use but that ptxas is not upset about. We should also move
 >> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
 >> `llvm/lib/Frontends/OpenMP/OMPContext.h`, I forgot why I didn't.
 >>
 > The `.` also seems to be part of the mangled context. Where does that
 > mangling take place?

OMPTraitInfo::getMangledName() in OpenMPClause.{h,cpp} (clang)

I guess it doesn't live in OMPContext because it uses the user-defined
structured trait set to create the mangling. Right now I don't see a
reason not to use the VariantMatchInfo instead and move everything into
OMPContext. There is no need to do this now, though.

 > According to the PTX documentation [0], identifiers cannot contain dots,
 > but `$` and `%` are allowed in user-defined names (apart from a few
 > predefined identifiers).
 >
 > Should we replace the dot only for Nvidia devices or in general? Do any
 > other parts of the code rely on the mangling of the variants with dots?

Yes, we need to replace the dot with either of the symbols you
mentioned. If we can use the same symbol on the host, I'm fine with
changing it unconditionally.

Except for the test, I don't think we need to adapt anything else.

FWIW, OpenMP 6.0 will actually define the mangling.
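
To make the separator change concrete, here is a minimal standalone
sketch, not the actual Clang code. The helper names
`isValidPtxIdentifier` and `replaceVariantSeparator` are hypothetical,
and the identifier check is my reading of the rule in the PTX
documentation [0]: a letter followed by any number of `[a-zA-Z0-9_$]`,
or one of `_`, `$`, `%` followed by at least one such character.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Characters allowed after the first one in a PTX identifier
// (assumption: taken from the identifier grammar in the PTX docs).
static bool isFollowSym(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') || c == '_' || c == '$';
}

// Hypothetical helper: checks a name against the PTX identifier rule.
static bool isValidPtxIdentifier(const std::string &name) {
  if (name.empty())
    return false;
  const char first = name[0];
  const bool startsWithLetter =
      (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');
  const bool startsWithSpecial = first == '_' || first == '$' || first == '%';
  if (!startsWithLetter && !startsWithSpecial)
    return false;
  // A leading '_', '$' or '%' must be followed by at least one character.
  if (startsWithSpecial && name.size() < 2)
    return false;
  return std::all_of(name.begin() + 1, name.end(), isFollowSym);
}

// Hypothetical helper: swaps every '.' in a mangled name for `separator`.
static std::string replaceVariantSeparator(std::string name,
                                           char separator = '$') {
  std::replace(name.begin(), name.end(), '.', separator);
  return name;
}

// Usage on the symbol from the ptxas error in the original report.
static void exampleUsage() {
  const std::string dotted = "_Z33atom_add.ompvariant.S2.s6.PnohostPdd";
  assert(!isValidPtxIdentifier(dotted));
  assert(isValidPtxIdentifier(replaceVariantSeparator(dotted)));
}
```

The sketch uses `$` rather than `%` because, as far as I read the
grammar, `%` is only accepted as the first character of an identifier.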

 >> You should also be able to use the clang builtin atomics
 > You were referring to
 > https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
 > weren't you? As far as I can see, those only work on atomic types.

I meant:
http://llvm.org/docs/Atomics.html#libcalls-atomic
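
As a rough sketch of what I have in mind: the generic `__atomic_*`
builtins in Clang and GCC take plain pointers, so they also work on a
non-atomic `double`. The helper name `atomic_add_double` is made up for
this example, and I have only reasoned about it for the host; how it is
lowered for nvptx64 is an open question.

```cpp
#include <cassert>

// Hypothetical helper: atomically adds `val` to `*address` with a
// compare-and-swap loop and returns the value seen before the update.
// Uses only the generic __atomic builtins, which operate on plain objects.
static double atomic_add_double(double *address, double val) {
  double expected;
  __atomic_load(address, &expected, __ATOMIC_RELAXED);
  double desired;
  do {
    desired = expected + val;
    // On failure the builtin writes the current value into `expected`,
    // so the next iteration recomputes `desired` from fresh data.
  } while (!__atomic_compare_exchange(address, &expected, &desired,
                                      /*weak=*/true, __ATOMIC_RELAXED,
                                      __ATOMIC_RELAXED));
  return expected;
}

// Single-threaded usage check; it shows the interface, not the atomicity.
static void exampleUsage() {
  double sum = 1.5;
  const double old = atomic_add_double(&sum, 2.0);
  assert(old == 1.5);
  assert(sum == 3.5);
}
```

Relaxed ordering is used here only because the example accumulates
independent sums; pick a stronger ordering if other memory accesses
depend on the update.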

 >> `omp atomic` should eventually resolve to the same thing (I hope).
 > From what I can see in the generated LLVM IR, this does not seem to be
 > the case. Maybe that's due to the fact that I'm using update or structs
 > (for more context, see [1]):
 >
 >> #pragma omp atomic update
 >> target_cells_[voxelIndex].mean[0] += (double) target_[id].data[0];
 >> #pragma omp atomic update
 >> target_cells_[voxelIndex].mean[1] += (double) target_[id].data[1];
 >> #pragma omp atomic update
 >> target_cells_[voxelIndex].mean[2] += (double) target_[id].data[2];
 >> #pragma omp atomic update
 >> target_cells_[voxelIndex].numberPoints += 1;
 > In the generated LLVM IR, there are a number of atomic loads and an
 > atomicrmw in the end, but no calls to CUDA builtins.
 >
 > The CUDA equivalent of this target region uses calls to atomicAdd and
 > according to nvprof, this is ~10x faster than the code generated for the
 > target region by Clang (although I'm not entirely sure the atomics are
 > the only problem here).

I see. This is a performance bug that needs to be addressed. I'm just not
sure where exactly.

 > Best,
 >
 > Lukas
 >
 > [0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers
 >
 > [1] https://github.com/esa-tu-darmstadt/daphne-benchmark/blob/054bbd723dfdf65926ef3678138c6423d581b6e1/src/OpenMP-offload/ndt_mapping/kernel.cpp#L1361
 >
 > Lukas Sommer, M.Sc.
 > TU Darmstadt
 > Embedded Systems and Applications Group (ESA)
 > Hochschulstr. 10, 64289 Darmstadt, Germany
 > Phone: +49 6151 1622429
 > www.esa.informatik.tu-darmstadt.de
 >
 > On 18.05.20 18:18, Johannes Doerfert wrote:
 >>
 >> Oh, I forgot about this one.
 >>
 >>
 >> The math stuff works because all declare variant functions are static.
 >>
 >> I think we need to replace the `.` with a symbol that the user cannot
 >> use but that ptxas is not upset about. We should also move
 >> `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
 >> `llvm/lib/Frontends/OpenMP/OMPContext.h`, I forgot why I didn't.
 >>
 >> You should also be able to use the clang builtin atomics and even the
 >> `omp atomic` should eventually resolve to the same thing (I hope).
 >>
 >>
 >> Let me know if that helps,
 >>
 >>   Johannes
 >>
 >>
 >>
 >> On 5/18/20 10:33 AM, Lukas Sommer via Openmp-dev wrote:
 >>> Hi all,
 >>>
 >>> what's the current status of declare variant when compiling for Nvidia
 >>> GPUs?
 >>>
 >>> In my code, I have declared a variant of a function, that uses CUDA's
 >>> built-in atomicAdd (using the syntax from OpenMP TR8):
 >>>
 >>>> #pragma omp begin declare variant match(device={kind(nohost)})
 >>>>
 >>>> void atom_add(double* address, double val){
 >>>>         atomicAdd(address, val);
 >>>> }
 >>>>
 >>>> #pragma omp end declare variant
 >>> When compiling with Clang from master, ptxas fails:
 >>>
 >>>> clang++ -fopenmp   -O3 -std=c++11 -fopenmp
 >>>> -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_72 -v
 >>>> [...]
 >>>> ptxas kernel-openmp-nvptx64-nvidia-cuda.s, line 322; fatal   : Parsing
 >>>> error near '.ompvariant': syntax error
 >>>> ptxas fatal   : Ptx assembly aborted due to errors
 >>>> [...]
 >>>> clang-11: error: ptxas command failed with exit code 255 (use -v to
 >>>> see invocation)
 >>> The line mentioned in the ptxas error looks like this:
 >>>
 >>>>         // .globl _Z33atom_add.ompvariant.S2.s6.PnohostPdd
 >>>> .visible .func _Z33atom_add.ompvariant.S2.s6.PnohostPdd(
 >>>>         .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_0,
 >>>>         .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_1
 >>>> )
 >>>> {
 >>> My guess was that ptxas stumbles across the ".ompvariant"-part of the
 >>> mangled function name.
 >>>
 >>> Is declare variant currently supported when compiling for Nvidia GPUs?
 >>> If not, is there a workaround (macro defined only for device
 >>> compilation, access to the atomic CUDA functions, ...)?
 >>>
 >>> Thanks in advance,
 >>>
 >>> Best
 >>>
 >>> Lukas
 >>>
 >