[Openmp-dev] Declare variant + Nvidia Device Offload
Johannes Doerfert via Openmp-dev
openmp-dev at lists.llvm.org
Wed May 20 08:22:24 PDT 2020
On 5/19/20 10:50 AM, Lukas Sommer wrote:
> Hi Johannes,
>
>> Yes, we need to replace the dot with either of the symbols you
>> mentioned. If we can use the same symbols on the host, I'm fine with
>> changing it unconditionally.
> I've prototyped this change locally, replacing the dot with a
> dollar-sign in the places you mentioned. With this change, declaring
> variants for offloading to CUDA devices works for my testcase in my
> setup (ARM + Nvidia SM 7.2).
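>
> For illustration, the prototype essentially boils down to swapping the
> separator string; a minimal sketch (assuming the separator stays
> centralized in `getOpenMPVariantManglingSeparatorStr`, currently in
> `Decl.h`):
>
>> // Sketch only: '$' is legal in PTX user-defined identifiers, '.' is not.
>> static llvm::StringRef getOpenMPVariantManglingSeparatorStr() {
>>   return "$"; // previously "."
>> }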
>
> If you want, I could create a patch for that, but do you (or anyone else
> on the list) know whether the other offloading targets (AMDGPU, NEC VE,
> ...) are all able to handle the dollar sign in the mangled name, so we
> could make this change unconditionally? I do not have a setup for each
> of these to test.
Please prepare a patch. People, e.g., the AMD folks, can then test it,
but I expect it to work fine and that we will adopt it :)
> I was also wondering if legalizing the mangled name shouldn't be handled
> by the backend (NVPTX in this case) instead of the OpenMP-specific parts.
I doubt it is reasonable to change symbol names once they are defined.
It would work here "by accident", but with the OpenMP 6.0 mangling it
will in fact be illegal.
>> I see. This is a performance bug that needs to be addressed. I'm just
>> not sure where exactly.
> Now that I was able to use atomicAdd with declare variant, I can confirm
> that the performance improves significantly. I will try to get some
> more detailed numbers from nvprof.
>
Cool! I'm glad the declare variant begin/end works and that exposing the
CUDA intrinsics is useful/working too :)
> Best,
>
> Lukas
>
> Lukas Sommer, M.Sc.
> TU Darmstadt
> Embedded Systems and Applications Group (ESA)
> Hochschulstr. 10, 64289 Darmstadt, Germany
> Phone: +49 6151 1622429
> www.esa.informatik.tu-darmstadt.de
>
> On 18.05.20 20:03, Johannes Doerfert wrote:
>>
>> On 5/18/20 12:31 PM, Lukas Sommer wrote:
>>> Hi Johannes,
>>>
>>> thanks for your quick reply!
>>>
>>>> The math stuff works because all declare variant functions are static.
>>>>
>>>> I think if we need to replace the `.` with a symbol that the user
>>>> cannot use but the ptx assembler is not upset about, we should also
>>>> move `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
>>>> `llvm/lib/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.
>>>>
>>> The `.` also seems to be part of the mangled context. Where does that
>>> mangling take place?
>>
>> OMPTraitInfo::getMangledName() in OpenMPClause.{h,cpp} (clang)
>>
>> I guess it doesn't live in OMPContext because it uses the user-defined
>> structured trait set to create the mangling. Right now I don't see a
>> reason not to use the VariantMatchInfo and move everything into
>> OMPContext. Though, no need to do this now.
>>
>>> According to the PTX documentation [0], identifiers cannot contain
>>> dots, but `$` and `%` are allowed in user-defined names (apart from a
>>> few predefined identifiers).
>>>
>>> Should we replace the dot only for Nvidia devices or in general? Do any
>>> other parts of the code rely on the mangling of the variants with dots?
>>
>> Yes, we need to replace the dot with either of the symbols you
>> mentioned. If we can use the same symbols on the host, I'm fine with
>> changing it unconditionally.
>>
>> Except for the test, I don't think we need to adapt anything else.
>>
>> FWIW, OpenMP 6.0 will actually define the mangling.
>>
>>
>>>> You should also be able to use the clang builtin atomics
>>> You were referring to
>>> https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
>>> weren't you? As far as I can see, those only work on atomic types.
>>
>> I meant:
>> http://llvm.org/docs/Atomics.html#libcalls-atomic
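>>
>> These lower to the `__atomic_*` instructions/libcalls and, unlike the
>> C11 builtins, work on plain (non-`_Atomic`) objects; a minimal sketch
>> (relaxed ordering chosen only for illustration):
>>
>>> // Sketch: the Clang/GCC __atomic builtins accept ordinary pointers
>>> // and lower to the atomic instructions or libcalls described above.
>>> void bump(int *counter) {
>>>   __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
>>> }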
>>
>>
>>>> `omp atomic` should eventually resolve to the same thing (I hope).
>>> From what I can see in the generated LLVM IR, this does not seem to be
>>> the case. Maybe that's due to the fact that I'm using update or
>>> structs (for more context, see [1]):
>>>
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[0] += (double) target_[id].data[0];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[1] += (double) target_[id].data[1];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[2] += (double) target_[id].data[2];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].numberPoints += 1;
>>> In the generated LLVM IR, there are a number of atomic loads and an
>>> atomicrmw in the end, but no calls to CUDA builtins.
>>>
>>> The CUDA equivalent of this target region uses calls to atomicAdd and,
>>> according to nvprof, this is ~10x faster than the code generated for
>>> the target region by Clang (although I'm not entirely sure the atomics
>>> are the only problem here).
>>
>> I see. This is a performance bug that needs to be addressed. I'm just
>> not sure where exactly.
>>
>>
>>> Best,
>>>
>>> Lukas
>>>
>>> [0]
>>> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers
>>>
>>> [1]
>>> https://github.com/esa-tu-darmstadt/daphne-benchmark/blob/054bbd723dfdf65926ef3678138c6423d581b6e1/src/OpenMP-offload/ndt_mapping/kernel.cpp#L1361
>>>
>>> Lukas Sommer, M.Sc.
>>> TU Darmstadt
>>> Embedded Systems and Applications Group (ESA)
>>> Hochschulstr. 10, 64289 Darmstadt, Germany
>>> Phone: +49 6151 1622429
>>> www.esa.informatik.tu-darmstadt.de
>>>
>>> On 18.05.20 18:18, Johannes Doerfert wrote:
>>>>
>>>> Oh, I forgot about this one.
>>>>
>>>> The math stuff works because all declare variant functions are static.
>>>>
>>>> I think if we need to replace the `.` with a symbol that the user
>>>> cannot use but the ptx assembler is not upset about, we should also
>>>> move `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
>>>> `llvm/lib/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.
>>>>
>>>> You should also be able to use the clang builtin atomics, and even
>>>> `omp atomic` should eventually resolve to the same thing (I hope).
>>>>
>>>> Let me know if that helps,
>>>> Johannes
>>>>
>>>> On 5/18/20 10:33 AM, Lukas Sommer via Openmp-dev wrote:
>>>>> Hi all,
>>>>>
>>>>> what's the current status of declare variant when compiling for
>>>>> Nvidia GPUs?
>>>>>
>>>>> In my code, I have declared a variant of a function that uses CUDA's
>>>>> built-in atomicAdd (using the syntax from OpenMP TR8):
>>>>>
>>>>>> #pragma omp begin declare variant match(device={kind(nohost)})
>>>>>>
>>>>>> void atom_add(double* address, double val){
>>>>>>   atomicAdd(address, val);
>>>>>> }
>>>>>>
>>>>>> #pragma omp end declare variant
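>>>>> The variant is called from a target region roughly like this (a
>>>>> simplified sketch, not the actual kernel; `sum` and `n` are
>>>>> placeholders):
>>>>>
>>>>>> #pragma omp target teams distribute parallel for map(tofrom: sum)
>>>>>> for (int i = 0; i < n; ++i)
>>>>>>   atom_add(&sum, 1.0);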
>>>>> When compiling with Clang from master, ptxas fails:
>>>>>
>>>>>> clang++ -fopenmp -O3 -std=c++11 -fopenmp
>>>>>> -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_72 -v
>>>>>> [...]
>>>>>> ptxas kernel-openmp-nvptx64-nvidia-cuda.s, line 322; fatal : Parsing
>>>>>> error near '.ompvariant': syntax error
>>>>>> ptxas fatal : Ptx assembly aborted due to errors
>>>>>> [...]
>>>>>> clang-11: error: ptxas command failed with exit code 255 (use -v to
>>>>>> see invocation)
>>>>> The line mentioned in the ptxas error looks like this:
>>>>>
>>>>>> // .globl _Z33atom_add.ompvariant.S2.s6.PnohostPdd
>>>>>> .visible .func _Z33atom_add.ompvariant.S2.s6.PnohostPdd(
>>>>>>   .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_0,
>>>>>>   .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_1
>>>>>> )
>>>>>> {
>>>>> My guess was that ptxas stumbles across the ".ompvariant" part of the
>>>>> mangled function name.
>>>>>
>>>>> Is declare variant currently supported when compiling for Nvidia
>>>>> GPUs?
>>>>> If not, is there a workaround (macro defined only for device
>>>>> compilation, access to the atomic CUDA functions, ...)?
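>>>>>
>>>>> For reference, the kind of macro-based workaround I have in mind
>>>>> would look roughly like this sketch (assuming a macro such as
>>>>> __NVPTX__ is defined only during the device compilation pass):
>>>>>
>>>>>> void atom_add(double* address, double val){
>>>>>> #ifdef __NVPTX__ // assumption: set only for the nvptx device pass
>>>>>>   atomicAdd(address, val);
>>>>>> #else
>>>>>> #pragma omp atomic update
>>>>>>   *address += val;
>>>>>> #endif
>>>>>> }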
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Best
>>>>>
>>>>> Lukas
>>>>>
>>>
>>
>