[Openmp-dev] Declare variant + Nvidia Device Offload
Johannes Doerfert via Openmp-dev
openmp-dev at lists.llvm.org
Wed May 20 08:22:24 PDT 2020
On 5/19/20 10:50 AM, Lukas Sommer wrote:
> Hi Johannes,
>
>> Yes, we need to replace the dot with either of the symbols you
>> mentioned. If we can use the same symbols on the host, I'm fine with
>> changing it unconditionally.
> I've prototyped this change locally, replacing the dot with a
> dollar-sign in the places you mentioned. With this change, declaring
> variants for offloading to CUDA devices works for my testcase in my
> setup (ARM + Nvidia SM 7.2).
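>
> For illustration, the prototype essentially boils down to swapping the
> separator string; a minimal sketch (assuming the separator stays
> centralized in `getOpenMPVariantManglingSeparatorStr`, currently in
> `Decl.h`):
>
>> // Sketch only: '$' is legal in PTX user-defined identifiers, '.' is not.
>> static llvm::StringRef getOpenMPVariantManglingSeparatorStr() {
>>   return "$"; // previously "."
>> }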
>
> If you want, I could create a patch for that, but do you (or anyone else
> on the list) know whether the other offloading targets (AMDGPU, NEC VE,
> ...) are all able to handle the dollar sign in the mangled name, so we
> could make this change unconditionally? I do not have a setup for each
> of these to test.
Please prepare a patch. People, e.g., the AMD folks, can then test it,
but I expect it to work fine and that we will adopt it :)
> I was also wondering if legalizing the mangled name shouldn't be handled
> by the backend (NVPTX in this case) instead of the OpenMP-specific parts.
I doubt it is reasonable to change symbol names once they are defined.
It would work here "by accident", but with the OpenMP 6.0 mangling it
will in fact be illegal.
>> I see. This is a performance bug that needs to be addressed. I'm just
>> not sure where exactly.
> Now that I was able to use atomicAdd with declare variant, I can confirm
> that the performance improves significantly. I will try to get some
> more detailed numbers from nvprof.
>
Cool! I'm glad the declare variant begin/end works and that exposing the
CUDA intrinsics is useful/working too :)
> Best,
>
> Lukas
>
> Lukas Sommer, M.Sc.
> TU Darmstadt
> Embedded Systems and Applications Group (ESA)
> Hochschulstr. 10, 64289 Darmstadt, Germany
> Phone: +49 6151 1622429
> www.esa.informatik.tu-darmstadt.de
>
> On 18.05.20 20:03, Johannes Doerfert wrote:
>>
>> On 5/18/20 12:31 PM, Lukas Sommer wrote:
>>> Hi Johannes,
>>>
>>> thanks for your quick reply!
>>>
>>>> The math stuff works because all declare variant functions are static.
>>>>
>>>> I think if we need to replace the `.` with a symbol that the user
>>>> cannot use but the ptx assembler is not upset about, we should also
>>>> move `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
>>>> `llvm/lib/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.
>>>>
>>> The `.` also seems to be part of the mangled context. Where does that
>>> mangling take place?
>>
>> OMPTraitInfo::getMangledName() in OpenMPClause.{h,cpp} (clang)
>>
>> I guess it doesn't live in OMPContext because it uses the user-defined
>> structured trait set to create the mangling. Right now I don't see a
>> reason not to use the VariantMatchInfo and move everything into
>> OMPContext. Though, no need to do this now.
>>
>>> According to the PTX documentation [0], identifiers cannot contain
>>> dots, but `$` and `%` are allowed in user-defined names (apart from a
>>> few predefined identifiers).
>>>
>>> Should we replace the dot only for Nvidia devices or in general? Do any
>>> other parts of the code rely on the mangling of the variants with dots?
>>
>> Yes, we need to replace the dot with either of the symbols you
>> mentioned. If we can use the same symbols on the host, I'm fine with
>> changing it unconditionally.
>>
>> Except for the test, I don't think we need to adapt anything else.
>>
>> FWIW, OpenMP 6.0 will actually define the mangling.
>>
>>
>>>> You should also be able to use the clang builtin atomics
>>> You were referring to
>>> https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins,
>>> weren't you? As far as I can see, those only work on atomic types.
>>
>> I meant:
>> http://llvm.org/docs/Atomics.html#libcalls-atomic
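>>
>> These lower to the `__atomic_*` instructions/libcalls and, unlike the
>> C11 builtins, work on plain (non-`_Atomic`) objects; a minimal sketch
>> (relaxed ordering chosen only for illustration):
>>
>>> // Sketch: the Clang/GCC __atomic builtins accept ordinary pointers
>>> // and lower to the atomic instructions or libcalls described above.
>>> void bump(int *counter) {
>>>   __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
>>> }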
>>
>>
>>>> `omp atomic` should eventually resolve to the same thing (I hope).
>>> From what I can see in the generated LLVM IR, this does not seem to be
>>> the case. Maybe that's due to the fact that I'm using update or
>>> structs (for more context, see [1]):
>>>
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[0] += (double) target_[id].data[0];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[1] += (double) target_[id].data[1];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].mean[2] += (double) target_[id].data[2];
>>>> #pragma omp atomic update
>>>> target_cells_[voxelIndex].numberPoints += 1;
>>> In the generated LLVM IR, there are a number of atomic loads and an
>>> atomicrmw in the end, but no calls to CUDA builtins.
>>>
>>> The CUDA equivalent of this target region uses calls to atomicAdd and,
>>> according to nvprof, this is ~10x faster than the code generated for
>>> the target region by Clang (although I'm not entirely sure the atomics
>>> are the only problem here).
>>
>> I see. This is a performance bug that needs to be addressed. I'm just
>> not sure where exactly.
>>
>>
>>> Best,
>>>
>>> Lukas
>>>
>>> [0]
>>> https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#identifiers
>>>
>>> [1]
>>> https://github.com/esa-tu-darmstadt/daphne-benchmark/blob/054bbd723dfdf65926ef3678138c6423d581b6e1/src/OpenMP-offload/ndt_mapping/kernel.cpp#L1361
>>>
>>> Lukas Sommer, M.Sc.
>>> TU Darmstadt
>>> Embedded Systems and Applications Group (ESA)
>>> Hochschulstr. 10, 64289 Darmstadt, Germany
>>> Phone: +49 6151 1622429
>>> www.esa.informatik.tu-darmstadt.de
>>>
>>> On 18.05.20 18:18, Johannes Doerfert wrote:
>>>>
>>>> Oh, I forgot about this one.
>>>>
>>>> The math stuff works because all declare variant functions are static.
>>>>
>>>> I think if we need to replace the `.` with a symbol that the user
>>>> cannot use but the ptx assembler is not upset about, we should also
>>>> move `getOpenMPVariantManglingSeparatorStr` from `Decl.h` into
>>>> `llvm/lib/Frontend/OpenMP/OMPContext.h`; I forgot why I didn't.
>>>>
>>>> You should also be able to use the clang builtin atomics, and even
>>>> `omp atomic` should eventually resolve to the same thing (I hope).
>>>>
>>>> Let me know if that helps,
>>>> Johannes
>>>>
>>>> On 5/18/20 10:33 AM, Lukas Sommer via Openmp-dev wrote:
>>>>> Hi all,
>>>>>
>>>>> what's the current status of declare variant when compiling for
>>>>> Nvidia GPUs?
>>>>>
>>>>> In my code, I have declared a variant of a function that uses CUDA's
>>>>> built-in atomicAdd (using the syntax from OpenMP TR8):
>>>>>
>>>>>> #pragma omp begin declare variant match(device={kind(nohost)})
>>>>>>
>>>>>> void atom_add(double* address, double val){
>>>>>>   atomicAdd(address, val);
>>>>>> }
>>>>>>
>>>>>> #pragma omp end declare variant
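>>>>> The variant is called from a target region roughly like this (a
>>>>> simplified sketch, not the actual kernel; `sum` and `n` are
>>>>> placeholders):
>>>>>
>>>>>> #pragma omp target teams distribute parallel for map(tofrom: sum)
>>>>>> for (int i = 0; i < n; ++i)
>>>>>>   atom_add(&sum, 1.0);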
>>>>> When compiling with Clang from master, ptxas fails:
>>>>>
>>>>>> clang++ -fopenmp -O3 -std=c++11 -fopenmp
>>>>>> -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target -march=sm_72 -v
>>>>>> [...]
>>>>>> ptxas kernel-openmp-nvptx64-nvidia-cuda.s, line 322; fatal : Parsing
>>>>>> error near '.ompvariant': syntax error
>>>>>> ptxas fatal : Ptx assembly aborted due to errors
>>>>>> [...]
>>>>>> clang-11: error: ptxas command failed with exit code 255 (use -v to
>>>>>> see invocation)
>>>>> The line mentioned in the ptxas error looks like this:
>>>>>
>>>>>> // .globl _Z33atom_add.ompvariant.S2.s6.PnohostPdd
>>>>>> .visible .func _Z33atom_add.ompvariant.S2.s6.PnohostPdd(
>>>>>>   .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_0,
>>>>>>   .param .b64 _Z33atom_add.ompvariant.S2.s6.PnohostPdd_param_1
>>>>>> )
>>>>>> {
>>>>> My guess was that ptxas stumbles across the ".ompvariant" part of the
>>>>> mangled function name.
>>>>>
>>>>> Is declare variant currently supported when compiling for Nvidia
>>>>> GPUs?
>>>>> If not, is there a workaround (macro defined only for device
>>>>> compilation, access to the atomic CUDA functions, ...)?
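>>>>>
>>>>> For reference, the kind of macro-based workaround I have in mind
>>>>> would look roughly like this sketch (assuming a macro such as
>>>>> __NVPTX__ is defined only during the device compilation pass):
>>>>>
>>>>>> void atom_add(double* address, double val){
>>>>>> #ifdef __NVPTX__ // assumption: set only for the nvptx device pass
>>>>>>   atomicAdd(address, val);
>>>>>> #else
>>>>>> #pragma omp atomic update
>>>>>>   *address += val;
>>>>>> #endif
>>>>>> }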
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> Best
>>>>>
>>>>> Lukas
>>>>>
>>>
>>
>