[PATCH] D99432: [OPENMP]Fix PR48851: the locals are not globalized in SPMD mode.

Wed May 5 07:34:04 PDT 2021

ABataev added a comment.

In D99432#2738859 <https://reviews.llvm.org/D99432#2738859>, @estewart08 wrote:

> In D99432#2736981 <https://reviews.llvm.org/D99432#2736981>, @ABataev wrote:
>
>> In D99432#2736970 <https://reviews.llvm.org/D99432#2736970>, @estewart08 wrote:
>>
>>> In D99432#2728788 <https://reviews.llvm.org/D99432#2728788>, @ABataev wrote:
>>>
>>>> In D99432#2726997 <https://reviews.llvm.org/D99432#2726997>, @estewart08 wrote:
>>>>
>>>>> In D99432#2726845 <https://reviews.llvm.org/D99432#2726845>, @ABataev wrote:
>>>>>
>>>>>> In D99432#2726588 <https://reviews.llvm.org/D99432#2726588>, @estewart08 wrote:
>>>>>>
>>>>>>> In D99432#2726391 <https://reviews.llvm.org/D99432#2726391>, @ABataev wrote:
>>>>>>>
>>>>>>>> In D99432#2726337 <https://reviews.llvm.org/D99432#2726337>, @estewart08 wrote:
>>>>>>>>
>>>>>>>>> In D99432#2726060 <https://reviews.llvm.org/D99432#2726060>, @ABataev wrote:
>>>>>>>>>
>>>>>>>>>> In D99432#2726050 <https://reviews.llvm.org/D99432#2726050>, @estewart08 wrote:
>>>>>>>>>>
>>>>>>>>>>> In D99432#2726025 <https://reviews.llvm.org/D99432#2726025>, @ABataev wrote:
>>>>>>>>>>>
>>>>>>>>>>>> In D99432#2726019 <https://reviews.llvm.org/D99432#2726019>, @estewart08 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> In reference to https://bugs.llvm.org/show_bug.cgi?id=48851, I do not see how this helps SPMD mode with team privatization of declarations in-between target teams and parallel regions.
>>>>>>>>>>>>
>>>>>>>>>>>> Diв you try the reproducer with the applied patch?
>>>>>>>>>>>
>>>>>>>>>>> Yes, I still saw the test fail, although it was not with latest llvm-project. Are you saying the reproducer passes for you?
>>>>>>>>>>
>>>>>>>>>> I don't have CUDA installed but from what I see in the LLVM IR it shall pass. Do you have a debug log, does it crashes or produces incorrect results?
>>>>>>>>>
>>>>>>>>> This is on an AMDGPU but I assume the behavior would be similar for NVPTX.
>>>>>>>>>
>>>>>>>>> It produces incorrect/incomplete results in the dist[0] index after a manual reduction and in turn the final global gpu_results array is incorrect.
>>>>>>>>> When thread 0 does a reduction into dist[0] it has no knowledge of dist[1] having been updated by thread 1. Which tells me the array is still thread private.
>>>>>>>>> Adding some printfs, looking at one teams' output:
>>>>>>>>>
>>>>>>>>> SPMD
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 0  // This should be 1
>>>>>>>>>   After reduction into dist[0]: 1  // This should be 2
>>>>>>>>>   gpu_results = [1,1]  // [2,2] expected
>>>>>>>>>
>>>>>>>>> Generic Mode:
>>>>>>>>>
>>>>>>>>>   Thread 0: dist[0]: 1
>>>>>>>>>   Thread 0: dist[1]: 1   
>>>>>>>>>   After reduction into dist[0]: 2
>>>>>>>>>   gpu_results = [2,2]
>>>>>>>>
>>>>>>>> Hmm, I would expect a crash if the array was allocated in the local memory. Could you try to add some more printfs (with data and addresses of the array) to check the results? Maybe there is a data race somewhere in the code?
>>>>>>>
>>>>>>> As a reminder, each thread updates a unique index in the dist array and each team updates a unique index in gpu_results.
>>>>>>>
>>>>>>> SPMD - shows each thread has a unique address for dist array
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 0, 0x7f92e24a8bf8
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7f92e24a8bfc
>>>>>>>   
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7f92e24a8bf0
>>>>>>>   Team 0 Thread 0: dist[1]: 0, 0x7f92e24a8bf4
>>>>>>>   
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 0, 0x7f92f9ec5188
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7f92f9ec518c
>>>>>>>   
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7f92f9ec5180
>>>>>>>   Team 1 Thread 0: dist[1]: 0, 0x7f92f9ec5184
>>>>>>>   
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 1
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7f92a5000000
>>>>>>>   
>>>>>>>   gpu_results[0]: 1
>>>>>>>   gpu_results[1]: 1
>>>>>>>
>>>>>>> Generic - shows each team shares dist array address amongst threads
>>>>>>>
>>>>>>>   Team 0 Thread 1: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 1: dist[1]: 1, 0x7fac01938884
>>>>>>>   
>>>>>>>   Team 0 Thread 0: dist[0]: 1, 0x7fac01938880
>>>>>>>   Team 0 Thread 0: dist[1]: 1, 0x7fac01938884
>>>>>>>   
>>>>>>>   Team 0 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 0 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>>   --------------------------------------------------
>>>>>>>   Team 1 Thread 1: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 1: dist[1]: 1, 0x7fac19354e14
>>>>>>>   
>>>>>>>   Team 1 Thread 0: dist[0]: 1, 0x7fac19354e10
>>>>>>>   Team 1 Thread 0: dist[1]: 1, 0x7fac19354e14
>>>>>>>   
>>>>>>>   Team 1 Thread 0: After reduction into dist[0]: 2
>>>>>>>   Team 1 Thread 0: gpu_results address: 0x7fabc5000000
>>>>>>
>>>>>> Could you check if it works with `-fno-openmp-cuda-parallel-target-regions` option?
>>>>>
>>>>> Unfortunately that crashes:
>>>>> llvm-project/llvm/lib/IR/Instructions.cpp:495: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.
>>>>
>>>> Hmm, could you provide a full stack trace?
>>>
>>> At this point I am not sure I want to dig into that crash as our llvm-branch is not caught up to trunk.
>>>
>>> I did build trunk and ran some tests on a sm_70:
>>> -Without this patch: code fails with incomplete results
>>> -Without this patch and with -fno-openmp-cuda-parallel-target-regions: code fails with incomplete results
>>>
>>> -With this patch: code fails with incomplete results (thread private array)
>>> Team 0 Thread 1: dist[0]: 0, 0x7c1e800000a8
>>> Team 0 Thread 1: dist[1]: 1, 0x7c1e800000ac
>>>
>>> Team 0 Thread 0: dist[0]: 1, 0x7c1e800000a0
>>> Team 0 Thread 0: dist[1]: 0, 0x7c1e800000a4
>>>
>>> Team 0 Thread 0: After reduction into dist[0]: 1
>>> Team 0 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>> Team 1 Thread 1: dist[0]: 0, 0x7c1e816f27c8
>>> Team 1 Thread 1: dist[1]: 1, 0x7c1e816f27cc
>>>
>>> Team 1 Thread 0: dist[0]: 1, 0x7c1e816f27c0
>>> Team 1 Thread 0: dist[1]: 0, 0x7c1e816f27c4
>>>
>>> Team 1 Thread 0: After reduction into dist[0]: 1
>>> Team 1 Thread 0: gpu_results address: 0x7c1ebc800000
>>>
>>> gpu_results[0]: 1
>>> gpu_results[1]: 1
>>> FAIL
>>>
>>> -With this patch and with -fno-openmp-cuda-parallel-target-regions: Pass
>>> Team 0 Thread 1: dist[0]: 1, 0x7a5b56000018
>>> Team 0 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>> Team 0 Thread 0: dist[0]: 1, 0x7a5b56000018
>>> Team 0 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>> Team 0 Thread 0: After reduction into dist[0]: 2
>>> Team 0 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>> Team 1 Thread 1: dist[0]: 1, 0x7a5b56000018
>>> Team 1 Thread 1: dist[1]: 1, 0x7a5b5600001c
>>>
>>> Team 1 Thread 0: dist[0]: 1, 0x7a5b56000018
>>> Team 1 Thread 0: dist[1]: 1, 0x7a5b5600001c
>>>
>>> Team 1 Thread 0: After reduction into dist[0]: 2
>>> Team 1 Thread 0: gpu_results address: 0x7a5afc800000
>>>
>>> gpu_results[0]: 2
>>> gpu_results[1]: 2
>>> PASS
>>>
>>> I am concerned about team 0 and team 1 having the same address for the dist array here.
>>
>> It is caused by the problem with the runtime. It should work with `-fno-openmp-cuda-parallel-target-regions` (I think) option (it uses a different runtime function for this case) and I just want to check that it really works. Looks like currently, runtime allocates a unique array for each thread.
>
> Unfortunately, this patch + the flag does not work for the larger reproducer, the CPU check passes but GPU check fails with incorrect results.
> https://github.com/zjin-lcf/oneAPI-DirectProgramming/blob/master/all-pairs-distance-omp/main.cpp

Ok, I'll drop this patch and prepare the more conservative one, where the kernel will be executed in non-SPMD mode instead. Later it will be improved by the LLVM optimization.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99432/new/

https://reviews.llvm.org/D99432