[Openmp-commits] [PATCH] D77609: [OpenMP] Added the support for hidden helper task in RTL

Fri Jan 22 19:11:41 PST 2021

tianshilei1992 added a comment.

In D77609#2515690 <https://reviews.llvm.org/D77609#2515690>, @JonChesterfield wrote:

> The information I've got on the possible race is:
> When this patch is applied (by git's automerge, I think) to the rocm stack, a test located at:
> https://github.com/ROCm-Developer-Tools/aomp/blob/master/test/smoke/devices/devices.c
> fails in unpredictable fashion.
>
> I've reproduced the test here as it's fairly short, but it uses some functions on the device that the trunk implementation returns zero for. Adjusted so it builds on trunk. Run as
>
>   export LD_LIBRARY_PATH=$HOME/llvm-install/lib/ ; $HOME/llvm-install/bin/clang  -O2  -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=sm_50   devices.c -o devices -L/usr/local/cuda/targets/x86_64-linux/lib -lcudart && valgrind --fair-sched=yes ./devices
>
>
>
>   // devices.c
>   #include <stdio.h>
>   #include <omp.h>
>   
>   int main() {
>     int num_devs = omp_get_num_devices();
>     for (int device_num = 0; device_num < num_devs ; device_num++) {
>   #pragma omp target device(device_num) nowait
>   #pragma omp teams num_teams(2) thread_limit(4)
>   #pragma omp parallel num_threads(2)
>       {
>         // need to pass the total device number to all devices, per module load
>         int num_threads = omp_get_num_threads();
>         int num_teams   = omp_get_num_teams();
>         int num_devices = omp_get_num_devices(); // not legal in 4.5
>   
>         // need to pass the device id to the device starting the kernel
>         int thread_id   = omp_get_thread_num();
>         int team_id     = omp_get_team_num();
>         int device_id   = 0; // omp_get_device_num();  // no API in omp 4.5
>   
>         // assume we have homogeneous devices
>         int total_threads = num_devices * num_teams * num_threads;
>         int gthread_id    = (device_id * num_teams * num_threads) + (team_id * num_threads) + thread_id;
>   
>         // print out id
>         printf("Hello OpenMP 5 from \n");
>         printf(" Device num  %d of %d devices\n", device_id, num_devices);
>         printf(" Team num    %d of %d teams  \n", team_id,   num_teams);
>         printf(" Thread num  %d of %d threads\n", thread_id, num_threads);
>         printf(" Global thread %d of %d total threads\n", gthread_id, total_threads);
>       };
>     };
>   #pragma omp taskwait
>     printf("The host device num is %d\n", omp_get_device_num());
>     printf("The initial device num is %d\n", omp_get_initial_device());
>     printf("The number of devices are %d\n", num_devs);
>   }
>
> Trunk before this patch makes a use of uninitialized memory but the test succeeds (prints a lot of stuff).
>
>   ==27099== Conditional jump or move depends on uninitialised value(s)
>   ==27099==    at 0x4C36DC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
>   ==27099==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
>   ==27099==    by 0x4B5B688: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1562)
>   ==27099==    by 0x4B5B8BB: __kmp_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1679)
>   ==27099==    by 0x4B5BB7E: __kmpc_omp_task (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1739)
>   ==27099==    by 0x401309: main
>
> With this patch applied, most of the print output is lost, and the uninitialized data error changes
>
>   The host device num is 1
>   The initial device num is 1
>   ==20091== Thread 9:
>   ==20091== Conditional jump or move depends on uninitialised value(s)
>   ==20091==    at 0x4C3ADC1: __tgt_target_teams_nowait_mapper (llvm-project/openmp/libomptarget/src/interface.cpp:470)
>   ==20091==    by 0x40148E: .omp_task_entry. (in /home/amd/aomp/aomp/test/smoke/devices/devices)
>   ==20091==    by 0x4B5C399: __kmp_invoke_task(int, kmp_task*, kmp_taskdata*) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:1633)
>   ==20091==    by 0x4B60012: int __kmp_execute_tasks_template<kmp_flag_64<false, true> >(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3012)
>   ==20091==    by 0x4B6AE91: int __kmp_execute_tasks_64<false, true>(kmp_info*, int, kmp_flag_64<false, true>*, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_tasking.cpp:3111)
>   ==20091==    by 0x4B79901: kmp_flag_64<false, true>::execute_tasks(kmp_info*, int, int, int*, void*, int) (llvm-project/openmp/runtime/src/kmp_wait_release.h:915)
>   ==20091==    by 0x4B7497C: bool __kmp_wait_template<kmp_flag_64<false, true>, true, false, true>(kmp_info*, kmp_flag_64<false, true>*, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:345)
>   ==20091==    by 0x4B797D9: kmp_flag_64<false, true>::wait(kmp_info*, int, void*) (llvm-project/openmp/runtime/src/kmp_wait_release.h:922)
>   ==20091==    by 0x4B70559: __kmp_hyper_barrier_release(barrier_type, kmp_info*, int, int, int, void*) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:672)
>   ==20091==    by 0x4B7401D: __kmp_fork_barrier(int, int) (llvm-project/openmp/runtime/src/kmp_barrier.cpp:1982)
>   ==20091==    by 0x4B3B701: __kmp_launch_thread (llvm-project/openmp/runtime/src/kmp_runtime.cpp:5776)
>   ==20091==    by 0x4BB976D: __kmp_launch_worker(void*) (llvm-project/openmp/runtime/src/z_Linux_util.cpp:591)
>   ==20091== 
>   The number of devices are 1
>   CUDA error: Error returned from cuDeviceGet
>
> This is more obvious on the amd implementation because it segfaults on a null pointer dereference.

If you take a look at the code around `interface.cpp:470`, it is:

  EXTERN int __tgt_target_teams_nowait_mapper(
      ident_t *loc, int64_t device_id, void *host_ptr, int32_t arg_num,
      void **args_base, void **args, int64_t *arg_sizes, int64_t *arg_types,
      map_var_info_t *arg_names, void **arg_mappers, int32_t team_num,
      int32_t thread_limit, int32_t depNum, void *depList, int32_t noAliasDepNum,
      void *noAliasDepList) {
    TIMESCOPE();
    if (depNum + noAliasDepNum > 0)
      __kmpc_omp_taskwait(loc, __kmpc_global_thread_num(loc));

    return __tgt_target_teams_mapper(loc, device_id, host_ptr, arg_num, args_base,
                                     args, arg_sizes, arg_types, arg_names,
                                     arg_mappers, team_num, thread_limit);
  }

Line 470 is `if (depNum + noAliasDepNum > 0)`. The reason it raises an error is, `depNum` and `noAliasDepNum` are not passed to the function call at all due to the known issue we have in `clang`. Actually, `depNum`, `depList`, `noAliasDepNum`, and `noAliasDepList` are all not passed on the callsite. So your issue encountered probably has nothing to do with this part.

I did try on my local systems with NVIDIA GPUs. I didn't encounter any crash/hang with 1000 runs. The only potential problem is `printf` in the target region doesn't work at all, which I believe has nothing to do with this patch.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D77609/new/

https://reviews.llvm.org/D77609