<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/74507>74507</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            openmp cuda offload incorrectly calls cuda from a static initializer
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          fs-nv
      </td>
    </tr>
</table>

<pre>
    Following up on [this](https://forums.developer.nvidia.com/t/ncu-does-not-detect-kernels-error-the-application-returned-an-error-code-11/273975) Nsight Compute profiler forum post: I believe LLVM's OpenMP CUDA offloading incorrectly calls the CUDA API from its library's static initializer. This is easy to see by inspecting the call stack at `cuInit` for a simple application built with `clang++ -fopenmp`.

```
#0  0x00007ffff28d9310 in cuInit () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff42f45f0 in llvm::omp::target::plugin::CUDAPluginTy::initImpl() () from /opt/lib/libomptarget.rtl.cuda.so
#2  0x00007ffff42ff82f in llvm::omp::target::plugin::GenericPluginTy::init() () from /opt/lib/libomptarget.rtl.cuda.so
#3  0x00007ffff42f6529 in llvm::omp::target::plugin::Plugin::Plugin() () from /opt/lib/libomptarget.rtl.cuda.so
#4  0x00007ffff42ffa8e in __tgt_rtl_init_plugin () from /opt/lib/libomptarget.rtl.cuda.so
#5  0x00007ffff7ae374d in RTLsTy::attemptLoadRTL(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RTLInfoTy&) () from /opt/lib/libomptarget.so.18git
#6  0x00007ffff7ae3435 in RTLsTy::loadRTLs() () from /opt/lib/libomptarget.so.18git
#7  0x00007ffff7ae31a8 in init() () from /opt/lib/libomptarget.so.18git
#8  0x00007ffff7fe0b9a in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe2b8, env=env@entry=0x7fffffffe2c8) at dl-init.c:72
#9  0x00007ffff7fe0ca1 in call_init (env=0x7fffffffe2c8, argv=0x7fffffffe2b8, argc=1, l=<optimized out>) at dl-init.c:30
#10 _dl_init (main_map=0x7ffff7ffe190, argc=1, argv=0x7fffffffe2b8, env=0x7fffffffe2c8) at dl-init.c:119
#11 0x00007ffff7fd013a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffffe574 in ?? ()
#14 0x0000000000000000 in ?? ()
```

According to the [CUDA documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization), this is not allowed:

> The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation (or termination after main) will result in undefined behavior.

While this may happen to work in isolation, NVIDIA's CUDA profiler, Nsight Compute, relies on applications using this interface correctly with respect to program and library initialization. In this case, the profiler either crashes or at least fails to profile, because it suspends the app in the first CUDA API call. Since that call happens during program initialization, the application never finishes initializing, which prevents the profiler's frontend process from attaching to it.

The request here is to change LLVM's use of CUDA in OpenMP offloading so that no CUDA API calls are made from static library or program initializers. This would comply with the CUDA documentation and avoid undefined behavior caused by undefined library initialization ordering.
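
To illustrate the kind of change being requested, here is a minimal sketch of deferring driver initialization until the first offload request instead of doing it from a static initializer. It is not the libomptarget implementation: `fake_driver_init`, `ensure_driver_initialized`, `offload_region` and `driver_init_calls` are hypothetical names, and `fake_driver_init` only stands in for `cuInit` so the sketch runs without a GPU.

```cpp
// Hypothetical stand-in for the driver entry point (cuInit in the real
// plugin). It only counts calls so the sketch needs no GPU.
static int g_driver_init_calls = 0;
static void fake_driver_init() { ++g_driver_init_calls; }

// Problematic pattern, shown for contrast and left commented out: a global
// object whose constructor runs during program initialization, before main.
//   struct EagerPlugin { EagerPlugin() { fake_driver_init(); } };
//   static EagerPlugin g_eager;

// Deferred pattern: the function-local static is initialized on the first
// call only (thread-safe since C++11), so nothing runs at load time.
static bool ensure_driver_initialized() {
    static const bool done = (fake_driver_init(), true);
    return done;
}

// Hypothetical entry point for an offload request; the real runtime would
// launch a kernel here. It initializes the driver lazily, on first use.
int offload_region(int x) {
    ensure_driver_initialized();
    return 2 * x;
}

// Accessor so callers can observe how often initialization ran.
int driver_init_calls() { return g_driver_init_calls; }
```

With this shape, `driver_init_calls()` stays at 0 until the first `offload_region` call and stays at 1 afterwards, so a profiler that suspends the process in the first driver call would do so after `main` has started.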

This is the sample application, but any similar one would probably reproduce this, too:
```
#include <cstdio>
#include <omp.h>
#include <cstdlib>

void saxpy(float a, float* x, float* y, int sz) {
        double t = 0.0;
        double tb, te;
        tb = omp_get_wtime();
#pragma omp target teams distribute parallel for map(to:x[0:sz]) map(tofrom:y[0:sz])
        for (int i = 0; i < sz; i++) {
                y[i] = a * x[i] + y[i];
        }
        te = omp_get_wtime();
        t = te - tb;
        printf("Time of kernel: %lf\n", t);
}

int main() {
        auto x = (float*) malloc(1000 * sizeof(float));
        auto y = (float*) calloc(1000, sizeof(float));
        
        for (int i = 0; i < 1000; i++) {
                x[i] = i;
        }
        
        saxpy(42, x, y, 1000);
        
        return 0;
}
```
It is compiled as below, where `sm_75` may need to be replaced with the actual target GPU architecture.
```
clang++ -O3 -fopenmp -fopenmp-targets=nvptx64 saxpy.cpp -o saxpy --offload-arch=sm_75 -fopenmp-offload-mandatory
```
</pre>