<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/85770>85770</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [LLVM 18+][OpenMP] OpenMP target offloading will cause missed CUPTI callbacks
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Thyre
      </td>
    </tr>
</table>

<pre>
    [LLVM 18+][OpenMP] OpenMP target offloading will cause missed CUPTI callbacks
===

## Introduction to the issue

Analyzing the performance of applications is important for gaining insight into further optimization potential. Several tools exist for this purpose. For NVIDIA GPUs, there are Nsight Compute and Nsight Systems. With both tools, users can quickly get insight into their application and receive optimization recommendations.

In https://github.com/llvm/llvm-project/pull/74397, it was reported that Nsight Compute shows issues with several different LLVM versions. Here, it was noted that LLVM calls CUDA functions from static initializers, which is not allowed as specified in the documentation:

> The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation or termination after main) will result in undefined behavior.

In addition, other tools like Score-P suffer from issues caused by the early initialization of `libomptarget.so`, which also initializes the OpenMP Tools Interface. More information can be found in https://github.com/llvm/llvm-project/pull/74397. Basically, the OpenMP Tools Interface was initialized during `_dl_init`. If we tried to initialize other adapters, like CUDA or rocProfiler/rocTracer/rocm_smi, we would end up with a segmentation fault. This issue has been resolved on our side: we now try to initialize adapters at a later point in `_start`.

## Changes to the initialization since LLVM 18

In https://github.com/llvm/llvm-project/pull/74397, changes to the device initialization were made. As noted after the PR was merged, all available devices are now initialized eagerly through `PluginAdaptorTy::initDevices`. While this works fine for the OpenMP Tools Interface, it certainly has an impact on tools using CUPTI to analyze user applications.

Let's look at a simple example code:

```c
#include <stdio.h>
#include <omp.h>

int main( void )
{
    #pragma omp target teams num_teams(2)
        {
         printf("omp_is_initial_device() = %d | omp_get_team_num() = %d\n", omp_is_initial_device(), omp_get_team_num());
        }
}
```

Running the user application normally yields these results:

```console
$ clang -fopenmp --offload-arch=native ./reproducer.c
$ ./a.out
omp_is_initial_device() = 0 | omp_get_team_num() = 1
omp_is_initial_device() = 0 | omp_get_team_num() = 0
```

Trying to run the same application with Nsight Compute fails as expected. We see the following error message:

```console
$ ncu ./a.out
==ERROR== Cuda is initialized before the tool, e.g. by calling a Cuda API from a static initializer.
==ERROR== Initializing Cuda during program initialization results in undefined behavior.
==ERROR== See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#initialization
==ERROR== The application returned an error code (1).
==WARNING== No kernels were profiled.
```

These results are all known and were reported in the issues mentioned above. Now for the new part.

### CUDA calls happening before performance tools can initialize

With the changes in the mentioned PR, LLVM initializes devices extremely early during program execution. This includes a call to `cuStreamCreate`. See this backtrace for more information:

```gdb
#0  0x00007ffff3f95c60 in cuStreamCreate(CUstream_st**, unsigned int) ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#1 0x00007ffff3f81ff7 in llvm::omp::target::plugin::CUDAStreamRef::create(llvm::omp::target::plugin::GenericDeviceTy&) ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#2 0x00007ffff3f81be5 in llvm::omp::target::plugin::GenericDeviceResourceManagerTy<llvm::omp::target::plugin::CUDAStreamRef>::resizeResourcePoolImpl(unsigned int, unsigned int) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#3 0x00007ffff3f81889 in llvm::omp::target::plugin::GenericDeviceResourceManagerTy<llvm::omp::target::plugin::CUDAStreamRef>::resizeResourcePool(unsigned int) () from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#4 0x00007ffff3f7db96 in llvm::omp::target::plugin::CUDADeviceTy::initImpl(llvm::omp::target::plugin::GenericPluginTy&) ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#5 0x00007ffff3f9a3dd in llvm::omp::target::plugin::GenericDeviceTy::init(llvm::omp::target::plugin::GenericPluginTy&) ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#6 0x00007ffff3fa04b0 in llvm::omp::target::plugin::GenericPluginTy::initDevice(int) ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#7 0x00007ffff3fa0bab in __tgt_rtl_init_device ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.rtl.cuda.so
#8 0x00007ffff5edfbfd in DeviceTy::init() ()
   from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#9 0x00007ffff5efb8ed in PluginAdaptorTy::initDevices(PluginManager&) ()
 from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#10 0x00007ffff5efc169 in PluginManager::registerLib(__tgt_bin_desc*) ()
 from /home/jreuter/Projects/Compilers/llvm-project/_build/_install/lib/x86_64-unknown-linux-gnu/libomptarget.so.19.0git
#11 0x000055555555618d in omp_offloading.descriptor_reg ()
#12 0x00007ffff3c29ebb in call_init (env=<optimized out>, argv=0x7fffffffa788, argc=1) at ../csu/libc-start.c:145
#13 __libc_start_main_impl (main=0x555555556290 <main>, argc=1, argv=0x7fffffffa788, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
 stack_end=0x7fffffffa778) at ../csu/libc-start.c:379
#14 0x00005555555561c5 in _start ()
```

The initial caller is `omp_offloading.descriptor_reg`. This function is created [here](https://github.com/llvm/llvm-project/blob/d9c31ee9568277e4303715736b40925e41503596/llvm/lib/Frontend/Offloading/OffloadWrapper.cpp#L194) with a priority of `1`. The reason for this is given as well:
```c
  // Add this function to constructors.
  // Set priority to 1 so that __tgt_register_lib is executed AFTER
  // __tgt_register_requires (we want to know what requirements have been
  // asked for before we load a libomptarget plugin so that by the time the
  // plugin is loaded it can report how many devices there are which can
  // satisfy these requirements).
```
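
To illustrate the ordering described in that comment, here is a minimal sketch of constructor priorities in plain C. It assumes GCC or Clang on an ELF platform such as Linux. All function names are hypothetical, and the priorities 101 and 102 are stand-ins, since user code is expected to stay above 100 while the compiler-generated descriptor registration uses `1`. Constructors with a lower priority number run first, constructors without a priority run after them, and all of them run before `main`.

```c
#include <string.h>

/* Hypothetical log that records the order in which constructors ran. */
static char init_log[128];

static void log_event(const char *name)
{
    /* Append "name;" without overflowing the buffer. */
    strncat(init_log, name, sizeof init_log - strlen(init_log) - 1);
    strncat(init_log, ";", sizeof init_log - strlen(init_log) - 1);
}

/* No priority given: runs after all prioritized constructors. */
__attribute__((constructor))
static void default_ctor(void)
{
    log_event("default");
}

/* Higher number, so it runs after prio101_ctor. */
__attribute__((constructor(102)))
static void prio102_ctor(void)
{
    log_event("prio102");
}

/* Lowest number in this file, so it runs first. */
__attribute__((constructor(101)))
static void prio101_ctor(void)
{
    log_event("prio101");
}

/* Returns the recorded order; it is already complete when main starts. */
const char *startup_log(void)
{
    return init_log;
}
```

Calling `startup_log()` as the first statement of `main` returns the full log, which shows that everything registered this way has already run by then. A descriptor registration with priority `1` therefore runs ahead of practically any constructor a tool could register itself.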

As the initialization is happening this early, a performance infrastructure tool has no chance to catch this. With Score-P, one can observe the following result:

```console
$ scorep --cuda --thread=omp:ompt clang -fopenmp --offload-arch=native reproducer.c
$ ./a.out
[Score-P] src/adapters/cuda/scorep_cupti4_activity.c:572: Warning: [CUPTI Activity] Context not found! 
omp_is_initial_device() = 0 | omp_get_team_num() = 1
omp_is_initial_device() = 0 | omp_get_team_num() = 0
[Score-P] src/adapters/cuda/scorep_cupti4_activity.c:131: Warning: [CUPTI Activity] Context not found! Cannot flush context ...
[Score-P] src/measurement/scorep_definition_cube4.c:771: Warning: Given metric name "CUDA Memory" was changed to "CUDA_Memory" for CubePL processing, i.e., .spec file and CubeGUI derived metrics processing. Given name still used for display. Note, profiling only.
```

Score-P requires information about created streams to correctly associate and evaluate buffers allocated for the CUPTI adapter. As `cuStreamCreate` is called before Score-P has a chance to initialize CUPTI, we miss this function call. This leads to OpenMP target events missing from our measurements entirely.
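
The ordering problem can be modeled with a small sketch in plain C. It does not use the real CUDA or CUPTI APIs, and every name in it is hypothetical. It only shows why a tool that subscribes to stream creation after the runtime has already created a stream cannot attribute later activity on that stream.

```c
#include <stddef.h>

/* Hypothetical model only: no real CUDA or CUPTI calls are involved. */

typedef void (*stream_cb)(int stream_id);

static stream_cb subscriber = NULL;  /* tool callback, unset until the tool starts */
static int known_streams[8];         /* streams the tool was told about */
static size_t num_known = 0;

/* Tool side: remember each stream whose creation was observed. */
static void tool_on_stream_create(int stream_id)
{
    if (num_known < sizeof known_streams / sizeof known_streams[0])
        known_streams[num_known++] = stream_id;
}

/* Tool side: start listening for stream creation. */
void tool_init(void)
{
    subscriber = tool_on_stream_create;
}

/* Runtime side: create a stream and notify whoever is subscribed right now. */
void runtime_create_stream(int stream_id)
{
    if (subscriber)
        subscriber(stream_id);
}

/* Tool side: can activity on this stream be attributed to a known stream? */
int tool_knows_stream(int stream_id)
{
    for (size_t i = 0; i < num_known; ++i)
        if (known_streams[i] == stream_id)
            return 1;
    return 0;
}
```

If `runtime_create_stream(1)` runs before `tool_init()`, then `tool_knows_stream(1)` stays `0` for the rest of the run, while a stream created afterwards is tracked as expected. This mirrors the "Context not found" warnings above, under the assumption that the real adapter keeps a comparable lookup table.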

![image](https://github.com/llvm/llvm-project/assets/14841361/a1e91633-272f-48bb-a194-f7a696130fcc)

We can compare this to LLVM 17.0.6, where everything is working perfectly fine.

![image](https://github.com/llvm/llvm-project/assets/14841361/c8a1c124-871f-4b11-af69-b62e403bdceb)

From our side, there's nothing we can do about that. Initializing any earlier would put us back in `_dl_init`, where initializing CUPTI simply causes a segmentation fault. We might be able to work around the missed stream creation by handling it once we detect that we missed it, but this ignores the other CUDA function calls we missed as well and is not guaranteed to work.

### Appendix: Build Score-P to reproduce issue

Score-P v8.4 does not contain the fix that allows using CUDA and our OMPT adapter simultaneously. The following steps build Score-P master so the issue can be reproduced:

```console
$ wget https://perftools.pages.jsc.fz-juelich.de/cicd/scorep/branches/master/latest.tar.gz
$ tar -xf latest.tar.gz --strip-components=1
$ mkdir _build && cd _build
$ NVCC="nvcc -allow-unsupported-compiler" ../configure --enable-debug --enable-shared --with-nocross-compiler-suite=clang --without-mpi --without-shmem --prefix=$(pwd)/_install
$ make && make install
$ export PATH=$(pwd)/_install/bin:${PATH}
```
</pre>