<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/62764>62764</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [OMPT] Nowait target regions cause missing OMPT thread_begin for at least one thread
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Thyre
      </td>
    </tr>
</table>

<pre>
# [OMPT] Nowait target regions cause missing OMPT thread_begin for at least one thread

## Issue description

While adding OpenMP target instrumentation to our profiling and tracing software [Score-P](https://www.vi-hps.org/projects/score-p/), we ran into a problem with a simple program that uses `#pragma omp target nowait`. Although LLVM does not currently support any callbacks for target offloading, adding this pragma causes problems on the host side.

The runtime appears to create a helper thread that manages the target region. This helper thread does not dispatch `ompt_callback_thread_begin` before dispatching other events, which breaks our tool: it reports an error when it receives callbacks from this thread. We may be able to work around this, but it would be good to have it fixed in the runtime. A reproducer can be found below.

## Reproduce the issue

The issue can be reproduced by compiling the following tool and looking at its output. The tool registers `ompt_callback_thread_begin`, `ompt_callback_thread_end`, `ompt_callback_implicit_task`, and `ompt_callback_parallel_begin` and prints the arguments of each call.
In the thread-begin callback, it assigns a unique thread ID to the calling thread for identification.

```C
#include <assert.h>
#include <inttypes.h>
#include <omp.h>
#include <omp-tools.h>
#include <stdio.h>
#include <stdatomic.h>

__thread int32_t thread_id = -1;
#define INVALID_TASK     6666666
#define INVALID_PARALLEL 7777777

void
thread_begin_cb( ompt_thread_t thread_type,
                 ompt_data_t*  thread_data )
{
    assert( thread_id == -1 );
    static atomic_int_least32_t thread_counter = 1;
    thread_id = atomic_fetch_add( &thread_counter, 1 );
    thread_data->value = thread_id;
    printf( "[%s] tid = %d | type = %d\n", __FUNCTION__, thread_id, thread_type );
}

void
thread_end_cb( ompt_data_t* thread_data )
{
    if ( thread_id == -1 )
    {
        printf( "[%s] tid = %" PRId32 "; WARNING: thread_begin_cb not dispatched; thread_data->value = %" PRId32 " (supposed to be >= 1)\n",
                __FUNCTION__,
                thread_id,
                (int32_t) thread_data->value );
    }
    else if ( thread_data->value != thread_id )
    {
        printf( "[%s] tid = %" PRId32 "; WARNING tid != thread_data->value (%" PRId32 " != %" PRId32 ")\n",
                __FUNCTION__,
                thread_id,
                thread_id,
                (int32_t) thread_data->value );
    }
    else
    {
        printf( "[%s] tid = %" PRId32 "\n",
                __FUNCTION__,
                thread_id );
    }
}

static uint64_t
new_task( void )
{
    static atomic_uint_least64_t task_counter = 6660001;
    return atomic_fetch_add( &task_counter, 1 );
}

void
implicit_task_cb( ompt_scope_endpoint_t endpoint,
                  ompt_data_t*          parallel_data,
                  ompt_data_t*          task_data,
                  unsigned int          actual_parallelism,
                  unsigned int          index, /* For initial tasks, that are not created
                                                  by a teams construct, this argument is 1. */
                  int                   flags )
{
    if ( endpoint == ompt_scope_begin )
    {
        uint64_t old = task_data->value;
        task_data->value = new_task();
        if( old != 0 )
        {
            printf( "[%s] tid = %" PRId32 " | parallel_data = %" PRIu64 " | task_data = %" PRIu64 " (reused: %" PRIu64 ") | endpoint = %d | actual_parallelism = %u | index = %u | flags = %d\n",
                    __FUNCTION__,
                    thread_id,
                    parallel_data == NULL ? INVALID_PARALLEL : parallel_data->value,
                    task_data == NULL ? INVALID_TASK : task_data->value,
                    old,
                    endpoint,
                    actual_parallelism,
                    index,
                    flags );
            return;
        }
    }
    printf( "[%s] tid = %" PRId32 " | parallel_data = %" PRIu64 " | task_data = %" PRIu64 " | endpoint = %d | actual_parallelism = %u | index = %u | flags = %d\n",
            __FUNCTION__,
            thread_id,
            parallel_data == NULL ? INVALID_PARALLEL : parallel_data->value,
            task_data == NULL ? INVALID_TASK : task_data->value,
            endpoint,
            actual_parallelism,
            index,
            flags );
}

void
parallel_begin_cb( ompt_data_t*        encountering_task_data,
                   const ompt_frame_t* encountering_task_frame,
                   ompt_data_t*        parallel_data,
                   unsigned int        requested_parallelism,
                   int                 flags,
                   const void*         codeptr_ra )
{
    static atomic_uint_least64_t parallel_counter = 7770001;
    parallel_data->value = atomic_fetch_add( &parallel_counter, 1 );

    printf( "[%s] tid = %" PRId32 " | parallel_data = %" PRIu64 " | %sencountering_task_data = %" PRIu64 " | flags = %d | requested_parallelism = %u | codeptr_ra = %p\n",
            __FUNCTION__,
            thread_id,
            parallel_data->value,
            ( encountering_task_data == NULL || encountering_task_data->value == 0 ) ? "WARNING " : "",
            encountering_task_data == NULL ? INVALID_TASK : encountering_task_data->value,
            flags,
            requested_parallelism,
            codeptr_ra );
}

static int
my_initialize_tool( ompt_function_lookup_t lookup,
                    int                    initial_device_num,
                    ompt_data_t*           tool_data )
{
    printf( "[%s] tid = %" PRId32 " | initial_device_num %d\n",
            __FUNCTION__,
            thread_id,
            initial_device_num );

    ompt_set_callback_t set_callback =
        ( ompt_set_callback_t )lookup( "ompt_set_callback" );
    assert( set_callback != 0 );

    int return_value;
    return_value = set_callback(ompt_callback_thread_begin, (ompt_callback_t) &thread_begin_cb);
    assert(return_value == ompt_set_always);
    return_value = set_callback(ompt_callback_thread_end, (ompt_callback_t) &thread_end_cb);
    assert(return_value == ompt_set_always);
    return_value = set_callback(ompt_callback_implicit_task, (ompt_callback_t) &implicit_task_cb);
    assert(return_value == ompt_set_always);
    return_value = set_callback(ompt_callback_parallel_begin, (ompt_callback_t) &parallel_begin_cb);
    assert(return_value == ompt_set_always);

    return 1; /* non-zero indicates success */
}

static void
my_finalize_tool( ompt_data_t* tool_data )
{
    printf( "[%s] tid = %" PRId32 "\n",
            __FUNCTION__,
            thread_id );
}

ompt_start_tool_result_t*
ompt_start_tool( unsigned int omp_version,
                 const char*  runtime_version )
{
    setbuf( stdout, NULL );
    printf( "[%s] tid = %" PRId32 " | omp_version %d | runtime_version = \'%s\'\n",
            __FUNCTION__,
            thread_id,
            omp_version,
            runtime_version );
    static ompt_start_tool_result_t tool = { &my_initialize_tool,
                                             &my_finalize_tool,
                                             ompt_data_none };
    return &tool;
}


int main(int argc, char **argv)
{
#pragma omp target nowait
    for(int j = 0; j < 10; ++j){}
#pragma omp taskwait
    return 0;
} // end main
```

Compiling the tool and running it yields the following output:

```bash
> clang -fopenmp -fopenmp-targets=nvptx64 reproducer.c -o reproducer
> OMP_NUM_THREADS=1 ./reproducer
ompt_pre_init(): tool_setting = 1
[ompt_start_tool] tid = -1 | omp_version 201611 | runtime_version = 'LLVM OMP version: 5.0.20140926'
ompt_pre_init(): ompt_enabled = 0
[my_initialize_tool] tid = -1 | initial_device_num 0
[thread_begin_cb] tid = 1 | type = 1
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = 1 | actual_parallelism = 1 | index = 1 | flags = 1
[parallel_begin_cb] tid = -1 | parallel_data = 7770001 | WARNING encountering_task_data = 0 | flags = -2147483646 | requested_parallelism = 8 | codeptr_ra = 0x7f604e36f284
[thread_begin_cb] tid = 3 | type = 2
[thread_begin_cb] tid = 2 | type = 2
[thread_begin_cb] tid = 4 | type = 2
[thread_begin_cb] tid = 5 | type = 2
[thread_begin_cb] tid = 6 | type = 2
[implicit_task_cb] tid = -1 | parallel_data = 7770001 | task_data = 6660002 | endpoint = 1 | actual_parallelism = 8 | index = 0 | flags = 2
[implicit_task_cb] tid = 4 | parallel_data = 7770001 | task_data = 6660003 | endpoint = 1 | actual_parallelism = 8 | index = 4 | flags = 2
[implicit_task_cb] tid = 6 | parallel_data = 7770001 | task_data = 6660004 | endpoint = 1 | actual_parallelism = 8 | index = 1 | flags = 2
[implicit_task_cb] tid = 2 | parallel_data = 7770001 | task_data = 6660005 | endpoint = 1 | actual_parallelism = 8 | index = 3 | flags = 2
[implicit_task_cb] tid = 5 | parallel_data = 7770001 | task_data = 6660007 | endpoint = 1 | actual_parallelism = 8 | index = 6 | flags = 2
[implicit_task_cb] tid = 3 | parallel_data = 7770001 | task_data = 6660006 | endpoint = 1 | actual_parallelism = 8 | index = 2 | flags = 2
[thread_begin_cb] tid = 7 | type = 2
[implicit_task_cb] tid = 7 | parallel_data = 7770001 | task_data = 6660008 | endpoint = 1 | actual_parallelism = 8 | index = 7 | flags = 2
[thread_begin_cb] tid = 8 | type = 2
[implicit_task_cb] tid = 8 | parallel_data = 7770001 | task_data = 6660009 | endpoint = 1 | actual_parallelism = 8 | index = 5 | flags = 2
[implicit_task_cb] tid = -1 | parallel_data = 7777777 | task_data = 6660002 | endpoint = 2 | actual_parallelism = 8 | index = 0 | flags = 2
[implicit_task_cb] tid = 7 | parallel_data = 7777777 | task_data = 6660008 | endpoint = 2 | actual_parallelism = 0 | index = 7 | flags = 2
[implicit_task_cb] tid = 6 | parallel_data = 7777777 | task_data = 6660004 | endpoint = 2 | actual_parallelism = 0 | index = 1 | flags = 2
[implicit_task_cb] tid = 4 | parallel_data = 7777777 | task_data = 6660003 | endpoint = 2 | actual_parallelism = 0 | index = 4 | flags = 2
[implicit_task_cb] tid = 8 | parallel_data = 7777777 | task_data = 6660009 | endpoint = 2 | actual_parallelism = 0 | index = 5 | flags = 2
[implicit_task_cb] tid = 5 | parallel_data = 7777777 | task_data = 6660007 | endpoint = 2 | actual_parallelism = 0 | index = 6 | flags = 2
[implicit_task_cb] tid = 2 | parallel_data = 7777777 | task_data = 6660005 | endpoint = 2 | actual_parallelism = 0 | index = 3 | flags = 2
[implicit_task_cb] tid = 3 | parallel_data = 7777777 | task_data = 6660006 | endpoint = 2 | actual_parallelism = 0 | index = 2 | flags = 2
[implicit_task_cb] tid = -1 | parallel_data = 0 | task_data = 0 | endpoint = 2 | actual_parallelism = 0 | index = 1 | flags = 1
[thread_end_cb] tid = -1; WARNING: thread_begin_cb not dispatched; thread_data->value = 0 (supposed to be >= 1)
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = 2 | actual_parallelism = 0 | index = 1 | flags = 1
[thread_end_cb] tid = 1
[thread_end_cb] tid = 6
[thread_end_cb] tid = 3
[thread_end_cb] tid = 2
[thread_end_cb] tid = 4
[thread_end_cb] tid = 8
[thread_end_cb] tid = 5
[thread_end_cb] tid = 7
[my_finalize_tool] tid = 1
```

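On closer inspection, `thread_end_cb` is called on a thread with `tid = -1` and no value set in its `thread_data`. This thread never dispatched the `thread_begin_cb` callback, yet it also shows up in the `parallel_begin_cb` and `implicit_task_cb` lines above.

Note also that the thread with `tid = -1` creates a parallel region whose parallelism does not adhere to `OMP_NUM_THREADS`: the run above sets `OMP_NUM_THREADS=1`, yet `requested_parallelism` is 8.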

## Solution

Judging from GDB output taken at the error message in our tool, the thread with `tid = -1` appears to be a helper thread used for the `nowait` clause in the example program. The runtime should either dispatch no callbacks for this helper thread and the parallel regions it creates, or also dispatch `ompt_callback_thread_begin` for it so that OMPT tools can instrument it properly.
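
Until the runtime is fixed, a tool could work around the missing `thread_begin` by assigning the thread ID lazily in whichever callback arrives first. The following is only a sketch of that idea, not what Score-P does. The names `lazy_thread_id`, `lazy_thread_counter`, `ensure_thread_id` and `on_any_callback` are hypothetical and do not appear in the reproducer above. The OMPT types are left out so the snippet compiles without `omp-tools.h`.

```C
#include <assert.h>
#include <inttypes.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical names; none of these exist in the reproducer or in Score-P. */
static _Thread_local int32_t lazy_thread_id      = -1;
static atomic_int_least32_t  lazy_thread_counter = 1;

/* Return the calling thread's ID, assigning one on first use.
   *registered_late is set to 1 if the ID had to be created in this call,
   i.e. no earlier callback on this thread (such as thread_begin) did so. */
static int32_t
ensure_thread_id( int* registered_late )
{
    *registered_late = 0;
    if ( lazy_thread_id == -1 )
    {
        lazy_thread_id   = atomic_fetch_add( &lazy_thread_counter, 1 );
        *registered_late = 1;
    }
    assert( lazy_thread_id >= 1 );
    return lazy_thread_id;
}

/* Assumed usage: call this at the top of every callback other than
   thread_begin, and warn when the ID had to be assigned late. */
static void
on_any_callback( const char* callback_name )
{
    int     late;
    int32_t tid = ensure_thread_id( &late );
    if ( late )
    {
        printf( "[%s] tid = %" PRId32 "; WARNING: thread_begin_cb not dispatched, ID assigned lazily\n",
                callback_name,
                tid );
    }
}
```

With this approach `thread_begin_cb` would call `ensure_thread_id` as well instead of asserting `thread_id == -1`. It only hides the symptom in the tool: the helper thread's `thread_data` is still never initialized by a `thread_begin` callback.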

## Environment

I used the following environment to reproduce the issue:

- Pop_OS 22.04
- Intel Core i7-1260P, NVIDIA MX550
- LLVM 15.0.6 / LLVM 16.0.3
- CUDA 12.0

</pre>