<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/67664>67664</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[OMPT] No sign of `target tasks` for target regions in OMPT callbacks
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
Thyre
</td>
</tr>
</table>
<pre>
## Description
With LLVM 17.0, parts of the OMPT interface for target device activity were introduced. This includes the corresponding callbacks, with the device tracing interface still being absent. The introduced callbacks include:
- `ompt_callback_device_initialize`
- `ompt_callback_device_finalize`
- `ompt_callback_device_load`
- `ompt_callback_device_target[_emi]`
- `ompt_callback_device_target_data_op[_emi]`
- `ompt_callback_device_submit[_emi]`
This significantly improves the general situation with OMPT. While the callbacks cannot reliably be used for profiling (which is why the device tracing interface even exists), we can at least get an idea of the kernels being executed and so on. For the device tracing interface, there's an implementation in the AMD fork which allows to use the device tracing interface for AMD GPUs.
However, most events do not help us in regards to events happening on the host while target kernels are executed. This may be due to task events missing which are related to target kernels.
The OpenMP 5.2 specification notes several things which relate to those tasks. Some examples:
- Page 13, L18
> **target task** A mergeable and untied task that is generated by a device construct or a call to a device memory routine and that coordinates activity between the current device and the target device.
- Page 24
> When a target construct is encountered, a new target task is generated. The target task region encloses the target region. The target task is complete after the execution of the target region is complete.
- Page 280 (also present for other target directives)
> The target enter data construct generates a target task. The generated task region encloses the target enter data region. If a depend clause is present, it is associated with the target task. If the nowait clause is present, execution of the target task may be deferred. If the nowait clause is not present, the target task is an included task.
- Page 281
> Callbacks associated with events for target tasks are the same as for the task construct defined in Section 12.5; (flags & ompt_task_target) always evaluates to true in the dispatched callback.
- Page 285
> Events associated with a target task are the same as for the task construct defined in Section 12.5. [...]
> The target-begin event occurs after creation of the target task and completion of all predecessor tasks that are not target tasks for the same device. The target-begin event is a target-task-begin event.
Looking at the following source code with an OMPT interface:
```c
int main( void )
{
#pragma omp target
{
}
return 0;
}
```
I would expect to see the following events (manually constructed):
```
[thread_begin_cb] tid = 1 | type = initial
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = begin | actual_parallelism = 1 | index = 1 | flags = initial
[task_create_cb] tid = 1 | encountering_task_data = 6660001 | new_task_data = 6660002 | flags = target_explicit_undeferred | has_dependences = 0 | codeptr_ra = 0x557bf73bf57a
[task_schedule_cb] tid = 1 | prior_task_data = 6660001 | prior_status = switch | next_task_data = 6660002
[device_initialize_cb] tid = 1
[device_load_cb] tid = 1
[target_emi_cb] tid = 1 | endpoint = begin | kind = target | device_num = 0 | task_data = 6660002 | target_task_data = 0 | target_data = 0 | codeptr_ra = 0x557bf73bf71e
[target_submit_emi_cb] tid = 1 | endpoint = begin
[target_submit_emi_cb] tid = 1 | endpoint = end
[target_emi_cb] tid = 1 | endpoint = end | kind = target | device_num = 0 | task_data = 6660002 | target_task_data = 0 | target_data = 0 | codeptr_ra = 0x557bf73bf71e
[task_schedule_cb] tid = 1 | prior_task_data = 6660002 | prior_status = complete | next_task_data = 6660001
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = initial
[thread_end_cb] tid = 1
[my_finalize_tool] tid = 1
[device_finalize_cb] tid = 1
```
However, the task events are missing, leaving the following events:
```
[thread_begin_cb] tid = 1 | type = initial
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = begin | actual_parallelism = 1 | index = 1 | flags = initial
[device_initialize_cb] tid = 1
[device_load_cb] tid = 1
[target_emi_cb] tid = 1 | endpoint = begin | kind = target | device_num = 0 | task_data = 6660001 | target_task_data = 0 | target_data = 0 | codeptr_ra = 0x55e191c895d9
[target_submit_emi_cb] tid = 1 | endpoint = begin
[target_submit_emi_cb] tid = 1 | endpoint = end
[target_emi_cb] tid = 1 | endpoint = end | kind = target | device_num = 0 | task_data = 6660001 | target_task_data = 0 | target_data = 0 | codeptr_ra = 0x55e191c895d9
[implicit_task_cb] tid = 1 | parallel_data = 0 | task_data = 6660001 | endpoint = end | actual_parallelism = 0 | index = 1 | flags = initial
[thread_end_cb] tid = 1
[my_finalize_tool] tid = 1
[device_finalize_cb] tid = 1
```
## Consequences of the missing events
Right now, we're in the process of implementing support for the target events of the OMPT interface to Score-P. While we are able to add support for the device tracing interface and get events for data transfers and kernels, the missing task events prevent us both from determining the user waiting times because of synchronous target regions and analysing trace files to identify potential performance deficencies in the instrumented application. To show the issue, we can look at the following Vampir screenshot of an application instrumented with a development snapshot of Score-P:
![OpenMP Target example with OMPT interface using AMD llvm-project fork](https://github.com/llvm/llvm-project/assets/14841361/0e6e3333-d342-4093-9124-9fd73876ffb2)
The host timeline only shows the execution of a single function while multiple target kernels are executed. We will not know if the host did wait at any point of the function execution. We can compare this to a similar CUDA version, which shows host events. We use this to determine synchronization points for example:
![CUDA example with OMPT & CUDA adapters using LLVM 15.0.6](https://github.com/llvm/llvm-project/assets/14841361/eafa2bf6-d2a7-4096-a426-6b4a6db76f9d)
## Why not just use `ompt_callback_target`
Even with adding host events to `ompt_callback_target[_emi]`, there are some OpenMP functions like `omp_target_alloc` or `omp_target_memcpy` which will not show up since they only dispatch `ompt_callback_target_data_op[_emi]`. In addition, there are internal concerns that `target` callbacks may get dispatched out of order or too early. This is supported by a recent comment from NVIDIA while reporting an issue in NVHPC.
> Moreover , the method of measuring time with help of the OMPT callbacks is imprecise. The LLVM implementation issues a stream synchronization earlier than NVHPC does, but it doesn't have to. A stream synchronization can be issued after all target OMPT events belonging to a single 'target' region without violating the OpenMP specification, as a way of pipelining device operations.
## What would I expect
In the case of undeferred tasks spawned during target events, the runtime could maybe just pass call the corresponding callbacks at the beginning (`ompt_callback_task_create` and `ompt_callback_task_schedule`) and end (`ompt_callback_task_schedule`) to the OMPT interface to allow host events to be recorded. Async `nowait` regions are more compilated, as LLVM currently uses, as far as I know, non-OpenMP helper threads for managing those regions. Those threads are currently not dispatching all events, causing #62764. In that case, the handling could match the one done for OpenMP tasking already, with some events dispatched by the helper threads.
In the case that OpenMP target regions are combined as an optimization of the compiler, I would also expect to see a lower amount of callbacks, only showing the runtime behavior.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzsWt9z2zby_2volx1rJEqirAc_2HH8rWeSNvNtmjx6QGApogYBHgBKUf_6mwVAipKlpGl7Nzd3nek4FQns7_0ssEvmnNxoxNtseZ8tH65Y52tjbz_We4tXpRH72yyfZ_kcHtBxK1svjc6mD9n0Lv79LH0N7959eg-z1WSa5W-gZdY7MBX4GuGn9x8-gtQebcU4QmUseGY36EHgVnIExr3cSr-HHVqkldaIjqOYwMdaOpCaq06gC8S4sRZda7SQegOcKVUy_uKI6Y7EoDWJrLeM06IDa-elUlAiPWWlQ-2JxZjlgWLPNpvfjXW9hqyYmqb1z_3K58juWWrpJVPyN8yK6bcWV1L_3qXKMPE7lkWTZsv7Z2xktnz43VueBfPs2bTftdV1ZSPPcYt_g98oqGQlOdNe7UE2rTXb5MUNarRMgZO-YxRO0XkUKRP4XEuF0dmDMzjT2niwqCQr1R5KhM6hCMHUWlNJRT7N8ptdLXkN0sGu3n89GHCLGvCLdN5l-ToEEHHUwDwoZM4DhSjTIAWyPpZf0GpULsUQfkHeeRTAtABnwOgJPFJ4f4UvMfI1WszylQvkm1Zhg9pHQ0gdtt-9fyDlXiAqxJQyOwfekNpf14tMQrv_78MvbhLd8YPZ4RYtsW6M80F170AYIKPWqFroKODB4oZZEfikNTVrW9TEwUTBaiKwiy6KWdzbhFkcLJJSt2HBVaJDIumZe-npNtI5oprUs0i-ZWTLsHBMeALHoYXwU4v6_QdYTnJwLfIQZMF42nh04EhZpsDXUm9cYhHJB-q1IRsy9-Im8LNpSGpGTnBDrl_DB7ZBmM3JZO9mN_EpAEA2fwtZfpfld0lIohMfwB00aDfISoUhIjrtJSlEavuaeQrLGPmkZ7kH1nuRG-287bgHY4GFuCdJh_cNNsbuwZrOSx2JB4LcGCukZqT1gKIl-h1i9BbvrEV9QNqwEY_xd3Kkc744VvZzjRpYv-MgqHSAmpuO4g4F2YmBxh2MzHKkb0Ta8VuLG_IZaq6MS8CQ3sdXr7dIB9yQqzwCqzzGVItBR7RSlh5RGW86UfVmSpDBlDPQWqR6ELLHUH4ONpIWybIYUOLINCPpkMwAhKQjE_W6u4P9SIuo1iEQvmWMEe3eLk9ViI0WtQCuGGGCdL0O5AsZHMScM1wGJkNxPBLkKdpLmx2T_jylS8YNUvf5jRVaSz6-RJBwZkT0lI6MSBgrbrTIqatmx7Z_M5SGUyUTwIyOGSHVA8YQX8caBJYW1BEIRk4TWEmNgtDwZ3K80TDLJ8tsfk-xUim2cZDlBYSySHv7ypuvgakd2zvALVNd8Duhje2wB3UhXcs8r0fnjFM1l8dqvo3KnOp4FE9_UrMJZMv7yWRCVfxCeF-XuJE6WhYM5511Kf24RXYpOAhsUuKlFQRrrUWBHJ0L_iHHBCQjHShIjlzWKxIUS2B1SSx5yLJr2j1-ORkXkHfGvIQDoA-0K0OllR4409mAxQKTlfXJ2fXkLEiHnvAfj7-l9tAwqbP8BrZGChgQI1vdj2ybz1vLNg2jGEoij96Ol4YHo1cjD1n0ndUwzeb3PY-HE7HGsj7BznRKAH5pkXsKTId4YoCUOVl-0zDdMaX2h-ghiF8f9D9hsbz3tUUmnoPRn3mZLR_AkwXmDzCDbPUG_L7F8DOdk4eddP6RXKZkOre1ZZYphSqcVMPzaSRJG4ZnRVFMp9O4A7VoDbmDXsRAoMeM-46p556edM2Ii9QCv4x-p1Q_I3EUlCIfz8o71EWpN8-XhdS4O_s2P2Gfjun4Jdmp0z3ehoU1c8-xFKDm6Eb2oUhuvX22yWhflstVWa3mZbVcsWNtHIFSp87r01pp7FcUie-dZ76L7N1Oel4nJb_481oOAry6P50KcbqSbkQX1_TWauQF35yNjBepxcjY4VniprvmGzGXpzeB8fGC6fjVydPL7lnN8FSfeOH6DrX-DAHU4g_Zk44j_-nW_OPBnp8P9uFA-tVwn_378K73wgW0m34v2kVkR30555r90NB49saor2bvsPI8tXPFa3R_Hc41qVjRuSHdJem1QralUnausP1dvf474Hb2lwEEztYzfrNeivX_Ltz-S635N9Z9N9alVvsbox3-o4tHunS56ntmCc5Gm_5fbmpPt-7YxszylR0una01dN0iIkOrMVx3urY11o9ui7HbEHH1fO_eG_iZG4vXH_o27Q4DBIeelzfAhHhF-GKnki6II5a0PnjcW6ZdhXTJ1KLvAvbg3xthXARaG--AnYPS-BoqaxoQ6NE2UvfloHNoYcdkUN7LBh2UyEODwlTg9prX1mjTueP2URSCaab2ka8NTVap4vVeCjJntYfWePo_pqBFWxnbMM1DZ0Ry1Fyi6_0hw42KvIACWEsJEq7QE_howNVmF1c51-GoKa2MeXl9Zf3EmlZacNwialcbH27Zekz2mF_qHgjcojItPQSnWdtvTc49vefms2x5n5quH1OYxJbpoW8_8msXDHX3_gGU2jbXrTW_0p2zMvYlWz5k-U3tfRvKcf6Y5Y8b6euunHDTZPkj7Uj_9Buz_JE5h95l-eNscbOYzYtZlj9OscD5fD6_FvNFfr2YrufX61m-uF5XYjW_WRVVVeaHC_jQOw79a3K_khrBaLUPRnevO4kMSA-FUHU6tkti27vplJftt_rfn8k2SoWmxos2O5DVoX8upAihSB5lmoKHAC3l3MBukCZQoyigs2bs9kgXG8RONlIxC29-ebiDLVonjQ5hE5reUbP60PMPlOIIIVLoswSHDJC_xbgJMsWsTM4-GxaB8etoyPIiysQEaz3lcoyKOCZcTqaT4i-MBWQVy8uquBY5W1EsFNdskRfXRblghShXRbUWJ7GQcPZzvQ8u-rVzPhjm1cQrNWiOUfrtFtPMiokwiRzZmMx6icrRwKyfBIXIcaYZBht9BDhQ8qUXqR_XMaUMz4opGHvypsGGt3t6Fb0_xF9Ala6leOah6bOPgd_3Iy-Je3Y0OIEnHbT2KdQOOgQI0EwBN5qj1am3lxXTwYijmV7D9hA77ENX1HQhC4wVaEk_bwwgs2rfj4JdX136AYpFTiDGTRPALCD_j5-eHp7uUrZapOWh4acjqhIQ__jphw9vjtqC2fwtvDcWzRYtDMUGfW0EidQgc53tS0d0fRicjQvlaHjswsATuXSpZRkC_3TWR-I4ymJvkTWvUpBUl2HIwZLIIAyGUlh2HqQPP3WWrzzUbEsVeAJ3l6gRgJSpsojUvw2DpghjQYMUwSUqozdBW3PAQeKTet2rfmBBhiCvbaVRzPe1NsXx0XAujIhI2R3bk9Va2RIG05Z0QjAt2rDUTc5mKvOpifmU2phHDc408GKxno-aZLGX7Fq20yhAJC-OTzu9u22ng3d54NKwfYkRGFrmXJrKXf4GoS_P4T6g0zz6TGYNzUNKBzpcnF_TdykCUqzDwnDavUTzZH0YdJ47wIVR8ilelZQonPJOTOCOIoekimMcEnM4C9F121gMlUiGgW3ya4jvNG5Ue4JSl95UzNI_T6EM0jNt9HWKEEqhEOB07o7FpmGaxdALc9rEmZIojG3TShLkwI1AroeRkOpKjVxLh7zojnmRr4pFQLA4P2WuH8hDzbQIXxH03idgpBdGIwj6Q9IlucnikRGJsx8-Pwko3o_XD7hWxm8RjrWdXArfINrA6PgsGk1fhkEOC1Mz03rZ9Dme0Ch6J3ZM-s5_GHMet_8ZKLMjFGhMF88fRx_UDIejPq37_CixZltp7ASuxO1crOdrdoW3s2K9XE_n-aK4qm8rMS94tS6FuJkKVi5ueJ6vMZ-Va7y5WRfsSt7m03w-Xec3s3y2mBaTYlZUi_l0ydZituTTm2wxxYZJNaH6PzF2cxWQ67ZYFcXiSrESlQtfLOW5xl1_YM6z5cOVvQ1nhrLbuGwxVdJ5d6DipVfhUydKDbqV_WjCpyqk_lCpImhQ5I_miL0TpD6B-6vOqtvvPshE9M_yx6DRPwMAAP__Mttt7A">