<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/56389">56389</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [OpenMP][Offload] runtime fails to launch kernels with more than 32 arguments
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          FabioLuporini
      </td>
    </tr>
</table>

<pre>
## Description

The title is probably self-explanatory. "Fat" OpenMP parallel loops, that is, parallel loops with a relatively large number of symbols, fail to offload and produce the following error messages:

```
...
Too many arguments in kmp_invoke_microtask, aborting execution.
Too many arguments in kmp_invoke_microtask, aborting execution.
CUDA error: unspecified launch failure
Libomptarget error: Call to targetDataEnd failed, abort target.
Libomptarget error: Failed to process data after launching the kernel.
```

This is actually no surprise. I tried to debug it (admittedly only in quite simplistic ways) and eventually ended up [here](https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/DeviceRTL/src/Parallelism.cpp#L71). Both the comments and the [code](https://github.com/llvm/llvm-project/blob/main/openmp/libomptarget/DeviceRTL/include/generated_microtask_cases.gen) make it clear that attempting to offload a kernel with more than 32 symbols will fail.
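
To illustrate the pattern, here is a simplified sketch of a fixed-arity dispatch. It is not the actual DeviceRTL code: the names `invoke_microtask`, `any_fn` and `add2`, the return codes, and the limit of 3 (instead of 32) are all made up to keep it short.

```c
/* Illustrative sketch only: names and the limit of 3 are assumptions,
   not the actual DeviceRTL code (the real limit discussed here is 32). */
enum { MAX_ARGS = 3 };

/* Generic function-pointer type used to carry the outlined function. */
typedef void (*any_fn)(void);

/* Calls fn with nargs pointer arguments unpacked from args.
   Returns 0 on success, -1 when nargs has no hand-written case. */
int invoke_microtask(any_fn fn, void **args, int nargs) {
  if (nargs > MAX_ARGS)
    return -1; /* the "Too many arguments" situation */

  switch (nargs) {
  case 0:
    ((void (*)(void))fn)();
    return 0;
  case 1:
    ((void (*)(void *))fn)(args[0]);
    return 0;
  case 2:
    ((void (*)(void *, void *))fn)(args[0], args[1]);
    return 0;
  case 3:
    ((void (*)(void *, void *, void *))fn)(args[0], args[1], args[2]);
    return 0;
  default:
    /* Negative counts end up here. */
    return -1;
  }
}

/* Hypothetical outlined body: stores the sum of two ints in a global. */
int last_sum = 0;
void add2(void *a, void *b) { last_sum = *(int *)a + *(int *)b; }
```

With this shape, every supported argument count needs its own case, so any call above the highest case can only be rejected at run time.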

## Minimal example

A simple minimal failing example is available [here](https://github.com/FabioLuporini/hpc-bugs/tree/main/nvidiagpu.clang/launch-gt-32-args).
As you'll see, that's really a dummy use case. The real-life examples stem from [Devito](https://github.com/devitocodes/devito), which generates solvers for partial differential equations from a symbolic specification; as you may guess, we very often end up with kernels with tens of terms.

## Comments

I wonder:

* Why doesn't the implementation use variadic functions? What am I missing?
* IIRC this wasn't happening in clang 12 or clang 13. I can see from the commit that ships the new DeviceRTL that something has changed. It feels weird that I'm the first one bumping into this. I apologise if I missed a duplicate issue; I searched but couldn't find any.
* Is there any workaround while we wait for a patch?
* I'm not seeing this issue with clang's AMD plugin or with NVIDIA's nvc (which I thought was based on LLVM, though I might be wrong here).

Thanks for looking into this. Any help would be appreciated. If the patch is as simple as supporting up to, say, 128 arguments, I'm happy to help out :) but I doubt it is.
</pre>