<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/122668>122668</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            clang-20.0.0 generates invalid code when uploading to gpu

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          bschulz81

      </td>

    </tr>

</table>

<pre>

Attached are two files, main.cpp and mdspan.h, in *.txt format. Please rename main.txt to main.cpp and mdspan.txt to mdspan.h.

mdspan.h is something like an extension of the C++23 mdspan class: it works with extents on the heap as well as on the stack, and it contains code for GPU offloading and some mathematical algorithms (Strassen algorithm, LU, Cholesky and QR decomposition).

It has a member variable, called datastruct, that can be mapped to the GPU.

It also contains various mathematical functions, for example matrix multiplication and the LU decomposition.

The matrix multiplication has flags for GPU offload. If the offload flag is set to true, the mdspan's datastruct object is extracted and offloaded to the GPU, where the matrices are multiplied.
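
For readers without the attachments, here is a minimal sketch of what such an offloaded multiplication can look like with OpenMP target directives. It is not the code from mdspan.h: the function name matmul_offload_sketch and the use of plain row-major arrays instead of the datastruct object are assumptions made for illustration only.

```cpp
// Hypothetical helper, not taken from mdspan.h: computes C = A * B for
// row-major n x n matrices stored in plain arrays.
void matmul_offload_sketch(const double* a, const double* b, double* c, int n)
{
    // map(to: ...) copies a and b to the device before the kernel starts,
    // map(from: ...) copies c back to the host after it finishes.
    // If the compiler is not run with OpenMP enabled, the pragma is ignored
    // and the loops simply run on the host.
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
    for (int i = 0; i != n; ++i) {
        for (int j = 0; j != n; ++j) {
            double sum = 0.0;
            for (int k = 0; k != n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

The copy back of c at the end of the target region is the kind of device-to-host transfer that the runtime error further down reports as failing.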

In the main program in main.cpp, two matrices are multiplied on the GPU, and that works when compiled with clang-20.0.0.

Now uncomment the commented-out lines in main.cpp.

They create a third mdspan object with entirely different data and then run an LU decomposition on it, which I know is possible for this data. A matrix multiplication is involved there.

But the flags of the LU decomposition are set such that it (and the multiplication inside it) won't use the GPU.

If used alone, the LU decomposition works.

But together with the two calls to matrix multiply on the GPU, the program crashes with the following output:

ordinary matrix multiplication, on gpu

"PluginInterface" error: Faliure to copy data from device to host. Pointers: host = 0x00007ffd7e0ec408, device = 0x00007f8647801028, size = 1: Error in cuMemcpyDtoHAsync: an illegal memory access was encountered

omptarget error: Copying data from device failed.

omptarget error: Call to targetDataEnd failed, abort target.

omptarget error: Failed to process data after launching the kernel.

omptarget error: Consult https://openmp.llvm.org/design/Runtimes.html for debugging options.

None of this makes much sense.

How can a program that offloads a calculation to the GPU fail if I add a function that runs on the CPU and has nothing to do with the GPU functions?

Also, when compiling, clang emits the following warnings (and takes a very long time to compile that small file):

[100%] Linking CXX executable arraytest

/usr/bin/cmake -E cmake_link_script CMakeFiles/arraytest.dir/link.txt --verbose=1

nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z16lu_decompositionIdSt6vectorImSaImEEEvR6mdspanIT_T0_ES7_S7_32matrix_multiplication_parametersmbbm_l3121' cannot be statically determined

nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3473' cannot be statically determined

nvlink warning : Stack size for entry function '__omp_offloading_10306_1944f12__Z19matrix_multiply_dotIdSt6vectorImSaImEES2_S2_EbR6mdspanIT_T0_ERS3_IS4_T1_ERS3_IS4_T2_Ebmm_l3432' cannot be statically determined

According to a Google search, such warnings are printed if an algorithm contains recursion. But there is no recursion in the matrix multiply or in the LU decomposition.

What the LU decomposition does, however, is allocate a matrix that is often larger than the array it works on. But in this case, that is done on the CPU.

The matrix multiplication function has a tile option; with it, only submatrices are uploaded to the GPU. If that flag is set, the matrix multiplication fails on its own. Maybe the device mapper insists that the entire array is copied to the device before it can copy it back? But that would not make much sense for very large data.
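
As a point of comparison for the tile case, the following sketch maps only a block of rows using OpenMP array sections, which the map clause accepts for a contiguous part of an array. The name matmul_row_block_sketch and the row-block layout are hypothetical; whether the tile option in mdspan.h maps its submatrices this way is not something this sketch claims.

```cpp
// Hypothetical helper, not taken from mdspan.h: computes only rows
// [r0, r0 + rows) of C = A * B for row-major n x n matrices.
void matmul_row_block_sketch(const double* a, const double* b, double* c,
                             int n, int r0, int rows)
{
    // Only the needed rows of a and c are mapped; b is needed in full.
    // The array sections start at element r0*n and cover rows*n elements,
    // so the original indices stay valid inside the target region.
    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[r0*n : rows*n], b[0:n*n]) map(from: c[r0*n : rows*n])
    for (int i = r0; i != r0 + rows; ++i) {
        for (int j = 0; j != n; ++j) {
            double sum = 0.0;
            for (int k = 0; k != n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Rows of c outside the mapped block are neither written by the kernel nor copied back, so they keep their host values.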

Also, in the above example, the LU decomposition does not copy anything to the GPU at all.

How can a working GPU function fail if I just add code that runs solely on the CPU with entirely different data, when this function is known to work on its own?

[main.txt](https://github.com/user-attachments/files/18392692/main.txt)

[mdspan.txt](https://github.com/user-attachments/files/18392693/mdspan.txt)

</pre>
