<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/69448">69448</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [MLIR] TODOs in NVGPU and NVVM Dialects
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            mlir
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
            grypp
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          grypp
      </td>
    </tr>
</table>

<pre>
    I am working on the NVGPU and NVVM dialects for NVIDIA Hopper support. After having discussions at the LLVM US meeting, I want to make a list of improvements and opportunities in these dialects. I am looking for feedback and collaboration to refine and expand on the following improvement goals.

**TODOs in NVVM**
- [ ] PTX and SM Versions on the NVVM Ops
- [ ] NVVM Ops for Missing PTX Instructions
- [ ] Using BasicPtxBuilderInterface vs LLVM intrinsic

**TODOs in NVGPU**
- [ ] High-Level Ops for NVGPU Dialect
- [ ] Dynamic Shared Memory Support
- [ ] \(Optional) Splitting Host and Device Transformations in NVGPU/GPU dialect 

Below I have explained the individual tasks in detail.

**[NVVM] PTX and SM Versions on the NVVM Ops**

Unlike LLVM intrinsics, NVVM Ops don't carry PTX or SM version requirements. This information is valuable for emitting compilation errors, generating correct code, and so on. We can encode PTX and SM versions on the Ops or somewhere else in MLIR. Currently, MLIR generates any intrinsic and lets LLVM's ISel complain that the instruction isn't available for that target. Not nice :)

[LLVM intrinsic with SM and PTX info](https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Basic/BuiltinsNVPTX.def#L655) vs [Same MLIR Op missing SM and PTX info](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td#L215-L227)
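
One option is to attach the minimum SM and PTX ISA versions to the Op itself, so the verifier or the lowering can diagnose an unsupported target early instead of deferring to ISel. The sketch below is purely illustrative; the attribute names (`minSM`, `minPTX`) do not exist today and are only meant to show the idea:

```
// Hypothetical sketch: encode the minimum SM architecture and PTX ISA version
// directly on the Op (attribute names are made up for illustration).
// cp.async requires sm_80 and PTX ISA 7.0, so a verifier could reject the Op
// when the compilation target is older than that.
nvvm.cp.async.commit.group { minSM = 80 : i32, minPTX = 70 : i32 }
```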


**[NVVM] NVVM Ops for Missing PTX Instructions**

The graph below illustrates the recent advancements in the PTX ISA. It's worth noting that neither LLVM's NVPTX backend nor MLIR's NVVM dialect fully supports the entire PTX ISA. It would be highly advantageous to expose all PTX instructions and their associated options as NVVM Ops. We have already implemented some of the newer PTX instructions as NVVM Ops, particularly to leverage Ampere and Hopper tensor cores. However, I still believe that some PTX instructions or their options remain absent.

Gaps in the NVVM dialect could lead to duplication among different compilers. For example, OpenAI's Triton generates PTX using a custom approach, and upstream MLIR employs the [BasicPtxBuilderInterface](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/LLVMIR/IR/BasicPtxBuilderInterface.cpp) for similar reasons. We can unify these efforts within the NVVM dialect.

![Evolution of PTX ISA](https://github.com/llvm/llvm-project/assets/3865321/af6238be-6eee-4aa1-8e0c-7f6b3f73771c)

**[NVVM] BasicPtxBuilderInterface vs LLVM intrinsic**

The [BasicPtxBuilderInterface](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/LLVMIR/IR/BasicPtxBuilderInterface.cpp) was implemented to automatically generate PTX instructions as inline assembly from the tablegen Op, without writing C++. This interface also supports PTX predication, which LLVM's NVPTX backend does not (or did not). It has become crucial in today's context for using Hopper's TMA (`cp.async.bulk`), Tensor Core instructions (`wgmma.mma_async`), arrive/wait barriers (`mbarrier`), and so on.

What's the best method for generating PTX instructions? Should we opt for this interface, or should we utilize LLVM intrinsics? In my opinion, the latter is the more suitable choice.

That said, this interface still remains important for generating PTX code quickly, especially for the evolving PTX ISA.

An example of an NVVM Op that utilizes this interface is [`cp.async.bulk`](https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td#L1374-L1408).
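
To make the trade-off concrete, here is a rough sketch of the two lowering paths for the same PTX instruction. The inline-assembly string and constraint list are illustrative and not the exact output of the interface; the point is that the interface path ends in `llvm.inline_asm`, which the backend treats as an opaque blob, whereas the intrinsic path stays a regular intrinsic call that ISel can reason about:

```
// Path 1 - BasicPtxBuilderInterface: the Op is lowered to inline assembly.
// (The asm string and empty constraint list here are illustrative.)
llvm.inline_asm has_side_effects "cp.async.commit_group;", "" : () -> ()

// Path 2 - LLVM intrinsic: the Op maps 1:1 to llvm.nvvm.cp.async.commit.group,
// so the backend sees an intrinsic it knows how to select and schedule.
nvvm.cp.async.commit.group
```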

### TODOs in NVGPU Dialect

**[NVGPU] High-Level Ops for NVGPU Dialect**

The NVVM dialect is intentionally designed to correspond directly to PTX instructions. Generating PTX requires specialized skills and can be highly complex, particularly when dealing with tensor cores. While the complexity serves a purpose, it is challenging for high-level compiler developers to deal with descriptors, fragmented result registers, asynchrony, and more. It demands expertise.

On the other hand, we have the NVGPU dialect, which started incorporating high-level abstractions. Can we leverage the NVGPU dialect to simplify the use of NVVM?

For instance, let's look at the `nvgpu.wargroup.mma` below. It performs a warp-group level GEMM for a `128x128x64` shape with `f32 += f16 * f16` data types. Many can understand this Op, generate it, and start using it. In contrast, the actual PTX code is notably complex. It involves aspects such as asynchrony, iteration over the 64-bit wgmma descriptors, matrix-B transposition, the selection of 'n' (with 'm' and 'k' being constants as per the PTX specification), and grouping multiple `wgmma.mma_async` instructions to complete the given shape.

```
%matrixD = nvgpu.wargroup.mma %wgmmaDescA, %wgmmaDescB, %matrixC {transposeB}:
!nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
!nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
!nvgpu.wargroup.accumulator<fragmented = vector<128x128xf32>>
->
!nvgpu.wargroup.accumulator<fragmented = vector<128x128xf32>>
```

The PTX instructions generated from this Op are as follows. If you are not familiar with PTX, piecing together the following PTX code can be quite challenging.

```
wgmma.fence.sync.aligned;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,...}, %dA, %dB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,...}, %dA+2, %dB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,...}, %dA+4, %dB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f1, %f2,...}, %dA+8, %dB+348, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501,...} %dA+512, %dB, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501,...} %dA+514, %dB+128, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501,...} %dA+516, %dB+256, p, 1, 1, 0, 1;
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 {%f500,%f501,...} %dA+518, %dB+348, p, 1, 1, 0, 1;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 1;
```

**[NVGPU] Dynamic Shared Memory Support**

Currently, there isn't a practical method for using dynamic shared memory, whether in the NVGPU or GPU dialect. One workaround is to create 0-sized global memrefs or LLVM arrays, but this approach seems somewhat ugly (it works because it fits LLVM's NVPTX expectations). I believe we can handle this more elegantly in MLIR and hide the ugliness behind these dialects.


[Example of using 0-sized memref](https://github.com/llvm/llvm-project/blob/main/mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir#L665)
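
For reference, the workaround roughly looks like the sketch below (names and element type are illustrative, not copied from the test): a 0-sized shared-memory global is declared and then retrieved inside the kernel, and LLVM's NVPTX backend treats the zero-sized array in address space 3 as dynamic shared memory.

```
// Illustrative sketch of the 0-sized-global workaround.
memref.global "private" @dynamicSmem : memref<0xf16, 3> = uninitialized

func.func @kernel() {
  // The 0-sized buffer is the dynamic shared memory base; the kernel then
  // views/reinterprets it with the shapes and element types it actually needs.
  %smem = memref.get_global @dynamicSmem : memref<0xf16, 3>
  return
}
```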


**[NVGPU/GPU] Splitting Host and Device Transformations**

In the NVGPU and GPU dialects, some Ops are for the host and some for the device. Host Ops are usually lowered into runtime calls, while device Ops are lowered into PTX or intrinsics. They are conceptually different, and it may be worthwhile to consider splitting the transformation pass.

If you are constructing your pass pipeline in MLIR, having separate host and device passes would be clearer, in my opinion (though I don't feel strongly about this).
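
To illustrate the conceptual split, here is a hedged sketch of the two kinds of Ops side by side (op names are from the upstream GPU dialect; the grouping into host and device is the point, not the exact ops):

```
// Host side: typically lowered to GPU runtime calls (module load, alloc, launch).
%memref = gpu.alloc (%size) : memref<?xf32>
gpu.launch_func @kernels::@gemm blocks in (%c1, %c1, %c1) threads in (%c128, %c1, %c1)

// Device side (inside the kernel body): typically lowered to NVVM intrinsics
// or PTX via the NVVM dialect.
%tid = gpu.thread_id x
```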




</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzUWl1T47jS_jXmpiuuxCYhXHBBYDJLFQzUgZndu1OK3Lb1IkteSU7I-fVvdcvOF-zOnq3Z3TpVkIAjtVrdTz_dLUV4ryqDeJVMF0mWVW7btkmWJdPbM9GF2rorfnS2ssX26g5EAxvrXpWpwBr48u3z01cQpoAv3749QKGERhk8lNbBl7UqlICfbNuiA9-1rXUhhesyoINarElEobzsvFfWeBABQo1wf__tAb4-Q4MYlKmS7AbuYCNMgGChEa8IArTyAWwJqmmdXWODJnjWwvIinVFBoQdlSKLHnV4p8Aa0tbwB0rJELFZCvsb50motVtaJwCoFCw5LZZA_xLeW3rrWslwordZ2Q4IO9IDKCv3R0sn4Nhlf968Z_bw83j7ySLJdfBQ_HkEyXUAyvYWnl1946ecH-IYu2qlfnQ3-2PrTKcNz3t2D8p4UJDl3xgfXSd7Z6aSvPGohvJJP4W3RKV2guzMBXSkkwtpHtygTnDJeyd_dzOenrx_v5idV1aN7XKPeKRgBdBuNdDr-dmtEoyQ818JhAQ_YWLeF54ik08HJ9CbJ5o8t7U_oJLuE51arQBiCn6wPbMdbXCuJ8OKE8aV1Te_nvdpLUqf3GRzucoHabuCOgIuEBC2UwYI9oUyh1qrohIYg_CuLKzAIpT9wejJdsLv_oHMP7Bhfd-4trEmyiwDKSN0VyNKsI2HrQVhntHrFE9f5FF5qRUruDADKw1roTqw0slcqxdEpbdMqHYegc9Z5ikZpnSPrSFtghYYecXQEmabwM4IUBtDQp4c7XJ_skLZgHXjb4KZGR1aEh_u7f6UAN51zaILekmh6BhUadCIgBel2vxWWrTHwDpPswsPd86d7Vpvc4yHUggy0wz0oH40m1kLp3XZ5WBCuwpDCFxvAEEiS_DrJLo88OF0c2xI2KtS0O1KENktGTaa3STavQ2g9i1gm2bJSoe5WqbRNki21Xg9vo9bZ_yPkZ8uVtqskWzZCmSRbSi2I-5a9dw-ecJDSe6d0UMZ_-fb08ktaYJlk-f1sOiXkrz2FxbNoMNrvsYWmZ4ITZeGHaNto5Y6U7R8McZ0tyW53_0qyJeH3sfVpKEjfbDId3WfZxamhPw6YP0Zt70LmpUaonGhrWHEQK607HyKeCIoOJRG3KNbCyD6ZRP6Owp-vU7gLjK-NdaEGY5lVGDYGVajR7SHI_gBKKWgKMKTpPe2cP9rnSCg7rbdDWoyKoAnKxUUbW6CmZWFjO13ACqFWVa23Uc0gKrQdJ6gh_oXW-5kxm4UalQPhvZVKBKT0GPlO-J0tOWSZ1IR2KIotJTPNVsCCw5MSrcENuh40e1MfyqFYbYULSnZaOL0l1TSu0YkKeXOdZ0nXTUvRTur1lUFA460jWkGfwk92Q7Ni2vdBaU1eU7jGaG_W6EQR4BimvQ4bdEjIBLHyaEJ6CgYt5CvpcuQPyXbWKApSveharWSkPtFYLlXKEomXelpE51NYWgf4JshipPFji-b6jn394lSw5oC5nl5-8dAxZAXIzgfbgGhbZ4WsBw7tWh8ciiZGLTattlsO5d_KzD8yerVafRyy_PJbGqSSysVLDkmvGqWFA4fCW-N32aAzqtz29RCWJeOdmFMdJLveDScZc5JMF5_WVnfRzeUQkH9y38J7DD7Jlvl8Ns2zCT0qZ1k-X-FohoijcyEmozmO5eiinK3y8iK_uJjIU3Y65aX_om76kJz-Jxy8Ef6IGoIF0QVLFYQURGUD1D-kCWU0l9HeY7PSWyidbdj5gbJwhQYeW4oCgoXtAoRtS5Fyk2SLJFvsSpbBsEJ7u-dOwh4t2joshqBlsvgNUi4seiJxSLJ5oYikQ5JdMtvWwsMKJZGMdJ1UQnMqsIXYsiBpTcC3wEvGYI4sFmP-4ZpEJrOxbFPht0amq06_JrMxISi7gZdIdTeW650DC8VZm6ppRNo04t88eT9POKfWlFc3QgVY0b_ohllN___B8L4cO8TZz7WIKYyMvkIfoMFQ2yLWe9F1Qzo9VC3Jl_BcMzdukPi1r5gO3UFrUvjvhnVBafWfd5Unyboz0GzBtsooy7VjYEIO1BSqqF1D5vGdYmiArK2SeMLhR2BQvk8UkfUZptYFahk_2BwXpr92Sr7GAhN9i-RnAmXMI4Brq9e72oLS_3H4z6_NwPrEScIMWTCirt-9P7ES6Zlk82NoZJf_bAE2yS_OR_eT8_H8Hcnl8QeOO7vjXu0dIVIT9Uc6vQ958Cgf96YzsZ_TWyiQDypielbUguhtbEZ8aw0_PgVvCp-Pne_w10459NA7Xf2HipxXpfuKibLVvtTiRgLf3pU2mxqpwxOaxHILcFzD_FwrHYueXoIKW_Do1tTAQNu51noOGsXblLXQGk01HEnQ6iPNxhtKDWjREtpo6yh0XLRAL51qQ9-XlU5UPTc79J0O4LBSPmD8mEFXO2u2A0VQnDHpFdgIU3hqa9EF5Y-j7TFmactlbi1MwTzdF40xfx80zfxhrWQNPghHyigjrWtt74WDzYkVVeGDp26EIalHNeORZNq8pwzUFxNDRclZOF8eqkxlGeGA6nlSSGPkPm3t63DOlMzGZl21XboRrnK2a4l4k9k4NglslxYd9cjktI1w7YiHQVT-86eHB_ZWMhtPsvkb_c7Oab6vRYvRQ8lsXOYZUBLLb6GczCDJrumdxhUiCMpzhJgH6mtjqVSgI8WLyB6PLbtql1vpHzZszD4p8SnlJCd8GNhUyNAJvSc7xclOrPaI5t0pQzRHkKRgCB58J2vK1cdIUQHjcRjYNUZ-nJ2PVioAp6tTFDYiOPU2WkFwwvjWehV6ou87EvBI3uzruSS7oI6ciDFaLLto6H8anGQXr_T3CuNhBDs0cDnR9t0Ih3HZ53zfpz92E01pOh0Uhc2HufU4AwfbGydE6FVqjSb68oT-Z-P-p-e-adzyLZCP30MKkmzKi9-il9ek3-GDRf8gyriB5GIxGA4XycUtpYShFu5l80b2Vk_ym559aP0GG4dlkt9EPL4R1rIbyJP8E_1kN39O2uycAf5dacO-hZRd02kRJR4wE0ldo4zPh7gp86yXGM_0dn_9YMGnrtslnncV6xBvxVCo9qFInavvD359CnclbG3Hj6mcLEWjtBIuBv_Tyy-cORRKPi2wFTKFHh8d76K0Tz2_dirgYUpIfwd80XslGokp1xRCc4pM8sXhgB3qjwalzezcTLL562SWlnmWlvROFHWxSLJpOemRWWZJdpOmKYExPikGFBeMXq7bJ7uXcfzrL9dgkR1osZhk839Sl_NDXbLp7J_UZX6oS37-19plOiZZ8a_JTp2dLtNJ9vdg5bt6nP-daPmuNrO_Ey_f1ebPIkbaplHh35GZf4d_qGv9YNSBvA-Z-X1b8fsXQu96iqOLhBBvGYYLAGi59JRCHzbCsaUv-mV8XKbhZWJhGyl8d3BF5al1cFCl-hQeDfIdqXC2MwV3tRakQ6rfxiPPPUel7YqXpgTLNyH
cKwvnxJbrqFUXYtoZzgjBIza-vzChLrPSWy6brHvlMwtB9bAKUKrgT08-qLKXYVclpXC3O1XdxNRDhT03LcrHBhw1VoKMN9zMDDVcrYr-RLeqtDLoafVaDRXrxxeew_3Jp33bHG09GKQvNX5gMxyQSuLljTX93RM3v5-fvr7YeOG65NpiFOzIrNdNGqfl97PZ9HvXEburQgLlH75mfA_Qu0Mc0cRDIJG9-bSbOmgqMbg3HJbgT-hJwWulcfFhaOc77pm13aDjHixYcJ0JqiF3a-37Pk1jL2A39WhKf6l4fHmIWx4orZHYhrjQ7nR8QIkK0IgtVTR8ZxKX4iLbeFWgA78zGx8DHpkKWuGPAXRQaXEjwJWaqeih49HQqhb5jPEArv23DDy2grunnfX6PdNE9PsrFga-1Cg4xPfnVBRnFJDGgg_Ommr3AZHKx2jh17PiKi8u80txhleT2eXFLJvn88uz-mo6LibTQk5meZZPy3E-Ha9mmRiPZxMUczkuztRVNs7yyXgyn0yyWTZPi_Nxnuez_HI1LSfiMkvOx9gIpVOKhdS66kx53-HV7PL8fH6mxQq177_TEZHNX-lwVxw6q67yyflYKx_8XkBQQfP3QNh-09vTo5_d9zz6Uxx_1jl99V9HK-vpk2zJqv5_AAAA___KYx7T">