<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/118851>118851</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [NVPTX] Incomplete and inconsistent mechanisms for lowering vector loads/stores with sub-32-bit values
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          dakersnar
      </td>
    </tr>
</table>

<pre>
    I started this discussion here: https://github.com/llvm/llvm-project/pull/67322#issuecomment-2518512550, and it was requested that I open a new issue. Let me know if there is anything I need to adjust (labels, formatting, etc.); I'm relatively new to open source LLVM.

CC @Artem-B, here is a simple repro that demonstrates the issue: https://godbolt.org/z/no5fEGo3h.

There are two major pieces to this issue that I'm hoping to solve:
1. Support for efficiently lowering `load/store v8i8` and `store v16i8` is missing from the current backend. These vectors currently get split into two or four scalar loads/stores.
2. Existing lowering for sub-32-bit load/store vectors is handled inconsistently: `load v16xi8` is handled by a DagCombine (https://github.com/llvm/llvm-project/pull/67322), while lowering for `load/store v8x[i16,f16,bf16]` is handled via the type legalizer (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L6322).
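
As a rough host-side illustration of item 1 (this is not backend code, and `pack_v8i8` and `extract_lane` are hypothetical names), a `v8i8` value can be viewed as two packed 32-bit words, which is the shape a single two-element 32-bit vector load would produce. The sketch assumes lane 0 sits in the low byte of the first word:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not part of the NVPTX backend.

// Pack eight i8 lanes into two 32-bit words, with lane 0 in the low byte of
// word 0 (assumed little-endian lane order).
std::array<uint32_t, 2> pack_v8i8(const std::array<uint8_t, 8>& lanes) {
    std::array<uint32_t, 2> words{0, 0};
    for (unsigned i = 0; i < 8; ++i) {
        words[i / 4] |= static_cast<uint32_t>(lanes[i]) << (8 * (i % 4));
    }
    return words;
}

// Recover lane i (0..7) from the packed words by shifting and truncating.
uint8_t extract_lane(const std::array<uint32_t, 2>& words, unsigned i) {
    assert(i < 8);
    return static_cast<uint8_t>(words[i / 4] >> (8 * (i % 4)));
}
```

The point of the sketch is only that the eight byte lanes can travel as two 32-bit elements and be unpacked afterwards, instead of being loaded or stored as separate scalars.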

Given that, I think the goal should be to unify the approaches into a single mechanism, whichever of the two we prefer, and extend it to support the missing vector types.

> A clear demonstration of a valid use case, and a patch demonstrating improved code generation would be welcome, and we can discuss the details there. Without actually implementing a prototype and seeing how clean/messy it is, it's hard to tell ahead of time what would work best here.

I actually have two very rough prototypes working, one with the DagCombine approach and one with the type legalizer approach. The DagCombine approach is very slightly cleaner, and the generated PTX is nearly identical in both cases (the only real difference being whether the generated vector is made up of u32s or b32s). However, the DagCombine approach generates a slightly different SelectionDAG pattern post type-legalization, so it would represent a bit more churn if we were to transition the existing 16-bit vector handling over to it. I was hoping to lean on @Artem-B's experience in this space to see if you had any gut feelings for which approach would be preferred, and also whether you share my thinking that unifying them is a good idea at all.

As a quick example, here is the post-type-legalizer diff of the SelectionDAG for the two approaches, when the input is the "generic_8xi16" function from the Compiler Explorer link.

![Image](https://github.com/user-attachments/assets/c777fa28-2ebb-47df-bb41-7a19b7ccf496)
![Image](https://github.com/user-attachments/assets/d13db882-ec11-4dc7-93d5-6540e2f02233)

Left is with the current Type Legalizer handling and right is with the DagCombine handling (I just changed it for the load, the store is untouched). The PTX ends up near-identical in this simple case, but I'm not sure whether the left side is enabling specific optimizations for `v2x[i/f/bf]16` in other cases, as I know there was a lot of work done in that space that I haven't dug super deep into yet, e.g. https://github.com/llvm/llvm-project/commit/620db1f3dd08ebbba71b0e16f83c11323e04bc05#diff-6fda74ef428299644e9f49a2b0994c0d850a760b89828f655030a114060d075a

Anyway, I can put up a patch for one of the approaches if that would be helpful, or possibly both, if you don't lean in one direction. I'll also note that the DagCombine approach could be rewritten to match the output of the type legalizer approach, and vice versa, so there might be two questions at play here: 1) where should this lowering live, and 2) what should the post-type-legalization output look like.
</pre>