[llvm] [NVPTX] Generalize and extend upsizing when lowering 8/16-bit-element vector loads/stores (PR #119622)
Artem Belevich via llvm-commits
llvm-commits at lists.llvm.org
Fri Dec 13 11:26:32 PST 2024
================
@@ -6,19 +6,17 @@ target triple = "nvptx64-unknown-unknown"
define void @kernel_func(ptr %in.vec, ptr %out.vec0) nounwind {
; CHECK-LABEL: kernel_func(
; CHECK: {
-; CHECK-NEXT: .reg .b32 %r<10>;
+; CHECK-NEXT: .reg .b32 %r<14>;
; CHECK-EMPTY:
; CHECK-NEXT: // %bb.0:
; CHECK-NEXT: ld.param.u32 %r1, [kernel_func_param_0];
-; CHECK-NEXT: ld.u32 %r2, [%r1+8];
-; CHECK-NEXT: ld.u32 %r3, [%r1];
-; CHECK-NEXT: ld.u32 %r4, [%r1+24];
-; CHECK-NEXT: ld.u32 %r5, [%r1+16];
-; CHECK-NEXT: ld.param.u32 %r6, [kernel_func_param_1];
-; CHECK-NEXT: prmt.b32 %r7, %r5, %r4, 0x4000U;
-; CHECK-NEXT: prmt.b32 %r8, %r3, %r2, 0x40U;
-; CHECK-NEXT: prmt.b32 %r9, %r8, %r7, 0x7610U;
-; CHECK-NEXT: st.u32 [%r6], %r9;
+; CHECK-NEXT: ld.v4.b32 {%r2, %r3, %r4, %r5}, [%r1];
+; CHECK-NEXT: ld.v4.b32 {%r6, %r7, %r8, %r9}, [%r1+16];
----------------
Artem-B wrote:
Interesting. We seem to be loading twice as much data as we did before.
While the generated code is valid, and in practice will run faster (the old code would effectively fetch the same data due to caching, only using twice as many instructions), I'm somewhat concerned that we're now loading unused data.
I think we could previously eliminate unused loads because the vector was split, but once we sign up for loading the whole vector, we have no choice but to load all of its elements.
That may become problematic in corner cases where we need only a few elements of the vector.
We may eventually need to take into account whether individual vector elements are used, and fall back to splitting the vector if only a few of them are.
It may be worth adding a TODO at the point where we make that decision.
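As a minimal sketch of that corner case (the function and names here are illustrative, not taken from the patch's test suite), consider IR that loads a 16-byte vector but uses only one lane:

  define void @use_one_lane(ptr %in, ptr %out) {
    ; Loads all 16 bytes, but only lane 0 is live.
    %v = load <16 x i8>, ptr %in, align 16
    %e = extractelement <16 x i8> %v, i32 0
    store i8 %e, ptr %out
    ret void
  }

With the old split lowering, the loads feeding the dead lanes could be eliminated, leaving a single scalar load; with the whole-vector lowering, all 16 bytes are fetched regardless of how many lanes are used.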
https://github.com/llvm/llvm-project/pull/119622