[llvm] [NVPTX] Generalize and extend upsizing when lowering 8/16-bit-element vector loads/stores (PR #119622)

Drew Kersnar via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 13 13:49:10 PST 2024


================
@@ -6,19 +6,17 @@ target triple = "nvptx64-unknown-unknown"
 define void @kernel_func(ptr %in.vec, ptr %out.vec0) nounwind {
 ; CHECK-LABEL: kernel_func(
 ; CHECK:       {
-; CHECK-NEXT:    .reg .b32 %r<10>;
+; CHECK-NEXT:    .reg .b32 %r<14>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    ld.param.u32 %r1, [kernel_func_param_0];
-; CHECK-NEXT:    ld.u32 %r2, [%r1+8];
-; CHECK-NEXT:    ld.u32 %r3, [%r1];
-; CHECK-NEXT:    ld.u32 %r4, [%r1+24];
-; CHECK-NEXT:    ld.u32 %r5, [%r1+16];
-; CHECK-NEXT:    ld.param.u32 %r6, [kernel_func_param_1];
-; CHECK-NEXT:    prmt.b32 %r7, %r5, %r4, 0x4000U;
-; CHECK-NEXT:    prmt.b32 %r8, %r3, %r2, 0x40U;
-; CHECK-NEXT:    prmt.b32 %r9, %r8, %r7, 0x7610U;
-; CHECK-NEXT:    st.u32 [%r6], %r9;
+; CHECK-NEXT:    ld.v4.b32 {%r2, %r3, %r4, %r5}, [%r1];
+; CHECK-NEXT:    ld.v4.b32 {%r6, %r7, %r8, %r9}, [%r1+16];
----------------
dakersnar wrote:

Yeah, I noticed this and came to the same conclusion. Like you said, overall this is a reduction in the total number of load instructions, so I think it is a net win.

As for possible corner cases where it wouldn't be a win, my gut says it is unlikely that we will find many end-to-end cases that arrive at the backend with a substantial percentage of unused loads. DCE should run a few times before the vectorizer, right? So unless something happens _after_ the vectorizer runs that leaves those loads unused, I think we can generally trust that vector loads handed to the backend should be lowered as given whenever possible. A rough sketch of the kind of corner case I mean is below.
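Something like this (a hypothetical IR snippet, not taken from this PR's tests) would be the shape of the corner case: a wide small-element load where only one lane is actually consumed. With the widened lowering it would still be emitted as full vector loads, even though most of the loaded bytes are dead. In practice I'd expect DCE/InstCombine to have shrunk such a load long before it reaches the backend.

```llvm
target triple = "nvptx64-unknown-unknown"

; Hypothetical example: only lane 0 of the <8 x i16> load is used.
define i16 @mostly_unused(ptr %p) {
  %v = load <8 x i16>, ptr %p, align 16
  %e = extractelement <8 x i16> %v, i32 0
  ret i16 %e
}
```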

https://github.com/llvm/llvm-project/pull/119622