[llvm] [NVPTX][InferAS] assume alloca instructions are in local AS (PR #121710)

Wed Jan 8 12:42:30 PST 2025

================
@@ -9,38 +9,43 @@ define dso_local void @caller_St8x4(ptr nocapture noundef readonly byval(%struct
 ; CHECK-NEXT:    .local .align 8 .b8 __local_depot0[32];
 ; CHECK-NEXT:    .reg .b32 %SP;
 ; CHECK-NEXT:    .reg .b32 %SPL;
-; CHECK-NEXT:    .reg .b32 %r<2>;
-; CHECK-NEXT:    .reg .b64 %rd<13>;
+; CHECK-NEXT:    .reg .b32 %r<4>;
+; CHECK-NEXT:    .reg .b64 %rd<17>;
 ; CHECK-EMPTY:
 ; CHECK-NEXT:  // %bb.0:
 ; CHECK-NEXT:    mov.u32 %SPL, __local_depot0;
 ; CHECK-NEXT:    cvta.local.u32 %SP, %SPL;
 ; CHECK-NEXT:    ld.param.u32 %r1, [caller_St8x4_param_1];
+; CHECK-NEXT:    add.u32 %r3, %SPL, 0;
 ; CHECK-NEXT:    ld.param.u64 %rd1, [caller_St8x4_param_0+24];
-; CHECK-NEXT:    st.u64 [%SP+24], %rd1;
+; CHECK-NEXT:    st.local.u64 [%r3+24], %rd1;
 ; CHECK-NEXT:    ld.param.u64 %rd2, [caller_St8x4_param_0+16];
-; CHECK-NEXT:    st.u64 [%SP+16], %rd2;
+; CHECK-NEXT:    st.local.u64 [%r3+16], %rd2;
 ; CHECK-NEXT:    ld.param.u64 %rd3, [caller_St8x4_param_0+8];
-; CHECK-NEXT:    st.u64 [%SP+8], %rd3;
+; CHECK-NEXT:    st.local.u64 [%r3+8], %rd3;
 ; CHECK-NEXT:    ld.param.u64 %rd4, [caller_St8x4_param_0];
-; CHECK-NEXT:    st.u64 [%SP], %rd4;
+; CHECK-NEXT:    st.local.u64 [%r3], %rd4;
+; CHECK-NEXT:    ld.u64 %rd5, [%SP+8];
+; CHECK-NEXT:    ld.u64 %rd6, [%SP];
+; CHECK-NEXT:    ld.u64 %rd7, [%SP+24];
+; CHECK-NEXT:    ld.u64 %rd8, [%SP+16];
----------------
Artem-B wrote:

> Overall, while this change may interfere with this sort of load elimination for byval callees, this is a fairly minor edge case that can be addressed in a follow up.

This may be more of a problem than it appears. Stores are posted, so they only waste some memory bandwidth without slowing down execution too much. Loads, however, will stall, and for "local" memory that likely means a round trip to the L2 cache or memory and back for a large enough grid.

I'm fairly confident that this change will upset some of our users -- we have been burned by local loads/stores in the past and actively monitor generated code for additional local memory accesses.
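
As a rough illustration of the kind of monitoring meant here (a hypothetical sketch, not the actual tooling; the function name `count_local_accesses` and the regex are assumptions made for this example), one could count the instructions in emitted PTX that explicitly name the `.local` state space. Note that this would not catch generic-address accesses such as the `ld.u64 [%SP+8]` reloads in the hunk above, even though they read from the local depot.

```python
import re

# Matches PTX loads/stores that explicitly name the .local state space,
# e.g. "ld.local.u64" or "st.local.u64", allowing qualifiers in between
# (e.g. "ld.volatile.local.u32"). Generic-address accesses such as
# "ld.u64 [%SP+8]" are NOT matched, even when they hit the local depot.
LOCAL_ACCESS = re.compile(r"\b(ld|st)(?:\.[a-z0-9_]+)*?\.local\b")


def count_local_accesses(ptx: str) -> dict:
    """Count explicit local-space loads and stores in a PTX listing."""
    counts = {"ld": 0, "st": 0}
    for line in ptx.splitlines():
        # Drop trailing "//" comments so commented-out code is not counted.
        code = line.split("//", 1)[0]
        match = LOCAL_ACCESS.search(code)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Comparing these counts before and after a compiler change would flag newly introduced explicit local accesses; catching generic-address reloads would need additional analysis.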

Considering that there's no way to mitigate the impact, I'd prefer that the follow-up improvement either land first (I think we may already have cases where unexpected reloads from local memory happen) or be combined with this change so that both land at the same time.

https://github.com/llvm/llvm-project/pull/121710