[llvm] [NVPTX] Add Volta Atomic SequentiallyConsistent Load and Store Operations (PR #98551)

Thu Jul 25 15:42:43 PDT 2024

================
@@ -578,47 +1011,134 @@ define void @shared_volatile(ptr addrspace(3) %a, ptr addrspace(3) %b, ptr addrs
   ; CHECK: st.volatile.shared.f64 [%rd{{[0-9]+}}], %fd{{[0-9]+}}
   store volatile double %f.add, ptr addrspace(3) %c
 
+  ; TODO: should be combined into single .u16 op
+  ; CHECK: ld.volatile.shared.v2.u8 {%rs{{[0-9]+}}, %rs{{[0-9]+}}}, [%rd{{[0-9]+}}]
+  %h.load = load volatile <2 x i8>, ptr addrspace(3) %b
+  %h.add = add <2 x i8> %h.load, <i8 1, i8 1>
+  ; CHECK: st.volatile.shared.v2.u8 [%rd{{[0-9]+}}], {%rs{{[0-9]+}}, %rs{{[0-9]+}}}
+  store volatile <2 x i8> %h.add, ptr addrspace(3) %b
+
+  ; CHECK: ld.volatile.shared.u32 %r{{[0-9]+}}, [%rd{{[0-9]+}}]
+  %i.load = load volatile <4 x i8>, ptr addrspace(3) %c
+  %i.add = add <4 x i8> %i.load, <i8 1, i8 1, i8 1, i8 1>
+  ; CHECK: st.volatile.shared.u32 [%rd{{[0-9]+}}], %r{{[0-9]+}}
+  store volatile <4 x i8> %i.add, ptr addrspace(3) %c
+
+  ; CHECK: ld.volatile.shared.u32 %r{{[0-9]+}}, [%rd{{[0-9]+}}]
+  %j.load = load volatile <2 x i16>, ptr addrspace(3) %c
+  %j.add = add <2 x i16> %j.load, <i16 1, i16 1>
+  ; CHECK: st.volatile.shared.u32 [%rd{{[0-9]+}}], %r{{[0-9]+}}
+  store volatile <2 x i16> %j.add, ptr addrspace(3) %c
+
+  ; TODO: should be combined into single .u64 op
+  ; CHECK: ld.volatile.shared.v4.u16 {%rs{{[0-9]+}}, %rs{{[0-9]+}}, %rs{{[0-9]+}}, %rs{{[0-9]+}}}, [%rd{{[0-9]+}}]
+  %k.load = load volatile <4 x i16>, ptr addrspace(3) %d
+  %k.add = add <4 x i16> %k.load, <i16 1, i16 1, i16 1, i16 1>
+  ; CHECK: st.volatile.shared.v4.u16 [%rd{{[0-9]+}}], {%rs{{[0-9]+}}, %rs{{[0-9]+}}, %rs{{[0-9]+}}, %rs{{[0-9]+}}}
+  store volatile <4 x i16> %k.add, ptr addrspace(3) %d
+
+  ; TODO: should be combined into single .u64 op
+  ; CHECK: ld.volatile.shared.v2.u32 {%r{{[0-9]+}}, %r{{[0-9]+}}}, [%rd{{[0-9]+}}]
+  %l.load = load volatile <2 x i32>, ptr addrspace(3) %d
+  %l.add = add <2 x i32> %l.load, <i32 1, i32 1>
+  ; CHECK: st.volatile.shared.v2.u32 [%rd{{[0-9]+}}], {%r{{[0-9]+}}, %r{{[0-9]+}}}
+  store volatile <2 x i32> %l.add, ptr addrspace(3) %d
+
+  ; TODO: should be combined into single .b128 op in sm_70+
+  ; CHECK: ld.volatile.shared.v4.u32 {%r{{[0-9]+}}, %r{{[0-9]+}}, %r{{[0-9]+}}, %r{{[0-9]+}}}, [%rd{{[0-9]+}}]
+  %m.load = load volatile <4 x i32>, ptr addrspace(3) %d
+  %m.add = add <4 x i32> %m.load, <i32 1, i32 1, i32 1, i32 1>
+  ; CHECK: st.volatile.shared.v4.u32 [%rd{{[0-9]+}}], {%r{{[0-9]+}}, %r{{[0-9]+}}, %r{{[0-9]+}}, %r{{[0-9]+}}}
+  store volatile <4 x i32> %m.add, ptr addrspace(3) %d
+
+  ; TODO: should be combined into single .b128 op in sm_70+
+  ; CHECK: ld.volatile.shared.v2.u64 {%rd{{[0-9]+}}, %rd{{[0-9]+}}}, [%rd{{[0-9]+}}]
+  %n.load = load volatile <2 x i64>, ptr addrspace(3) %d
+  %n.add = add <2 x i64> %n.load, <i64 1, i64 1>
+  ; CHECK: st.volatile.shared.v2.u64 [%rd{{[0-9]+}}], {%rd{{[0-9]+}}, %rd{{[0-9]+}}}
+  store volatile <2 x i64> %n.add, ptr addrspace(3) %d
+
+  ; Note: per https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#vectors
+  ; vectors cannot exceed 128-bit in length, i.e., .v4.u64 is not allowed.
+
+  ; TODO: should be combined into single .u64 op
----------------
Artem-B wrote:

Why? If we're talking specifically about v2.f32, loading them ad u64 would not give us any benefits, would it? If anything, we'll need to  generate a more verbose PTX to move from 64-bit integer register into a pair of f32, so we could add another FP value, and then glue and convert them back to u64.

AFAICT the comments here and above would apply only to the vectors exceeding 4 elements, but still under 128bits in total size, but none of the tests fall into that category. 

I would suggest removing the comments for now, or add some extra test cases where they would indicate a furute optimization opportunity. As things stand, I do not see any practical benefits for switching to sngle-element loads/stores for vectors that are 4 elements or less in size. Perhaps I'm missing something.

https://github.com/llvm/llvm-project/pull/98551