[llvm] [NVPTX] Optimize v16i8 reductions (PR #67322)

Mon Sep 25 13:30:04 PDT 2023

================
@@ -52,3 +52,129 @@ define float @ff(ptr %p) {
   %sum = fadd float %sum3, %v4
   ret float %sum
 }
+
+define void @combine_v16i8(ptr noundef align 16 %ptr1, ptr noundef align 16 %ptr2) {
+  ; ENABLED-LABEL: combine_v16i8
+  ; ENABLED: ld.v4.u32
+  ; ENABLED: st.u32
----------------
Artem-B wrote:

Matching only `ld.v4.u32` and `st.u32` does not tell us that we've lowered things correctly. For all we know, the `st.u32` may be storing something completely different from the reduction results. Ideally the test should track how the elements get extracted: the correct BFI or shift/mask sequence.
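
To illustrate the concern, here is a minimal Python sketch (not part of the patch, and not PTX). It models extracting four `i8` elements from one 32-bit word with shift/mask. The helper names `extract_bits`, `unpack_bytes`, `unpack_bytes_wrong`, and `pack_bytes` are hypothetical, and `extract_bits` is only a simplified model of an unsigned bit-field extract. Both the correct and the buggy variant would end in a single 32-bit store, so a check on the store alone could not tell them apart.

```python
def extract_bits(word, pos, length):
    """Simplified model of an unsigned bit-field extract:
    return `length` bits of `word` starting at bit `pos`."""
    return (word >> pos) & ((1 << length) - 1)


def unpack_bytes(word):
    """Correct extraction: byte i lives at bit offset 8 * i."""
    return [extract_bits(word, 8 * i, 8) for i in range(4)]


def unpack_bytes_wrong(word):
    """Buggy extraction (hypothetical): every lane reads offset 0."""
    return [extract_bits(word, 0, 8) for _ in range(4)]


def pack_bytes(lanes):
    """Repack four byte lanes into one 32-bit word, lane 0 in the low bits."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & 0xFF) << (8 * i)
    return word


word = 0x44332211
good = pack_bytes(unpack_bytes(word))
bad = pack_bytes(unpack_bytes_wrong(word))

# Both results are 32-bit values that would be written by one 32-bit store,
# but only the first one round-trips the original data.
assert good == word
assert bad != word
```

A test that also matches the extraction sequence (offsets and widths) would catch the second variant.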

https://github.com/llvm/llvm-project/pull/67322