[llvm] [NVPTX] Lower 16xi8 and 8xi8 stores efficiently (PR #73646)

Wed Dec 13 06:34:18 PST 2023

================
@@ -5557,6 +5557,51 @@ static SDValue PerformLOADCombine(SDNode *N,
       DL);
 }
 
+// Lower a v16i8 (or a v8i8) store into a StoreV4 (or StoreV2) operation with
+// i32 results instead of letting ReplaceLoadVector split it into smaller stores
+// during legalization. This is done at dag-combine time, so that vector
+// operations with i8 elements can be optimised away instead of being needlessly
+// split during legalization, which involves storing to the stack and loading it
----------------
pasaulais wrote:

Note that this comment might be out of date: it looks copied from `PerformLOADCombine`, and that comment was written before the stack optimizations were done.
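
For readers without the patch open, here is a rough sketch of the packing idea the quoted comment describes: sixteen i8 elements become four i32 values, so that a single four-element vector store can be emitted. This is plain C++ rather than SelectionDAG code. The helper name `packV16I8` is hypothetical, and the byte order (element 0 in the least significant byte) is an assumption based on a little-endian layout, not something taken from the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of LLVM: packs 16 bytes into four 32-bit
// words. Assumes a little-endian layout, i.e. byte 4*k lands in the least
// significant byte of word k.
std::array<std::uint32_t, 4>
packV16I8(const std::array<std::uint8_t, 16> &Bytes) {
  std::array<std::uint32_t, 4> Words{}; // value-initialized to zero
  for (std::size_t I = 0; I < Bytes.size(); ++I) {
    // Shift each byte into its lane within the containing 32-bit word.
    Words[I / 4] |= static_cast<std::uint32_t>(Bytes[I]) << (8 * (I % 4));
  }
  return Words;
}
```

The v8i8 case would follow the same pattern with two words instead of four.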

https://github.com/llvm/llvm-project/pull/73646