[llvm] [NVPTX] Lower 16xi8 and 8xi8 stores efficiently (PR #73646)

Wed Nov 29 10:25:42 PST 2023

================
@@ -5557,6 +5557,51 @@ static SDValue PerformLOADCombine(SDNode *N,
       DL);
 }
 
+// Lower a v16i8 (or a v8i8) store into a StoreV4 (or StoreV2) operation with
+// i32 results instead of letting ReplaceLoadVector split it into smaller stores
+// during legalization. This is done at dag-combine time, so that vector
+// operations with i8 elements can be optimised away instead of being needlessly
+// split during legalization, which involves storing to the stack and loading it
----------------
Artem-B wrote:

Nice. The legalizer's assumption that stack loads/stores are cheap is indeed a rather bad misoptimization for NVPTX.
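
For illustration only (this is not code from the PR): a minimal host-side sketch of the packing idea the quoted comment describes, where 16 i8 elements are grouped into four 32-bit words so they can be written as one 4-element vector store rather than many small ones. The helper name `packBytes` and the little-endian byte order within each word are assumptions made for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of LLVM: pack 16 bytes into four 32-bit
// words. Byte 4*i + j lands in bits [8*j, 8*j + 8) of word i, i.e. the
// lowest-indexed byte of each group occupies the least significant bits.
std::array<std::uint32_t, 4>
packBytes(const std::array<std::uint8_t, 16> &bytes) {
  std::array<std::uint32_t, 4> words{};
  for (std::size_t i = 0; i < 4; ++i) {
    for (std::size_t j = 0; j < 4; ++j) {
      words[i] |= static_cast<std::uint32_t>(bytes[4 * i + j]) << (8 * j);
    }
  }
  return words;
}
```

The four resulting words correspond to the four 32-bit operands of a single 4-element vector store; the v8i8 case would use two words in the same way.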

https://github.com/llvm/llvm-project/pull/73646