[llvm] [NVPTX] Preserve v16i8 vector loads when legalizing (PR #67322)

Wed Oct 18 11:21:57 PDT 2023

================
@@ -5294,6 +5295,98 @@ static SDValue PerformEXTRACTCombine(SDNode *N,
   return Result;
 }
 
+static SDValue PerformLOADCombine(SDNode *N,
+                                  TargetLowering::DAGCombinerInfo &DCI) {
+  SelectionDAG &DAG = DCI.DAG;
+  LoadSDNode *LD = cast<LoadSDNode>(N);
+
+  // Lower a v16i8 load into a LoadV4 operation with i32 results instead of
+  // letting ReplaceLoadVector split it into smaller loads during legalization.
+  // This is done at dag-combine1 time, so that vector operations with i8
+  // elements can be optimised away instead of being needlessly split during
+  // legalization, which involves storing to the stack and loading it back.
+  EVT VT = N->getValueType(0);
+  if (VT != MVT::v16i8)
----------------
Artem-B wrote:

> I don't think there is a ld.v2.b32 instruction we could use

V2 ld/st variants do exist: https://github.com/llvm/llvm-project/blob/ddf1de20a3f7db3bca1ef6ba7e6cbb90aac5fd2d/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td#L2919

The code can easily be parametrized by `NumElts`, so the same combine could also cover the V2 case.
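
As a rough illustration of what parametrizing by `NumElts` might look like, here is a standalone sketch, not the code from the PR. It assumes the i8 lanes are packed four to an i32, so a v16i8 load becomes a four-element load and a v8i8 load becomes a two-element one. `VecLoadOpcode`, `PackedLoadPlan`, and `planPackedI8Load` are hypothetical names made up for this example; in the real combine the choice would be between the backend's V2 and V4 load nodes.

```cpp
#include <optional>

// Hypothetical stand-ins for the backend's V2/V4 vector load opcodes.
enum class VecLoadOpcode { LoadV2, LoadV4 };

// Describes how a vector of i8 lanes would be loaded as packed i32 values.
struct PackedLoadPlan {
  VecLoadOpcode Opcode; // which vector load variant to emit
  unsigned NumElts;     // number of i32 results produced by the load
};

// Assumption: four i8 lanes are packed into each i32 result.
// Returns std::nullopt when the lane count does not map onto a
// two- or four-element i32 load, i.e. when the combine should bail out.
inline std::optional<PackedLoadPlan> planPackedI8Load(unsigned NumI8Lanes) {
  if (NumI8Lanes % 4 != 0)
    return std::nullopt;

  const unsigned NumElts = NumI8Lanes / 4;
  switch (NumElts) {
  case 2: // e.g. v8i8 -> two i32 values
    return PackedLoadPlan{VecLoadOpcode::LoadV2, NumElts};
  case 4: // e.g. v16i8 -> four i32 values
    return PackedLoadPlan{VecLoadOpcode::LoadV4, NumElts};
  default:
    return std::nullopt;
  }
}
```

With this shape, the existing `VT != MVT::v16i8` early exit would turn into a check on whether a plan exists for the vector's lane count, and the rest of the combine would use `NumElts` instead of a hard-coded 4.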

https://github.com/llvm/llvm-project/pull/67322