[llvm] [NVPTX] Optimize v2x16 BUILD_VECTORs to PRMT (PR #116675)

Thu Dec 12 02:45:44 PST 2024

================
@@ -6176,6 +6176,57 @@ static SDValue PerformLOADCombine(SDNode *N,
       DL);
 }
 
+static SDValue
+PerformBUILD_VECTORCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI) {
+  auto VT = N->getValueType(0);
+  if (!DCI.isAfterLegalizeDAG() || !Isv2x16VT(VT))
+    return SDValue();
+
+  auto Op0 = N->getOperand(0);
+  auto Op1 = N->getOperand(1);
+
+  // Start out by assuming we want to take the lower 2 bytes of each i32
+  // operand.
+  uint64_t Op0Bytes = 0x10;
+  uint64_t Op1Bytes = 0x54;
+
+  std::pair<SDValue *, uint64_t *> OpData[2] = {{&Op0, &Op0Bytes},
+                                                {&Op1, &Op1Bytes}};
+
+  // Check that each operand is an i16, truncated from an i32 operand. We'll
+  // select individual bytes from those original operands. Optionally, fold in a
+  // shift right of that original operand.
+  for (auto &[Op, OpBytes] : OpData) {
+    // Eat up any bitcast
+    if (Op->getOpcode() == ISD::BITCAST)
+      *Op = Op->getOperand(0);
+
+    if (Op->getValueType() != MVT::i16 || Op->getOpcode() != ISD::TRUNCATE ||
----------------
frasercrmck wrote:

I've tried some simple tests with multiple uses of the truncate and/or the original value.

When you reuse the truncate you appear to increase register pressure (though the SASS remains the same): https://godbolt.org/z/MT8PqshsW

When you reuse the original value the register pressure looks better, indicating the PRMT is worthwhile, though the SASS is the same: https://godbolt.org/z/MqdbT6W59

When you reuse both, the register pressure is still worse, though the SASS remains the same: https://godbolt.org/z/Mc8W763M8

So, even though the SASS remains the same in these simple examples, this indicates we should probably bail out if the truncate has multiple uses. Multiple uses of the original value appear to be alright.
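
For reference, the starting byte selection in the hunk above (`0x10` for `Op0`, `0x54` for `Op1`) can be modelled on the host. This is only a minimal sketch. It assumes PRMT's default mode, where each selector nibble picks one byte from the 8-byte pool `{b, a}`, and it assumes the two per-operand values are combined as `(Op1Bytes << 8) | Op0Bytes`, which the quoted hunk does not show. `prmt` and `packLowHalves` are hypothetical helper names, not LLVM or CUDA APIs.

```cpp
#include <cstdint>

// Hypothetical host-side model of PRMT in its default mode.
// Bytes 0-3 of the pool come from `a`, bytes 4-7 from `b`.
// The sign-replication bit (bit 3 of each selector nibble) is ignored here.
constexpr uint32_t prmt(uint32_t a, uint32_t b, uint32_t selector) {
  const uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
  uint32_t result = 0;
  for (unsigned i = 0; i < 4; ++i) {
    // Low three bits of nibble i choose the source byte for result byte i.
    const unsigned idx = (selector >> (4 * i)) & 0x7;
    const uint32_t byte = static_cast<uint32_t>((pool >> (8 * idx)) & 0xFF);
    result |= byte << (8 * i);
  }
  return result;
}

// Assumed combination of the two per-operand selectors from the hunk:
// 0x10 (low two bytes of op0) and 0x54 (low two bytes of op1) give 0x5410.
constexpr uint32_t packLowHalves(uint32_t op0, uint32_t op1) {
  return prmt(op0, op1, (0x54u << 8) | 0x10u);
}
```

Under these assumptions a single PRMT reads both original i32 values directly, so the truncates are no longer needed by the BUILD_VECTOR. If a truncate has other users it stays live alongside the original value, which would be consistent with the extra register pressure seen in the first and third godbolt links.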

https://github.com/llvm/llvm-project/pull/116675