[llvm] [AArch64][SVE] Use SVE for scalar FP converts in streaming[-compatible] functions (PR #112564)

Tue Oct 29 07:02:23 PDT 2024

================
@@ -18961,13 +18961,92 @@ static SDValue performVectorCompareAndMaskUnaryOpCombine(SDNode *N,
   return SDValue();
 }
 
+/// Tries to replace scalar FP <-> INT conversions with SVE in streaming
+/// functions, this can help to reduce the number of fmovs to/from GPRs.
+static SDValue
+tryToReplaceScalarFPConversionWithSVE(SDNode *N, SelectionDAG &DAG,
+                                      const AArch64Subtarget *Subtarget) {
+  if (N->isStrictFPOpcode())
+    return SDValue();
+
+  if (!Subtarget->isSVEorStreamingSVEAvailable() ||
+      (!Subtarget->isStreaming() && !Subtarget->isStreamingCompatible()))
+    return SDValue();
+
+  auto isSupportedType = [](EVT VT) {
+    if (!VT.isSimple())
+      return false;
+    // There are SVE instructions that can convert to/from all pairs of these
+    // int and float types. Note: We don't bother with i8 or i16 as those are
+    // illegal types for scalars.
+    return is_contained({MVT::i32, MVT::i64, MVT::f16, MVT::f32, MVT::f64},
+                        VT.getSimpleVT().SimpleTy);
+  };
+
+  if (!isSupportedType(N->getValueType(0)) ||
+      !isSupportedType(N->getOperand(0).getValueType()))
+    return SDValue();
+
+  unsigned Opc = N->getOpcode();
+  bool IsSigned = Opc == ISD::SINT_TO_FP || Opc == ISD::FP_TO_SINT;
+
+  SDValue SrcVal = N->getOperand(0);
+  EVT SrcTy = SrcVal.getValueType();
+  EVT DestTy = N->getValueType(0);
+
+  EVT SrcVecTy;
+  EVT DestVecTy;
+  if (DestTy.bitsGT(SrcTy)) {
+    DestVecTy = getPackedSVEVectorVT(DestTy);
+    SrcVecTy = SrcTy == MVT::i32 ? getPackedSVEVectorVT(SrcTy)
+                                 : DestVecTy.changeVectorElementType(SrcTy);
+  } else {
+    SrcVecTy = getPackedSVEVectorVT(SrcTy);
+    DestVecTy = DestTy == MVT::i32 ? getPackedSVEVectorVT(DestTy)
+                                   : SrcVecTy.changeVectorElementType(DestTy);
+  }
+
+  SDLoc DL(N);
+  SDValue ZeroIdx = DAG.getVectorIdxConstant(0, DL);
+  SDValue Vec = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, SrcVecTy,
+                            DAG.getUNDEF(SrcVecTy), SrcVal, ZeroIdx);
+
+  // Conversions between f64 and i32 are a special case as nxv2i32 is an illegal
+  // type (unlike the equivalent nxv2f32 for floating-point types). So the only
+  // way to lower to these variants is via the intrinsics. Note: We could
+  // sign/zero extend to the i64 variant, but that may result in extra extends
+  // or fmovs in the final assembly.
+  bool IsI32ToF64 = SrcTy == MVT::i32 && DestTy == MVT::f64;
+  bool isF64ToI32 = SrcTy == MVT::f64 && DestTy == MVT::i32;
+  if (IsI32ToF64 || isF64ToI32) {
+    unsigned IntrinsicOpc;
+    if (IsI32ToF64)
+      IntrinsicOpc = IsSigned ? Intrinsic::aarch64_sve_scvtf_f64i32
+                              : Intrinsic::aarch64_sve_ucvtf_f64i32;
+    else
+      IntrinsicOpc = IsSigned ? Intrinsic::aarch64_sve_fcvtzs_i32f64
+                              : Intrinsic::aarch64_sve_fcvtzu_i32f64;
+    SDValue PTrue = getPredicateForVector(DAG, DL, MVT::nxv2f64);
+    Vec = DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, DestVecTy,
+                      {DAG.getConstant(IntrinsicOpc, DL, MVT::i32),
----------------
MacDue wrote:

> This is not much better than the original isel route. Code generation is rarely about implementing a single function that emits exactly what you need, but rather a sequence of smaller cuts that each move the DAG to its final destination.

I get that, but this is not entirely the standard case. This is emulating Neon/FP, which has `f64` <-> `i32` conversions, SVE also has these conversions, but they're currently unused (outside of intrinsics), as `nxv2i32` is an illegal type that is promoted to `nxv2i64`. So you can't rely on the standard lowerings/patterns being in place. 

> You mention using 64-bit converts did not result in the best code generation. What are the resulting deficiencies?
 
I've shown a few such cases in my comments above. The issues were:

- Loads not being promoted to FP registers (likely due to the sign/zero extensions in the way)
- Pointless fmovs between GPRs + sign/zero extends for values that should stay within FP/vector registers
- Additional unavoidable sign/zero extends in some cases 
  - If you can only do i64 -> f64 conversions, and you have an i32, you'll need to extend it

You could probably write a bunch of folds to eventually tidy this up, but given that this will never be used for the general SVE path (since these converts are illegal there), the direct approach made more sense to me. 

The inspiration for using the intrinsics directly partly came from the existing `performFpToIntCombine()`, which sometimes uses the Neon intrinsics directly, so it didn't seem too unreasonable. 

`

https://github.com/llvm/llvm-project/pull/112564