[llvm] 2750f3e - [IR] Introduce llvm.experimental.vector.splice intrinsic
Cullen Rhodes via llvm-commits
llvm-commits at lists.llvm.org
Tue Mar 9 02:44:45 PST 2021
Author: Cullen Rhodes
Date: 2021-03-09T10:44:22Z
New Revision: 2750f3ed3155aedccf42e7eccec915d6578d18e4
URL: https://github.com/llvm/llvm-project/commit/2750f3ed3155aedccf42e7eccec915d6578d18e4
DIFF: https://github.com/llvm/llvm-project/commit/2750f3ed3155aedccf42e7eccec915d6578d18e4.diff
LOG: [IR] Introduce llvm.experimental.vector.splice intrinsic
This patch introduces a new intrinsic @llvm.experimental.vector.splice
that constructs a vector of the same type as the two input vectors,
based on an immediate whose sign distinguishes two variants. A positive
immediate specifies an index into the first vector and a negative
immediate specifies the number of trailing elements to extract from the
first vector.
For example:
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E> ; index
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -3) ==> <B, C, D, E> ; trailing element count
These intrinsics support both fixed and scalable vectors. Fixed-width
vectors are lowered to a shufflevector to maintain existing behaviour,
although while marked as experimental the recommended way to express
this operation for fixed-width vectors is still to use shufflevector.
For scalable vectors, where it is not possible to express a
shufflevector mask for this operation, a new ISD node has been
implemented.
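For reference, a minimal IR sketch of how the intrinsic is used (the
fixed-width declaration name below follows the same overload mangling
as the added tests and is shown purely for illustration):

  ; Fixed-width: lowered to a shufflevector, e.g. mask <1, 2, 3, 4> for imm 1.
  %r0 = call <4 x i32> @llvm.experimental.vector.splice.v4i32(<4 x i32> %a, <4 x i32> %b, i32 1)

  ; Scalable: no shufflevector mask can express this, hence the new ISD node.
  %r1 = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -3)

  declare <4 x i32> @llvm.experimental.vector.splice.v4i32(<4 x i32>, <4 x i32>, i32)
  declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, i32)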
This is one of the named shufflevector intrinsics proposed on the
mailing list in the RFC at [1].
Patch by Paul Walker and Cullen Rhodes.
[1] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146864.html
Reviewed By: sdesmalen
Differential Revision: https://reviews.llvm.org/D94708
Added:
llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll
llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll
Modified:
llvm/docs/LangRef.rst
llvm/include/llvm/CodeGen/ISDOpcodes.h
llvm/include/llvm/CodeGen/TargetLowering.h
llvm/include/llvm/IR/Intrinsics.td
llvm/include/llvm/Target/TargetSelectionDAG.td
llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
llvm/lib/CodeGen/TargetLoweringBase.cpp
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
Removed:
################################################################################
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index c15101c49bf1..666f7d5b8c5b 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -16510,6 +16510,52 @@ Arguments:
The argument to this intrinsic must be a vector.
+'``llvm.experimental.vector.splice``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+ declare <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %vec1, <2 x double> %vec2, i32 %imm)
+ declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %vec1, <vscale x 4 x i32> %vec2, i32 %imm)
+
+Overview:
+"""""""""
+
+The '``llvm.experimental.vector.splice.*``' intrinsics construct a vector by
+concatenating elements from the first input vector with elements of the second
+input vector, returning a vector of the same type as the input vectors. The
+signed immediate, modulo the number of elements in the vector, is the index
+into the first vector from which to extract the result value. This means
+conceptually that for a positive immediate, a vector is extracted from
+``concat(%vec1, %vec2)`` starting at index ``imm``, whereas for a negative
+immediate, it extracts ``-imm`` trailing elements from the first vector, and
+the remaining elements from ``%vec2``.
+
+These intrinsics work for both fixed and scalable vectors. While this intrinsic
+is marked as experimental, the recommended way to express this operation for
+fixed-width vectors is still to use a shufflevector, as that may allow for more
+optimization opportunities.
+
+For example:
+
+.. code-block:: text
+
+ llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E> ; index
+ llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -3) ==> <B, C, D, E> ; trailing elements
+
+
+Arguments:
+""""""""""
+
+The first two operands are vectors with the same type. The third argument
+``imm`` is the start index, modulo VL, where VL is the runtime vector length of
+the source/result vector. The ``imm`` is a signed integer constant in the range
+``-VL <= imm < VL``. For values outside of this range the result is poison.
+
Matrix Intrinsics
-----------------
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index d3448369db02..8cd73951de8f 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -556,6 +556,18 @@ enum NodeType {
/// in terms of the element size of VEC1/VEC2, not in terms of bytes.
VECTOR_SHUFFLE,
+ /// VECTOR_SPLICE(VEC1, VEC2, IMM) - Returns a subvector of the same type as
+ /// VEC1/VEC2 from CONCAT_VECTORS(VEC1, VEC2), based on the IMM in two ways.
+ /// Let the result type be T, if IMM is positive it represents the starting
+ /// element number (an index) from which a subvector of type T is extracted
+ /// from CONCAT_VECTORS(VEC1, VEC2). If IMM is negative it represents a count
+ /// specifying the number of trailing elements to extract from VEC1, where the
+ /// elements of T are selected using the following algorithm:
+ /// RESULT[i] = CONCAT_VECTORS(VEC1,VEC2)[VEC1.ElementCount - ABS(IMM) + i]
+ /// If IMM is not in the range [-VL, VL-1] the result vector is undefined. IMM
+ /// is a constant integer.
+ VECTOR_SPLICE,
+
/// SCALAR_TO_VECTOR(VAL) - This represents the operation of loading a
/// scalar value into element 0 of the resultant vector type. The top
/// elements 1 to N-1 of the N-element vector are undefined. The type
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 5d090f232113..0d2453a778a4 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -4511,6 +4511,10 @@ class TargetLowering : public TargetLoweringBase {
/// Returns true if the expansion was successful.
bool expandREM(SDNode *Node, SDValue &Result, SelectionDAG &DAG) const;
+ /// Method for building the DAG expansion of ISD::VECTOR_SPLICE. This
+ /// method accepts vectors as its arguments.
+ SDValue expandVectorSplice(SDNode *Node, SelectionDAG &DAG) const;
+
//===--------------------------------------------------------------------===//
// Instruction Emitting Hooks
//
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index b3c66b8e9fef..668877c5f592 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -1659,6 +1659,13 @@ def int_experimental_vector_extract : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[llvm_anyvector_ty, llvm_i64_ty],
[IntrNoMem, ImmArg<ArgIndex<1>>]>;
+//===---------- Named shufflevector intrinsics ------===//
+def int_experimental_vector_splice : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>,
+ LLVMMatchType<0>,
+ llvm_i32_ty],
+ [IntrNoMem, ImmArg<ArgIndex<2>>]>;
+
//===----------------------------------------------------------------------===//
//===----------------------------------------------------------------------===//
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index b612de31beb8..247ac68034b2 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -241,6 +241,9 @@ def SDTMaskedLoad: SDTypeProfile<1, 4, [ // masked load
def SDTVecShuffle : SDTypeProfile<1, 2, [
SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>
]>;
+def SDTVecSlice : SDTypeProfile<1, 3, [ // vector splice
+ SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>, SDTCisInt<3>
+]>;
def SDTVecExtract : SDTypeProfile<1, 2, [ // vector extract
SDTCisEltOfVec<0, 1>, SDTCisPtrTy<2>
]>;
@@ -655,6 +658,7 @@ def ist : SDNode<"ISD::STORE" , SDTIStore,
def vector_shuffle : SDNode<"ISD::VECTOR_SHUFFLE", SDTVecShuffle, []>;
def vector_reverse : SDNode<"ISD::VECTOR_REVERSE", SDTVecReverse>;
+def vector_splice : SDNode<"ISD::VECTOR_SPLICE", SDTVecSlice, []>;
def build_vector : SDNode<"ISD::BUILD_VECTOR", SDTypeProfile<1, -1, []>, []>;
def splat_vector : SDNode<"ISD::SPLAT_VECTOR", SDTypeProfile<1, 1, []>, []>;
def scalar_to_vector : SDNode<"ISD::SCALAR_TO_VECTOR", SDTypeProfile<1, 1, []>,
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
index a2d5c528b59e..5308bc983a3a 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
@@ -3208,6 +3208,10 @@ bool SelectionDAGLegalize::ExpandNode(SDNode *Node) {
Results.push_back(Tmp1);
break;
}
+ case ISD::VECTOR_SPLICE: {
+ Results.push_back(TLI.expandVectorSplice(Node, DAG));
+ break;
+ }
case ISD::EXTRACT_ELEMENT: {
EVT OpTy = Node->getOperand(0).getValueType();
if (cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue()) {
@@ -4715,7 +4719,14 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
Results.push_back(Tmp1);
break;
}
-
+ case ISD::VECTOR_SPLICE: {
+ Tmp1 = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, Node->getOperand(0));
+ Tmp2 = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, Node->getOperand(1));
+ Tmp3 = DAG.getNode(ISD::VECTOR_SPLICE, dl, NVT, Tmp1, Tmp2,
+ Node->getOperand(2));
+ Results.push_back(DAG.getNode(ISD::TRUNCATE, dl, OVT, Tmp3));
+ break;
+ }
case ISD::SELECT_CC: {
SDValue Cond = Node->getOperand(4);
ISD::CondCode CCCode = cast<CondCodeSDNode>(Cond)->get();
@@ -4753,7 +4764,6 @@ void SelectionDAGLegalize::PromoteNode(SDNode *Node) {
Results.push_back(Tmp1);
break;
}
-
case ISD::SETCC:
case ISD::STRICT_FSETCC:
case ISD::STRICT_FSETCCS: {
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 315be27796b2..3a33d1e7d0c8 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -100,6 +100,8 @@ void DAGTypeLegalizer::PromoteIntegerResult(SDNode *N, unsigned ResNo) {
Res = PromoteIntRes_VECTOR_REVERSE(N); break;
case ISD::VECTOR_SHUFFLE:
Res = PromoteIntRes_VECTOR_SHUFFLE(N); break;
+ case ISD::VECTOR_SPLICE:
+ Res = PromoteIntRes_VECTOR_SPLICE(N); break;
case ISD::INSERT_VECTOR_ELT:
Res = PromoteIntRes_INSERT_VECTOR_ELT(N); break;
case ISD::BUILD_VECTOR:
@@ -4616,6 +4618,15 @@ SDValue DAGTypeLegalizer::ExpandIntOp_ATOMIC_STORE(SDNode *N) {
return Swap.getValue(1);
}
+SDValue DAGTypeLegalizer::PromoteIntRes_VECTOR_SPLICE(SDNode *N) {
+ SDLoc dl(N);
+
+ SDValue V0 = GetPromotedInteger(N->getOperand(0));
+ SDValue V1 = GetPromotedInteger(N->getOperand(1));
+ EVT OutVT = V0.getValueType();
+
+ return DAG.getNode(ISD::VECTOR_SPLICE, dl, OutVT, V0, V1, N->getOperand(2));
+}
SDValue DAGTypeLegalizer::PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N) {
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index 8e52ba8e46f0..b38dc14763f5 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -300,6 +300,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N);
SDValue PromoteIntRes_VECTOR_REVERSE(SDNode *N);
SDValue PromoteIntRes_VECTOR_SHUFFLE(SDNode *N);
+ SDValue PromoteIntRes_VECTOR_SPLICE(SDNode *N);
SDValue PromoteIntRes_BUILD_VECTOR(SDNode *N);
SDValue PromoteIntRes_SCALAR_TO_VECTOR(SDNode *N);
SDValue PromoteIntRes_SPLAT_VECTOR(SDNode *N);
@@ -838,6 +839,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
void SplitVecRes_VECTOR_REVERSE(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,
SDValue &Hi);
+ void SplitVecRes_VECTOR_SPLICE(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VAARG(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_FP_TO_XINT_SAT(SDNode *N, SDValue &Lo, SDValue &Hi);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index 87db530dd461..92d9daa99c9f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -947,6 +947,9 @@ void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
case ISD::VECTOR_SHUFFLE:
SplitVecRes_VECTOR_SHUFFLE(cast<ShuffleVectorSDNode>(N), Lo, Hi);
break;
+ case ISD::VECTOR_SPLICE:
+ SplitVecRes_VECTOR_SPLICE(N, Lo, Hi);
+ break;
case ISD::VAARG:
SplitVecRes_VAARG(N, Lo, Hi);
break;
@@ -1257,7 +1260,7 @@ void DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR(SDNode *N, SDValue &Lo,
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
Hi = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, HiVT, Vec,
- DAG.getVectorIdxConstant(IdxVal + LoVT.getVectorNumElements(), dl));
+ DAG.getVectorIdxConstant(IdxVal + LoVT.getVectorMinNumElements(), dl));
}
void DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo,
@@ -5519,3 +5522,19 @@ void DAGTypeLegalizer::SplitVecRes_VECTOR_REVERSE(SDNode *N, SDValue &Lo,
Lo = DAG.getNode(ISD::VECTOR_REVERSE, DL, InHi.getValueType(), InHi);
Hi = DAG.getNode(ISD::VECTOR_REVERSE, DL, InLo.getValueType(), InLo);
}
+
+void DAGTypeLegalizer::SplitVecRes_VECTOR_SPLICE(SDNode *N, SDValue &Lo,
+ SDValue &Hi) {
+ EVT VT = N->getValueType(0);
+ SDLoc DL(N);
+
+ EVT LoVT, HiVT;
+ std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(VT);
+
+ SDValue Expanded = TLI.expandVectorSplice(N, DAG);
+ Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, LoVT, Expanded,
+ DAG.getVectorIdxConstant(0, DL));
+ Hi =
+ DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HiVT, Expanded,
+ DAG.getVectorIdxConstant(LoVT.getVectorMinNumElements(), DL));
+}
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 1fe6c34a6726..6b3ccc3dc0e2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -7105,6 +7105,9 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
case Intrinsic::experimental_vector_reverse:
visitVectorReverse(I);
return;
+ case Intrinsic::experimental_vector_splice:
+ visitVectorSplice(I);
+ return;
}
}
@@ -10956,3 +10959,37 @@ void SelectionDAGBuilder::visitFreeze(const FreezeInst &I) {
setValue(&I, DAG.getNode(ISD::MERGE_VALUES, getCurSDLoc(),
DAG.getVTList(ValueVTs), Values));
}
+
+void SelectionDAGBuilder::visitVectorSplice(const CallInst &I) {
+ const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+ EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());
+
+ SDLoc DL = getCurSDLoc();
+ SDValue V1 = getValue(I.getOperand(0));
+ SDValue V2 = getValue(I.getOperand(1));
+ int64_t Imm = cast<ConstantInt>(I.getOperand(2))->getSExtValue();
+
+ // VECTOR_SHUFFLE doesn't support a scalable mask so use a dedicated node.
+ if (VT.isScalableVector()) {
+ MVT IdxVT = TLI.getVectorIdxTy(DAG.getDataLayout());
+ setValue(&I, DAG.getNode(ISD::VECTOR_SPLICE, DL, VT, V1, V2,
+ DAG.getConstant(Imm, DL, IdxVT)));
+ return;
+ }
+
+ unsigned NumElts = VT.getVectorNumElements();
+
+ if ((-Imm > NumElts) || (Imm >= NumElts)) {
+ // Result is undefined if immediate is out-of-bounds.
+ setValue(&I, DAG.getUNDEF(VT));
+ return;
+ }
+
+ uint64_t Idx = (NumElts + Imm) % NumElts;
+
+ // Use VECTOR_SHUFFLE to maintain original behaviour for fixed-length vectors.
+ SmallVector<int, 8> Mask;
+ for (unsigned i = 0; i < NumElts; ++i)
+ Mask.push_back(Idx + i);
+ setValue(&I, DAG.getVectorShuffle(VT, DL, V1, V2, Mask));
+}
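As a worked example of the fixed-width path above: for <4 x i32> operands
with Imm = -3, Idx = (4 + -3) % 4 = 1, giving the shuffle mask
<1, 2, 3, 4>, i.e. the last three elements of the first operand followed
by the first element of the second, the same result as the index form
with Imm = 1.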
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
index 9065358902c9..a759f8babe33 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
@@ -778,6 +778,7 @@ class SelectionDAGBuilder {
void visitVectorReduce(const CallInst &I, unsigned Intrinsic);
void visitVectorReverse(const CallInst &I);
+ void visitVectorSplice(const CallInst &I);
void visitUserOp1(const Instruction &I) {
llvm_unreachable("UserOp1 should not exist at instruction selection time!");
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 20f29048179c..079227412215 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -288,6 +288,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
case ISD::EXTRACT_SUBVECTOR: return "extract_subvector";
case ISD::SCALAR_TO_VECTOR: return "scalar_to_vector";
case ISD::VECTOR_SHUFFLE: return "vector_shuffle";
+ case ISD::VECTOR_SPLICE: return "vector_splice";
case ISD::SPLAT_VECTOR: return "splat_vector";
case ISD::VECTOR_REVERSE: return "vector_reverse";
case ISD::CARRY_FALSE: return "carry_false";
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 11276540bf49..b02a65e91ff3 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -8625,3 +8625,76 @@ SDValue TargetLowering::expandFP_TO_INT_SAT(SDNode *Node,
SDValue ZeroInt = DAG.getConstant(0, dl, DstVT);
return DAG.getSelectCC(dl, Src, Src, ZeroInt, Select, ISD::CondCode::SETUO);
}
+
+SDValue TargetLowering::expandVectorSplice(SDNode *Node,
+ SelectionDAG &DAG) const {
+ assert(Node->getOpcode() == ISD::VECTOR_SPLICE && "Unexpected opcode!");
+ assert(Node->getValueType(0).isScalableVector() &&
+ "Fixed length vector types expected to use SHUFFLE_VECTOR!");
+
+ EVT VT = Node->getValueType(0);
+ SDValue V1 = Node->getOperand(0);
+ SDValue V2 = Node->getOperand(1);
+ int64_t Imm = cast<ConstantSDNode>(Node->getOperand(2))->getSExtValue();
+ SDLoc DL(Node);
+
+ // Expand through memory thusly:
+ // Alloca CONCAT_VECTORS_TYPES(V1, V2) Ptr
+ // Store V1, Ptr
+ // Store V2, Ptr + sizeof(V1)
+ // If (Imm < 0)
+ // TrailingElts = -Imm
+ // Ptr = Ptr + sizeof(V1) - (TrailingElts * sizeof(VT.Elt))
+ // else
+ // Ptr = Ptr + (Imm * sizeof(VT.Elt))
+ // Res = Load Ptr
+
+ Align Alignment = DAG.getReducedAlign(VT, /*UseABI=*/false);
+
+ EVT MemVT = EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(),
+ VT.getVectorElementCount() * 2);
+ SDValue StackPtr = DAG.CreateStackTemporary(MemVT.getStoreSize(), Alignment);
+ EVT PtrVT = StackPtr.getValueType();
+ auto &MF = DAG.getMachineFunction();
+ auto FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
+ auto PtrInfo = MachinePointerInfo::getFixedStack(MF, FrameIndex);
+
+ // Store the lo part of CONCAT_VECTORS(V1, V2)
+ SDValue StoreV1 = DAG.getStore(DAG.getEntryNode(), DL, V1, StackPtr, PtrInfo);
+ // Store the hi part of CONCAT_VECTORS(V1, V2)
+ SDValue OffsetToV2 = DAG.getVScale(
+ DL, PtrVT,
+ APInt(PtrVT.getFixedSizeInBits(), VT.getStoreSize().getKnownMinSize()));
+ SDValue StackPtr2 = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, OffsetToV2);
+ SDValue StoreV2 = DAG.getStore(StoreV1, DL, V2, StackPtr2, PtrInfo);
+
+ if (Imm >= 0) {
+ // Load back the required element. getVectorElementPointer takes care of
+ // clamping the index if it's out-of-bounds.
+ StackPtr = getVectorElementPointer(DAG, StackPtr, VT, Node->getOperand(2));
+ // Load the spliced result
+ return DAG.getLoad(VT, DL, StoreV2, StackPtr,
+ MachinePointerInfo::getUnknownStack(MF));
+ }
+
+ uint64_t TrailingElts = -Imm;
+
+ // NOTE: TrailingElts must be clamped so as not to read outside of V1:V2.
+ TypeSize EltByteSize = VT.getVectorElementType().getStoreSize();
+ SDValue TrailingBytes =
+ DAG.getConstant(TrailingElts * EltByteSize, DL, PtrVT);
+
+ if (TrailingElts > VT.getVectorMinNumElements()) {
+ SDValue VLBytes = DAG.getVScale(
+ DL, PtrVT,
+ APInt(PtrVT.getFixedSizeInBits(), VT.getStoreSize().getKnownMinSize()));
+ TrailingBytes = DAG.getNode(ISD::UMIN, DL, PtrVT, TrailingBytes, VLBytes);
+ }
+
+ // Calculate the start address of the spliced result.
+ StackPtr2 = DAG.getNode(ISD::SUB, DL, PtrVT, StackPtr2, TrailingBytes);
+
+ // Load the spliced result
+ return DAG.getLoad(VT, DL, StoreV2, StackPtr2,
+ MachinePointerInfo::getUnknownStack(MF));
+}
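As a concrete illustration of the expansion above: for a
<vscale x 4 x i32> splice with Imm = -3 and vscale = 2, each operand
occupies 32 bytes, so V1 is stored at offset 0 of the stack temporary
and V2 at offset 32. With TrailingElts = 3 the load starts at byte
32 - 3*4 = 20, yielding the last three elements of V1 followed by the
first five elements of V2.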
diff --git a/llvm/lib/CodeGen/TargetLoweringBase.cpp b/llvm/lib/CodeGen/TargetLoweringBase.cpp
index 61c562852f2d..9ec94892c0d2 100644
--- a/llvm/lib/CodeGen/TargetLoweringBase.cpp
+++ b/llvm/lib/CodeGen/TargetLoweringBase.cpp
@@ -849,6 +849,9 @@ void TargetLoweringBase::initActions() {
setOperationAction(ISD::VECREDUCE_FMIN, VT, Expand);
setOperationAction(ISD::VECREDUCE_SEQ_FADD, VT, Expand);
setOperationAction(ISD::VECREDUCE_SEQ_FMUL, VT, Expand);
+
+ // Named vector shuffles default to expand.
+ setOperationAction(ISD::VECTOR_SPLICE, VT, Expand);
}
// Most targets ignore the @llvm.prefetch intrinsic.
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 67ec46f49671..aff46b92c4cf 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -1108,6 +1108,7 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
setOperationAction(ISD::MUL, VT, Custom);
setOperationAction(ISD::SPLAT_VECTOR, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);
+ setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::SDIV, VT, Custom);
setOperationAction(ISD::UDIV, VT, Custom);
setOperationAction(ISD::SMIN, VT, Custom);
@@ -1276,6 +1277,11 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
for (auto VT : {MVT::v4f16, MVT::v8f16, MVT::v4f32})
setOperationAction(ISD::VECREDUCE_FADD, VT, Custom);
}
+
+ setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv2i1, MVT::nxv2i64);
+ setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv4i1, MVT::nxv4i32);
+ setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv8i1, MVT::nxv8i16);
+ setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv16i1, MVT::nxv16i8);
}
PredictableSelectIsExpensive = Subtarget->predictableSelectIsExpensive();
diff --git a/llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll b/llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll
new file mode 100644
index 000000000000..0993ca783b5d
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll
@@ -0,0 +1,142 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -verify-machineinstrs < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+;
+; VECTOR_SPLICE (index)
+;
+
+define <16 x i8> @splice_v16i8_idx(<16 x i8> %a, <16 x i8> %b) #0 {
+; CHECK-LABEL: splice_v16i8_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #1
+; CHECK-NEXT: ret
+ %res = call <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8> %a, <16 x i8> %b, i32 1)
+ ret <16 x i8> %res
+}
+
+define <2 x double> @splice_v2f64_idx(<2 x double> %a, <2 x double> %b) #0 {
+; CHECK-LABEL: splice_v2f64_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
+; CHECK-NEXT: ret
+ %res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 1)
+ ret <2 x double> %res
+}
+
+; Verify promote type legalisation works as expected.
+define <2 x i8> @splice_v2i8_idx(<2 x i8> %a, <2 x i8> %b) #0 {
+; CHECK-LABEL: splice_v2i8_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.8b, v0.8b, v1.8b, #4
+; CHECK-NEXT: ret
+ %res = call <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8> %a, <2 x i8> %b, i32 1)
+ ret <2 x i8> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <8 x i32> @splice_v8i32_idx(<8 x i32> %a, <8 x i32> %b) #0 {
+; CHECK-LABEL: splice_v8i32_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #4
+; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #4
+; CHECK-NEXT: ret
+ %res = call <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32> %a, <8 x i32> %b, i32 5)
+ ret <8 x i32> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <16 x float> @splice_v16f32_idx(<16 x float> %a, <16 x float> %b) #0 {
+; CHECK-LABEL: splice_v16f32_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #12
+; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #12
+; CHECK-NEXT: ext v2.16b, v3.16b, v4.16b, #12
+; CHECK-NEXT: ext v3.16b, v4.16b, v5.16b, #12
+; CHECK-NEXT: ret
+ %res = call <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float> %a, <16 x float> %b, i32 7)
+ ret <16 x float> %res
+}
+
+; Verify out-of-bounds index results in undef vector.
+define <2 x double> @splice_v2f64_idx_out_of_bounds(<2 x double> %a, <2 x double> %b) #0 {
+; CHECK-LABEL: splice_v2f64_idx_out_of_bounds:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ret
+ %res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 2)
+ ret <2 x double> %res
+}
+
+;
+; VECTOR_SPLICE (trailing elements)
+;
+
+define <16 x i8> @splice_v16i8(<16 x i8> %a, <16 x i8> %b) #0 {
+; CHECK-LABEL: splice_v16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #1
+; CHECK-NEXT: ret
+ %res = call <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8> %a, <16 x i8> %b, i32 -15)
+ ret <16 x i8> %res
+}
+
+define <2 x double> @splice_v2f64(<2 x double> %a, <2 x double> %b) #0 {
+; CHECK-LABEL: splice_v2f64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
+; CHECK-NEXT: ret
+ %res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 -1)
+ ret <2 x double> %res
+}
+
+; Verify promote type legalisation works as expected.
+define <2 x i8> @splice_v2i8(<2 x i8> %a, <2 x i8> %b) #0 {
+; CHECK-LABEL: splice_v2i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.8b, v0.8b, v1.8b, #4
+; CHECK-NEXT: ret
+ %res = call <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8> %a, <2 x i8> %b, i32 -1)
+ ret <2 x i8> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <8 x i32> @splice_v8i32(<8 x i32> %a, <8 x i32> %b) #0 {
+; CHECK-LABEL: splice_v8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #4
+; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #4
+; CHECK-NEXT: ret
+ %res = call <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32> %a, <8 x i32> %b, i32 -3)
+ ret <8 x i32> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <16 x float> @splice_v16f32(<16 x float> %a, <16 x float> %b) #0 {
+; CHECK-LABEL: splice_v16f32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #12
+; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #12
+; CHECK-NEXT: ext v2.16b, v3.16b, v4.16b, #12
+; CHECK-NEXT: ext v3.16b, v4.16b, v5.16b, #12
+; CHECK-NEXT: ret
+ %res = call <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float> %a, <16 x float> %b, i32 -9)
+ ret <16 x float> %res
+}
+
+; Verify out-of-bounds trailing element count results in undef vector.
+define <2 x double> @splice_v2f64_out_of_bounds(<2 x double> %a, <2 x double> %b) #0 {
+; CHECK-LABEL: splice_v2f64_out_of_bounds:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ret
+ %res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 -3)
+ ret <2 x double> %res
+}
+
+declare <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8>, <2 x i8>, i32)
+declare <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8>, <16 x i8>, i32)
+declare <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32>, <8 x i32>, i32)
+declare <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float>, <16 x float>, i32)
+declare <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double>, <2 x double>, i32)
+
+attributes #0 = { nounwind "target-features"="+neon" }
diff --git a/llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll b/llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll
new file mode 100644
index 000000000000..ab8818a1ad10
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll
@@ -0,0 +1,1310 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc -verify-machineinstrs < %s | FileCheck %s
+
+target triple = "aarch64-unknown-linux-gnu"
+
+;
+; VECTOR_SPLICE (index)
+;
+
+define <vscale x 16 x i8> @splice_nxv16i8_first_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 0)
+ ret <vscale x 16 x i8> %res
+}
+
+define <vscale x 16 x i8> @splice_nxv16i8_last_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: mov w10, #15
+; CHECK-NEXT: cmp x9, #15 // =15
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 15)
+ ret <vscale x 16 x i8> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 16 x i8> @splice_nxv16i8_clamped_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: mov w10, #16
+; CHECK-NEXT: cmp x9, #16 // =16
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 16)
+ ret <vscale x 16 x i8> %res
+}
+
+define <vscale x 8 x i16> @splice_nxv8i16_first_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 0)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 8 x i16> @splice_nxv8i16_last_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #7
+; CHECK-NEXT: cmp x10, #7 // =7
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 7)
+ ret <vscale x 8 x i16> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 8 x i16> @splice_nxv8i16_clamped_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #8
+; CHECK-NEXT: cmp x10, #8 // =8
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 8)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 4 x i32> @splice_nxv4i32_first_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 0)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 4 x i32> @splice_nxv4i32_last_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #3
+; CHECK-NEXT: cmp x10, #3 // =3
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 3)
+ ret <vscale x 4 x i32> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 4 x i32> @splice_nxv4i32_clamped_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #4
+; CHECK-NEXT: cmp x10, #4 // =4
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 4)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 2 x i64> @splice_nxv2i64_first_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 0)
+ ret <vscale x 2 x i64> %res
+}
+
+define <vscale x 2 x i64> @splice_nxv2i64_last_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #1 // =1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csinc x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 1)
+ ret <vscale x 2 x i64> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 2 x i64> @splice_nxv2i64_clamped_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #2
+; CHECK-NEXT: cmp x10, #2 // =2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 2)
+ ret <vscale x 2 x i64> %res
+}
+
+define <vscale x 8 x half> @splice_nxv8f16_first_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 0)
+ ret <vscale x 8 x half> %res
+}
+
+define <vscale x 8 x half> @splice_nxv8f16_last_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #7
+; CHECK-NEXT: cmp x10, #7 // =7
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 7)
+ ret <vscale x 8 x half> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 8 x half> @splice_nxv8f16_clamped_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #8
+; CHECK-NEXT: cmp x10, #8 // =8
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 8)
+ ret <vscale x 8 x half> %res
+}
+
+define <vscale x 4 x float> @splice_nxv4f32_first_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 0)
+ ret <vscale x 4 x float> %res
+}
+
+define <vscale x 4 x float> @splice_nxv4f32_last_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #3
+; CHECK-NEXT: cmp x10, #3 // =3
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 3)
+ ret <vscale x 4 x float> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 4 x float> @splice_nxv4f32_clamped_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #4
+; CHECK-NEXT: cmp x10, #4 // =4
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 4)
+ ret <vscale x 4 x float> %res
+}
+
+define <vscale x 2 x double> @splice_nxv2f64_first_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64_first_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #0 // =0
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 0)
+ ret <vscale x 2 x double> %res
+}
+
+define <vscale x 2 x double> @splice_nxv2f64_last_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64_last_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #1 // =1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csinc x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 1)
+ ret <vscale x 2 x double> %res
+}
+
+; Ensure index is clamped when we cannot prove it's less than VL-1.
+define <vscale x 2 x double> @splice_nxv2f64_clamped_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #2
+; CHECK-NEXT: cmp x10, #2 // =2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 2)
+ ret <vscale x 2 x double> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 2 x i1> @splice_nxv2i1_idx(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv2i1_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: cmp x9, #1 // =1
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: mov z0.d, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csinc x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: and z0.d, z0.d, #0x1
+; CHECK-NEXT: cmpne p0.d, p0/z, z0.d, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b, i32 1)
+ ret <vscale x 2 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 4 x i1> @splice_nxv4i1_idx(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv4i1_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntw x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov w9, #2
+; CHECK-NEXT: cmp x10, #2 // =2
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: mov z0.s, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z0.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: and z0.s, z0.s, #0x1
+; CHECK-NEXT: cmpne p0.s, p0/z, z0.s, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b, i32 2)
+ ret <vscale x 4 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 8 x i1> @splice_nxv8i1_idx(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv8i1_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov w9, #4
+; CHECK-NEXT: cmp x10, #4 // =4
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: mov z0.h, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1h { z0.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #1
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: and z0.h, z0.h, #0x1
+; CHECK-NEXT: cmpne p0.h, p0/z, z0.h, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b, i32 4)
+ ret <vscale x 8 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 16 x i1> @splice_nxv16i1_idx(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv16i1_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov w10, #8
+; CHECK-NEXT: cmp x9, #8 // =8
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: mov z0.b, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: st1b { z0.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: and z0.b, z0.b, #0x1
+; CHECK-NEXT: cmpne p0.b, p0/z, z0.b, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 8)
+ ret <vscale x 16 x i1> %res
+}
+
+; Verify promote type legalisation works as expected.
+define <vscale x 2 x i8> @splice_nxv2i8_idx(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv2i8_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: cntd x9
+; CHECK-NEXT: sub x9, x9, #1 // =1
+; CHECK-NEXT: cmp x9, #1 // =1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csinc x9, x9, xzr, lo
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #3
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b, i32 1)
+ ret <vscale x 2 x i8> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <vscale x 8 x i32> @splice_nxv8i32_idx(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv8i32_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-4
+; CHECK-NEXT: cnth x10
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #2
+; CHECK-NEXT: cmp x10, #2 // =2
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
+; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
+; CHECK-NEXT: orr x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
+; CHECK-NEXT: addvl sp, sp, #4
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b, i32 2)
+ ret <vscale x 8 x i32> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <vscale x 16 x float> @splice_nxv16f32_clamped_idx(<vscale x 16 x float> %a, <vscale x 16 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv16f32_clamped_idx:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-8
+; CHECK-NEXT: rdvl x10, #1
+; CHECK-NEXT: sub x10, x10, #1 // =1
+; CHECK-NEXT: mov w9, #16
+; CHECK-NEXT: cmp x10, #16 // =16
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: csel x9, x10, x9, lo
+; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
+; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z7.s }, p0, [x8, #7, mul vl]
+; CHECK-NEXT: st1w { z4.s }, p0, [x8, #4, mul vl]
+; CHECK-NEXT: st1w { z5.s }, p0, [x8, #5, mul vl]
+; CHECK-NEXT: st1w { z6.s }, p0, [x8, #6, mul vl]
+; CHECK-NEXT: add x8, x8, x9, lsl #2
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
+; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8, #2, mul vl]
+; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8, #3, mul vl]
+; CHECK-NEXT: addvl sp, sp, #8
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float> %a, <vscale x 16 x float> %b, i32 16)
+ ret <vscale x 16 x float> %res
+}
+
+;
+; VECTOR_SPLICE (trailing elements)
+;
+
+define <vscale x 16 x i8> @splice_nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -16)
+ ret <vscale x 16 x i8> %res
+}
+
+define <vscale x 16 x i8> @splice_nxv16i8_1(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #1 // =1
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -1)
+ ret <vscale x 16 x i8> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 16 x i8> @splice_nxv16i8_clamped(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv16i8_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #17
+; CHECK-NEXT: cmp x9, #17 // =17
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -17)
+ ret <vscale x 16 x i8> %res
+}
+
+define <vscale x 8 x i16> @splice_nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -8)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 8 x i16> @splice_nxv8i16_1(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #2 // =2
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -1)
+ ret <vscale x 8 x i16> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 8 x i16> @splice_nxv8i16_clamped(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
+; CHECK-LABEL: splice_nxv8i16_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #18
+; CHECK-NEXT: cmp x9, #18 // =18
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -9)
+ ret <vscale x 8 x i16> %res
+}
+
+define <vscale x 4 x i32> @splice_nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -4)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 4 x i32> @splice_nxv4i32_1(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #4 // =4
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -1)
+ ret <vscale x 4 x i32> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 4 x i32> @splice_nxv4i32_clamped(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv4i32_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #20
+; CHECK-NEXT: cmp x9, #20 // =20
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -5)
+ ret <vscale x 4 x i32> %res
+}
+
+define <vscale x 2 x i64> @splice_nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -2)
+ ret <vscale x 2 x i64> %res
+}
+
+define <vscale x 2 x i64> @splice_nxv2i64_1(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #8 // =8
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -1)
+ ret <vscale x 2 x i64> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 2 x i64> @splice_nxv2i64_clamped(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
+; CHECK-LABEL: splice_nxv2i64_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #24
+; CHECK-NEXT: cmp x9, #24 // =24
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -3)
+ ret <vscale x 2 x i64> %res
+}
+
+define <vscale x 8 x half> @splice_nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -8)
+ ret <vscale x 8 x half> %res
+}
+
+define <vscale x 8 x half> @splice_nxv8f16_1(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #2 // =2
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -1)
+ ret <vscale x 8 x half> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 8 x half> @splice_nxv8f16_clamped(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
+; CHECK-LABEL: splice_nxv8f16_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #18
+; CHECK-NEXT: cmp x9, #18 // =18
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -9)
+ ret <vscale x 8 x half> %res
+}
+
+define <vscale x 4 x float> @splice_nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -4)
+ ret <vscale x 4 x float> %res
+}
+
+define <vscale x 4 x float> @splice_nxv4f32_1(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #4 // =4
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -1)
+ ret <vscale x 4 x float> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 4 x float> @splice_nxv4f32_clamped(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv4f32_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #20
+; CHECK-NEXT: cmp x9, #20 // =20
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -5)
+ ret <vscale x 4 x float> %res
+}
+
+define <vscale x 2 x double> @splice_nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -2)
+ ret <vscale x 2 x double> %res
+}
+
+define <vscale x 2 x double> @splice_nxv2f64_1(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64_1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #8 // =8
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -1)
+ ret <vscale x 2 x double> %res
+}
+
+; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
+define <vscale x 2 x double> @splice_nxv2f64_clamped(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
+; CHECK-LABEL: splice_nxv2f64_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: rdvl x9, #1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #24
+; CHECK-NEXT: cmp x9, #24 // =24
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -3)
+ ret <vscale x 2 x double> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 2 x i1> @splice_nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv2i1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov z1.d, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #8 // =8
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: and z0.d, z0.d, #0x1
+; CHECK-NEXT: cmpne p0.d, p0/z, z0.d, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b, i32 -1)
+ ret <vscale x 2 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 4 x i1> @splice_nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv4i1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov z1.s, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #4 // =4
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: and z0.s, z0.s, #0x1
+; CHECK-NEXT: cmpne p0.s, p0/z, z0.s, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b, i32 -1)
+ ret <vscale x 4 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 8 x i1> @splice_nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv8i1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.h
+; CHECK-NEXT: mov z1.h, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1h { z0.h }, p0, [sp]
+; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #2 // =2
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
+; CHECK-NEXT: and z0.h, z0.h, #0x1
+; CHECK-NEXT: cmpne p0.h, p0/z, z0.h, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b, i32 -1)
+ ret <vscale x 8 x i1> %res
+}
+
+; Ensure predicate based splice is promoted to use ZPRs.
+define <vscale x 16 x i1> @splice_nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b) #0 {
+; CHECK-LABEL: splice_nxv16i1:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT: ptrue p0.b
+; CHECK-NEXT: mov z1.b, p1/z, #1 // =0x1
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1b { z0.b }, p0, [sp]
+; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #1 // =1
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
+; CHECK-NEXT: and z0.b, z0.b, #0x1
+; CHECK-NEXT: cmpne p0.b, p0/z, z0.b, #0
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 -1)
+ ret <vscale x 16 x i1> %res
+}
+
+; Verify promote type legalisation works as expected.
+define <vscale x 2 x i8> @splice_nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b) #0 {
+; CHECK-LABEL: splice_nxv2i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-2
+; CHECK-NEXT: ptrue p0.d
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1d { z0.d }, p0, [sp]
+; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: addvl x8, x8, #1
+; CHECK-NEXT: sub x8, x8, #16 // =16
+; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
+; CHECK-NEXT: addvl sp, sp, #2
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b, i32 -2)
+ ret <vscale x 2 x i8> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <vscale x 8 x i32> @splice_nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b) #0 {
+; CHECK-LABEL: splice_nxv8i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-4
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
+; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
+; CHECK-NEXT: addvl x8, x8, #2
+; CHECK-NEXT: sub x8, x8, #32 // =32
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
+; CHECK-NEXT: addvl sp, sp, #4
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b, i32 -8)
+ ret <vscale x 8 x i32> %res
+}
+
+; Verify splitvec type legalisation works as expected.
+define <vscale x 16 x float> @splice_nxv16f32_clamped(<vscale x 16 x float> %a, <vscale x 16 x float> %b) #0 {
+; CHECK-LABEL: splice_nxv16f32_clamped:
+; CHECK: // %bb.0:
+; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT: addvl sp, sp, #-8
+; CHECK-NEXT: rdvl x9, #4
+; CHECK-NEXT: ptrue p0.s
+; CHECK-NEXT: mov x8, sp
+; CHECK-NEXT: mov w10, #68
+; CHECK-NEXT: cmp x9, #68 // =68
+; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
+; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
+; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
+; CHECK-NEXT: st1w { z0.s }, p0, [sp]
+; CHECK-NEXT: st1w { z7.s }, p0, [x8, #7, mul vl]
+; CHECK-NEXT: st1w { z4.s }, p0, [x8, #4, mul vl]
+; CHECK-NEXT: st1w { z5.s }, p0, [x8, #5, mul vl]
+; CHECK-NEXT: st1w { z6.s }, p0, [x8, #6, mul vl]
+; CHECK-NEXT: addvl x8, x8, #4
+; CHECK-NEXT: csel x9, x9, x10, lo
+; CHECK-NEXT: sub x8, x8, x9
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
+; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
+; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8, #2, mul vl]
+; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8, #3, mul vl]
+; CHECK-NEXT: addvl sp, sp, #8
+; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT: ret
+ %res = call <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float> %a, <vscale x 16 x float> %b, i32 -17)
+ ret <vscale x 16 x float> %res
+}
+
+declare <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1>, <vscale x 2 x i1>, i32)
+declare <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1>, <vscale x 4 x i1>, i32)
+declare <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1>, <vscale x 8 x i1>, i32)
+declare <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1>, <vscale x 16 x i1>, i32)
+declare <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8>, <vscale x 2 x i8>, i32)
+declare <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i8>, i32)
+declare <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i16>, i32)
+declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, i32)
+declare <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32>, <vscale x 8 x i32>, i32)
+declare <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i64>, i32)
+declare <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half>, <vscale x 8 x half>, i32)
+declare <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float>, <vscale x 4 x float>, i32)
+declare <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float>, <vscale x 16 x float>, i32)
+declare <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double>, <vscale x 2 x double>, i32)
+
+attributes #0 = { nounwind "target-features"="+sve" }