[llvm] [IR][AArch64] Add llvm.masked.speculative.load intrinsic (PR #156470)
Graham Hunter via llvm-commits
llvm-commits at lists.llvm.org
Tue Sep 2 08:02:34 PDT 2025
https://github.com/huntergr-arm created https://github.com/llvm/llvm-project/pull/156470
In order to support loading from addresses which may not be valid at
runtime, without generating faults, we introduce the speculative load
intrinsic. Loading with this intrinsic will only generate a fault for an
invalid access to the first element of the vector. Any subsequent
fault will be suppressed and the corresponding data will be poison.
This PR contains both target-independent and AArch64-specific codegen
for the intrinsic.
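For illustration, here is a minimal, hypothetical IR sketch of how the intrinsic
could be called and its two results consumed. The function name and the final
select are illustrative only and are not taken from the patch or its tests:

  declare { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr, i32, <vscale x 4 x i1>)

  define <vscale x 4 x i32> @demo(ptr %p, <vscale x 4 x i1> %mask) {
    %r = call { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr %p, i32 4, <vscale x 4 x i1> %mask)
    %data = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i1> } %r, 0
    %ok = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i1> } %r, 1
    ; Lanes not marked valid in the output mask may be poison, so mask them
    ; out before using the data.
    %safe = select <vscale x 4 x i1> %ok, <vscale x 4 x i32> %data, <vscale x 4 x i32> zeroinitializer
    ret <vscale x 4 x i32> %safe
  }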
From edd26c4387086800fdca1b9f240444f37b8a7a08 Mon Sep 17 00:00:00 2001
From: Graham Hunter <graham.hunter at arm.com>
Date: Fri, 8 Aug 2025 11:06:12 +0000
Subject: [PATCH] [IR][AArch64] Add llvm.masked.speculative.load intrinsic
In order to support loading from addresses which may not be valid at
runtime, without generating faults, we introduce the speculative load
intrinsic. Loading with this intrinsic will only generate a fault for an
invalid access to the first element of the vector. Any subsequent
fault will be suppressed and the corresponding data will be poison.
This PR contains both target-independent and AArch64-specific codegen
for the intrinsic.
---
llvm/docs/LangRef.rst | 59 ++++++
.../llvm/Analysis/TargetTransformInfo.h | 5 +
.../llvm/Analysis/TargetTransformInfoImpl.h | 6 +
llvm/include/llvm/CodeGen/ISDOpcodes.h | 7 +
llvm/include/llvm/CodeGen/SelectionDAG.h | 4 +
llvm/include/llvm/CodeGen/SelectionDAGNodes.h | 17 ++
llvm/include/llvm/IR/Intrinsics.td | 8 +
llvm/lib/Analysis/TargetTransformInfo.cpp | 8 +
.../SelectionDAG/LegalizeIntegerTypes.cpp | 27 +++
llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h | 5 +
.../SelectionDAG/LegalizeVectorTypes.cpp | 81 ++++++++
.../lib/CodeGen/SelectionDAG/SelectionDAG.cpp | 28 +++
.../SelectionDAG/SelectionDAGBuilder.cpp | 38 ++++
.../SelectionDAG/SelectionDAGBuilder.h | 1 +
.../SelectionDAG/SelectionDAGDumper.cpp | 3 +
llvm/lib/CodeGen/TargetLoweringBase.cpp | 2 +
.../Target/AArch64/AArch64ISelLowering.cpp | 46 ++++-
llvm/lib/Target/AArch64/AArch64ISelLowering.h | 1 +
.../AArch64/AArch64TargetTransformInfo.h | 17 ++
.../Scalar/ScalarizeMaskedMemIntrin.cpp | 76 +++++++
.../AArch64/masked-speculative-load.ll | 191 ++++++++++++++++++
21 files changed, 628 insertions(+), 2 deletions(-)
create mode 100644 llvm/test/CodeGen/AArch64/masked-speculative-load.ll
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index d32666678caf1..2ae8978b19f63 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -26685,6 +26685,65 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
+.. _int_mspecload:
+
+'``llvm.masked.speculative.load.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
+
+::
+
+ declare { <16 x float>, <16 x i1> } @llvm.masked.speculative.load.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
+ declare { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
+ ;; The data is a vector of pointers
+ declare { <8 x ptr>, <8 x i1> } @llvm.masked.speculative.load.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Reads a vector from memory according to the provided mask, suppressing faults
+for any lane beyond the first. The mask holds a bit for each vector lane, and
+is used to prevent memory accesses to the masked-off lanes. Inactive lanes will
+be zero in the result vector.
+
+Returns the loaded data and a mask indicating which lanes are valid. The
+output mask may not be the same as the input mask if the processor
+encountered a reason to avoid loading a given lane.
+
+Arguments:
+""""""""""
+
+The first argument is the base pointer for the load. The second argument is the
+alignment of the source location. It must be a power of two constant integer
+value. The third argument, mask, is a vector of boolean values with the same
+number of elements as the return type.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.speculative.load``' intrinsic is similar to the
+'``llvm.masked.load``' intrinsic, in that it conditionally loads values from
+memory into a vector based on a mask. However, it allows loading from addresses
+which may not be entirely safe. If the memory corresponding to the first element
+of the vector is inaccessible, then a fault will be raised as normal. For all
+subsequent lanes, faults will be suppressed and the corresponding bit in the
+output mask will be marked inactive. The remaining elements in the output mask
+after a suppressed fault will also be marked inactive. Elements with active bits
+in the input mask will be poison values if the corresponding bit is inactive in
+the output mask.
+
+The reasons for marking output elements inactive are processor-dependent. A
+lane may be marked inactive due to a genuine fault, e.g. if the range of the
+data being loaded spans a page boundary and the page at the higher address is
+not mapped, but a given processor may also mark elements inactive for other
+reasons, such as a cache miss. Code using this intrinsic must not assume that
+inactive lanes signal the end of accessible memory. If more data should be
+loaded based on the semantics of the user code, then the base pointer should be
+advanced to the address of the first inactive element and a new speculative load
+attempted.
Memory Use Markers
------------------
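As an illustration of the retry pattern described in the final paragraph above
(a hypothetical sketch, not part of the LangRef change): with an all-true input
mask the active lanes of the output mask form a prefix, so counting them gives
the number of elements by which to advance the base pointer before issuing the
next speculative load.

  declare { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr, i32, <vscale x 4 x i1>)
  declare i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32>)

  define ptr @advance_past_valid_lanes(ptr %p) {
    %r = call { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr %p, i32 4, <vscale x 4 x i1> splat (i1 true))
    %ok = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i1> } %r, 1
    ; With an all-true input mask the output mask is a prefix of active lanes,
    ; so summing the zero-extended mask bits counts the valid elements.
    %ok.ext = zext <vscale x 4 x i1> %ok to <vscale x 4 x i32>
    %nvalid = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> %ok.ext)
    ; Advance the base pointer past the valid elements for the next attempt.
    %p.next = getelementptr i32, ptr %p, i32 %nvalid
    ret ptr %p.next
  }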
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index af78e0c1e4799..56ddab64bd64f 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -840,6 +840,11 @@ class TargetTransformInfo {
/// Return true if the target supports masked expand load.
LLVM_ABI bool isLegalMaskedExpandLoad(Type *DataType, Align Alignment) const;
+ /// Return true if the target supports masked speculative load.
+ LLVM_ABI bool isLegalMaskedSpeculativeLoad(Type *DataType, Align Alignment,
+ unsigned AddressSpace,
+ bool AllTrueMask) const;
+
/// Return true if the target supports strided load.
LLVM_ABI bool isLegalStridedLoadStore(Type *DataType, Align Alignment) const;
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 9c2ebb1891cac..27e57987919bc 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -370,6 +370,12 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalMaskedSpeculativeLoad(Type *DataType, Align Alignment,
+ unsigned AddressSpace,
+ bool AllTrueMask) const {
+ return false;
+ }
+
virtual bool isLegalStridedLoadStore(Type *DataType, Align Alignment) const {
return false;
}
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 465e4a0a9d0d8..e9bc63b4fad52 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -1558,6 +1558,13 @@ enum NodeType {
// bits conform to getBooleanContents similar to the SETCC operator.
GET_ACTIVE_LANE_MASK,
+ /// Represents the llvm.masked.speculative.load intrinsic. Performs a load
+ /// where only the first lane may generate a fault; faults on any subsequent
+ /// lane are suppressed. Returns the loaded data and a mask to indicate which
+ /// lanes are valid. Invalid lanes are poison data.
+ /// Operands: Chain, Base, Mask
+ MASKED_SPECULATIVE_LOAD,
+
// llvm.clear_cache intrinsic
// Operands: Input Chain, Start Addres, End Address
// Outputs: Output Chain
diff --git a/llvm/include/llvm/CodeGen/SelectionDAG.h b/llvm/include/llvm/CodeGen/SelectionDAG.h
index 8a834315646a1..b40c5e7e83c68 100644
--- a/llvm/include/llvm/CodeGen/SelectionDAG.h
+++ b/llvm/include/llvm/CodeGen/SelectionDAG.h
@@ -1663,6 +1663,10 @@ class SelectionDAG {
MachineMemOperand *MMO,
ISD::MemIndexType IndexType,
bool IsTruncating = false);
+ LLVM_ABI SDValue getMaskedSpeculativeLoad(SDVTList VTs, EVT MemVT,
+ const SDLoc &dl,
+ ArrayRef<SDValue> Ops,
+ MachineMemOperand *MMO);
LLVM_ABI SDValue getMaskedHistogram(SDVTList VTs, EVT MemVT, const SDLoc &dl,
ArrayRef<SDValue> Ops,
MachineMemOperand *MMO,
diff --git a/llvm/include/llvm/CodeGen/SelectionDAGNodes.h b/llvm/include/llvm/CodeGen/SelectionDAGNodes.h
index 65528b3050fe5..1bcf9f47e36c1 100644
--- a/llvm/include/llvm/CodeGen/SelectionDAGNodes.h
+++ b/llvm/include/llvm/CodeGen/SelectionDAGNodes.h
@@ -585,6 +585,7 @@ BEGIN_TWO_BYTE_PACK()
friend class MaskedLoadSDNode;
friend class MaskedGatherSDNode;
friend class VPGatherSDNode;
+ friend class MaskedSpeculativeLoadSDNode;
friend class MaskedHistogramSDNode;
uint16_t : NumLSBaseSDNodeBits;
@@ -3099,6 +3100,22 @@ class MaskedHistogramSDNode : public MaskedGatherScatterSDNode {
}
};
+class MaskedSpeculativeLoadSDNode : public MemSDNode {
+public:
+ friend class SelectionDAG;
+
+ MaskedSpeculativeLoadSDNode(unsigned Order, const DebugLoc &DL, SDVTList VTs,
+ EVT MemVT, MachineMemOperand *MMO)
+ : MemSDNode(ISD::MASKED_SPECULATIVE_LOAD, Order, DL, VTs, MemVT, MMO) {}
+
+ const SDValue &getBasePtr() const { return getOperand(1); }
+ const SDValue &getMask() const { return getOperand(2); }
+
+ static bool classof(const SDNode *N) {
+ return N->getOpcode() == ISD::MASKED_SPECULATIVE_LOAD;
+ }
+};
+
class VPLoadFFSDNode : public MemSDNode {
public:
friend class SelectionDAG;
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index e0ee12391b31d..86c665b087eb0 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2508,6 +2508,14 @@ def int_masked_compressstore:
[IntrWriteMem, IntrArgMemOnly,
NoCapture<ArgIndex<1>>]>;
+def int_masked_speculative_load
+ : DefaultAttrsIntrinsic<[llvm_anyvector_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [llvm_anyptr_ty, llvm_i32_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrReadMem, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>]>;
+
def int_experimental_vector_compress:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b4fa0d5964cb6..50282545e9f1b 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -526,6 +526,14 @@ bool TargetTransformInfo::isLegalMaskedExpandLoad(Type *DataType,
return TTIImpl->isLegalMaskedExpandLoad(DataType, Alignment);
}
+bool TargetTransformInfo::isLegalMaskedSpeculativeLoad(Type *DataType,
+ Align Alignment,
+ unsigned AddressSpace,
+ bool AllTrueMask) const {
+ return TTIImpl->isLegalMaskedSpeculativeLoad(DataType, Alignment,
+ AddressSpace, AllTrueMask);
+}
+
bool TargetTransformInfo::isLegalStridedLoadStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalStridedLoadStore(DataType, Alignment);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
index 90d62e6da8e94..3830181b55ed8 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
@@ -91,6 +91,10 @@ void DAGTypeLegalizer::PromoteIntegerResult(SDNode *N, unsigned ResNo) {
break;
case ISD::MGATHER: Res = PromoteIntRes_MGATHER(cast<MaskedGatherSDNode>(N));
break;
+ case ISD::MASKED_SPECULATIVE_LOAD:
+ Res = PromoteIntRes_MASKED_SPECULATIVE_LOAD(
+ cast<MaskedSpeculativeLoadSDNode>(N), ResNo);
+ break;
case ISD::VECTOR_COMPRESS:
Res = PromoteIntRes_VECTOR_COMPRESS(N);
break;
@@ -1041,6 +1045,29 @@ SDValue DAGTypeLegalizer::PromoteIntRes_MGATHER(MaskedGatherSDNode *N) {
return Res;
}
+SDValue DAGTypeLegalizer::PromoteIntRes_MASKED_SPECULATIVE_LOAD(
+ MaskedSpeculativeLoadSDNode *N, unsigned ResNo) {
+ EVT DataVT = N->getValueType(0);
+ EVT MaskVT = N->getValueType(1);
+
+ if (ResNo == 1) {
+ MaskVT = TLI.getTypeToTransformTo(*DAG.getContext(), MaskVT);
+ } else {
+ assert(ResNo == 0 && "Tried to promote an unexpected result");
+ DataVT = TLI.getTypeToTransformTo(*DAG.getContext(), DataVT);
+ }
+
+ SDLoc dl(N);
+ SDValue Res = DAG.getMaskedSpeculativeLoad(
+ DAG.getVTList(DataVT, MaskVT, MVT::Other), N->getMemoryVT(), dl,
+ {N->getChain(), N->getBasePtr(), N->getMask()}, N->getMemOperand());
+
+ // Use the updated mask and chain values.
+ ReplaceValueWith(SDValue(N, 1), Res.getValue(1));
+ ReplaceValueWith(SDValue(N, 2), Res.getValue(2));
+ return Res;
+}
+
SDValue DAGTypeLegalizer::PromoteIntRes_VECTOR_COMPRESS(SDNode *N) {
SDValue Vec = GetPromotedInteger(N->getOperand(0));
SDValue Passthru = GetPromotedInteger(N->getOperand(2));
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
index 65fd863e55ac9..d043bc93d5c06 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h
@@ -341,6 +341,8 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue PromoteIntRes_VP_LOAD(VPLoadSDNode *N);
SDValue PromoteIntRes_MLOAD(MaskedLoadSDNode *N);
SDValue PromoteIntRes_MGATHER(MaskedGatherSDNode *N);
+ SDValue PromoteIntRes_MASKED_SPECULATIVE_LOAD(MaskedSpeculativeLoadSDNode *N,
+ unsigned ResNo);
SDValue PromoteIntRes_VECTOR_COMPRESS(SDNode *N);
SDValue PromoteIntRes_Overflow(SDNode *N);
SDValue PromoteIntRes_FFREXP(SDNode *N);
@@ -979,6 +981,8 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
void SplitVecRes_MLOAD(MaskedLoadSDNode *MLD, SDValue &Lo, SDValue &Hi);
void SplitVecRes_Gather(MemSDNode *VPGT, SDValue &Lo, SDValue &Hi,
bool SplitSETCC = false);
+ void SplitVecRes_MASKED_SPECULATIVE_LOAD(MaskedSpeculativeLoadSDNode *N,
+ SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_COMPRESS(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_ScalarOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VP_SPLAT(SDNode *N, SDValue &Lo, SDValue &Hi);
@@ -1084,6 +1088,7 @@ class LLVM_LIBRARY_VISIBILITY DAGTypeLegalizer {
SDValue WidenVecRes_MLOAD(MaskedLoadSDNode* N);
SDValue WidenVecRes_MGATHER(MaskedGatherSDNode* N);
SDValue WidenVecRes_VP_GATHER(VPGatherSDNode* N);
+ SDValue WidenVecRes_MASKED_SPECULATIVE_LOAD(MaskedSpeculativeLoadSDNode *N);
SDValue WidenVecRes_ScalarOp(SDNode* N);
SDValue WidenVecRes_Select(SDNode *N);
SDValue WidenVSELECTMask(SDNode *N);
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
index e8f6167f45572..d827ba3a43e1f 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
@@ -1206,6 +1206,10 @@ void DAGTypeLegalizer::SplitVectorResult(SDNode *N, unsigned ResNo) {
case ISD::VP_GATHER:
SplitVecRes_Gather(cast<MemSDNode>(N), Lo, Hi, /*SplitSETCC*/ true);
break;
+ case ISD::MASKED_SPECULATIVE_LOAD:
+ SplitVecRes_MASKED_SPECULATIVE_LOAD(cast<MaskedSpeculativeLoadSDNode>(N),
+ Lo, Hi);
+ break;
case ISD::VECTOR_COMPRESS:
SplitVecRes_VECTOR_COMPRESS(N, Lo, Hi);
break;
@@ -2566,6 +2570,55 @@ void DAGTypeLegalizer::SplitVecRes_Gather(MemSDNode *N, SDValue &Lo,
ReplaceValueWith(SDValue(N, 1), Ch);
}
+void DAGTypeLegalizer::SplitVecRes_MASKED_SPECULATIVE_LOAD(
+ MaskedSpeculativeLoadSDNode *N, SDValue &Lo, SDValue &Hi) {
+
+ SDLoc dl(N);
+ auto [LoDVT, HiDVT] = DAG.GetSplitDestVTs(N->getValueType(0));
+ EVT MaskVT = N->getValueType(1);
+ auto [LoMVT, HiMVT] = DAG.GetSplitDestVTs(MaskVT);
+
+ SDValue Chain = N->getChain();
+ SDValue BasePtr = N->getBasePtr();
+ SDValue Mask = N->getMask();
+ Align Alignment = N->getBaseAlign();
+
+ // Split Mask operand
+ SDValue MaskLo, MaskHi;
+ if (getTypeAction(Mask.getValueType()) == TargetLowering::TypeSplitVector)
+ GetSplitVector(Mask, MaskLo, MaskHi);
+ else
+ std::tie(MaskLo, MaskHi) = DAG.SplitVector(Mask, dl);
+
+ EVT MemoryVT = N->getMemoryVT();
+ bool HiIsEmpty = false;
+ auto [LoMemVT, HiMemVT] =
+ DAG.GetDependentSplitDestVTs(MemoryVT, LoDVT, &HiIsEmpty);
+
+ MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+ N->getPointerInfo(), MachineMemOperand::MOLoad,
+ LocationSize::beforeOrAfterPointer(), Alignment, N->getAAInfo(),
+ N->getRanges());
+
+ Lo = DAG.getMaskedSpeculativeLoad(DAG.getVTList(LoDVT, LoMVT, MVT::Other),
+ LoMemVT, dl, {Chain, BasePtr, MaskLo}, MMO);
+
+ // Hi half can't just use another speculative load, since that would introduce
+ // a potentially faulting lane in the middle of the overall speculative load.
+ // So generate poison data and an all-false mask.
+ Hi = DAG.getSplat(HiDVT, dl, DAG.getPOISON(HiDVT.getVectorElementType()));
+ SDValue FalseMask = DAG.getSplat(
+ HiMVT, dl, DAG.getConstant(0, dl, HiMVT.getVectorElementType()));
+
+ // We need to combine the split output masks into one for the replacement.
+ SDValue OutMask =
+ DAG.getNode(ISD::CONCAT_VECTORS, dl, MaskVT, Lo.getValue(1), FalseMask);
+
+ // Update mask and chain outputs.
+ ReplaceValueWith(SDValue(N, 1), OutMask);
+ ReplaceValueWith(SDValue(N, 2), Lo.getValue(2));
+}
+
void DAGTypeLegalizer::SplitVecRes_VECTOR_COMPRESS(SDNode *N, SDValue &Lo,
SDValue &Hi) {
// This is not "trivial", as there is a dependency between the two subvectors.
@@ -4840,6 +4893,10 @@ void DAGTypeLegalizer::WidenVectorResult(SDNode *N, unsigned ResNo) {
case ISD::VP_GATHER:
Res = WidenVecRes_VP_GATHER(cast<VPGatherSDNode>(N));
break;
+ case ISD::MASKED_SPECULATIVE_LOAD:
+ Res = WidenVecRes_MASKED_SPECULATIVE_LOAD(
+ cast<MaskedSpeculativeLoadSDNode>(N));
+ break;
case ISD::VECTOR_REVERSE:
Res = WidenVecRes_VECTOR_REVERSE(N);
break;
@@ -6466,6 +6523,30 @@ SDValue DAGTypeLegalizer::WidenVecRes_VP_GATHER(VPGatherSDNode *N) {
return Res;
}
+SDValue DAGTypeLegalizer::WidenVecRes_MASKED_SPECULATIVE_LOAD(
+ MaskedSpeculativeLoadSDNode *N) {
+ EVT DataVT = N->getValueType(0);
+ EVT WideDataVT = TLI.getTypeToTransformTo(*DAG.getContext(), DataVT);
+ SDValue Mask = N->getMask();
+ EVT MaskVT = Mask.getValueType();
+ SDLoc dl(N);
+
+ EVT WideMaskVT =
+ EVT::getVectorVT(*DAG.getContext(), MaskVT.getVectorElementType(),
+ WideDataVT.getVectorElementCount());
+
+ Mask = DAG.getInsertSubvector(dl, DAG.getPOISON(WideMaskVT), Mask, 0);
+ SDValue Load = DAG.getMaskedSpeculativeLoad(
+ DAG.getVTList(WideDataVT, WideMaskVT, MVT::Other), N->getMemoryVT(), dl,
+ {N->getChain(), N->getBasePtr(), Mask}, N->getMemOperand());
+
+ // Update mask and chain outputs.
+ ReplaceValueWith(SDValue(N, 1), Load.getValue(1));
+ ReplaceValueWith(SDValue(N, 2), Load.getValue(2));
+
+ return Load;
+}
+
SDValue DAGTypeLegalizer::WidenVecRes_ScalarOp(SDNode *N) {
EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
if (N->isVPOpcode())
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index 9668d253d52ae..42d6af15a6d81 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -10572,6 +10572,34 @@ SDValue SelectionDAG::getMaskedScatter(SDVTList VTs, EVT MemVT, const SDLoc &dl,
return V;
}
+SDValue SelectionDAG::getMaskedSpeculativeLoad(SDVTList VTs, EVT MemVT,
+ const SDLoc &dl,
+ ArrayRef<SDValue> Ops,
+ MachineMemOperand *MMO) {
+ // FIXME: Should we include extra addressing features like mload?
+ FoldingSetNodeID ID;
+ AddNodeIDNode(ID, ISD::MASKED_SPECULATIVE_LOAD, VTs, Ops);
+ ID.AddInteger(MemVT.getRawBits());
+ ID.AddInteger(getSyntheticNodeSubclassData<MaskedSpeculativeLoadSDNode>(
+ dl.getIROrder(), VTs, MemVT, MMO));
+ ID.AddInteger(MMO->getPointerInfo().getAddrSpace());
+ ID.AddInteger(MMO->getFlags());
+ void *IP = nullptr;
+ if (SDNode *E = FindNodeOrInsertPos(ID, dl, IP)) {
+ cast<MaskedSpeculativeLoadSDNode>(E)->refineAlignment(MMO);
+ return SDValue(E, 0);
+ }
+ auto *N = newSDNode<MaskedSpeculativeLoadSDNode>(
+ dl.getIROrder(), dl.getDebugLoc(), VTs, MemVT, MMO);
+ createOperands(N, Ops);
+
+ CSEMap.InsertNode(N, IP);
+ InsertNode(N);
+ SDValue V(N, 0);
+ NewSDValueDbgMsg(V, "Creating new node: ", this);
+ return V;
+}
+
SDValue SelectionDAG::getMaskedHistogram(SDVTList VTs, EVT MemVT,
const SDLoc &dl, ArrayRef<SDValue> Ops,
MachineMemOperand *MMO,
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 5ccd58c069c9f..35393540f70bc 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -5119,6 +5119,41 @@ void SelectionDAGBuilder::visitMaskedGather(const CallInst &I) {
setValue(&I, Gather);
}
+void SelectionDAGBuilder::visitMaskedSpeculativeLoad(const CallInst &I) {
+ SDLoc sdl = getCurSDLoc();
+ Value *PtrVal = I.getArgOperand(0);
+ SDValue Ptr = getValue(PtrVal);
+ Align Alignment = cast<ConstantInt>(I.getArgOperand(1))->getAlignValue();
+ SDValue Mask = getValue(I.getArgOperand(2));
+
+ StructType *RetTy = cast<StructType>(I.getType());
+
+ EVT DataVT = EVT::getEVT(RetTy->getElementType(0));
+ EVT MaskVT = EVT::getEVT(RetTy->getElementType(1));
+ AAMDNodes AAInfo = I.getAAMetadata();
+ const MDNode *Ranges = getRangeMetadata(I);
+
+ MemoryLocation ML = MemoryLocation::getAfter(PtrVal, AAInfo);
+ bool AddToChain = !BatchAA || !BatchAA->pointsToConstantMemory(ML);
+
+ SDValue InChain = AddToChain ? DAG.getRoot() : DAG.getEntryNode();
+ auto MMOFlags = MachineMemOperand::MOLoad;
+ MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+ MachinePointerInfo(PtrVal), MMOFlags,
+ LocationSize::beforeOrAfterPointer(), Alignment, AAInfo, Ranges);
+
+ const auto &TLI = DAG.getTargetLoweringInfo();
+ const auto &TTI =
+ TLI.getTargetMachine().getTargetTransformInfo(*I.getFunction());
+
+ SDValue Ops[3] = {InChain, Ptr, Mask};
+ SDValue Res = DAG.getMaskedSpeculativeLoad(
+ DAG.getVTList(DataVT, MaskVT, MVT::Other), DataVT, sdl, Ops, MMO);
+ if (AddToChain)
+ PendingLoads.push_back(Res.getValue(2));
+ setValue(&I, Res);
+}
+
void SelectionDAGBuilder::visitAtomicCmpXchg(const AtomicCmpXchgInst &I) {
SDLoc dl = getCurSDLoc();
AtomicOrdering SuccessOrdering = I.getSuccessOrdering();
@@ -6769,6 +6804,9 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
case Intrinsic::masked_compressstore:
visitMaskedStore(I, true /* IsCompressing */);
return;
+ case Intrinsic::masked_speculative_load:
+ visitMaskedSpeculativeLoad(I);
+ return;
case Intrinsic::powi:
setValue(&I, ExpandPowI(sdl, getValue(I.getArgOperand(0)),
getValue(I.getArgOperand(1)), DAG));
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
index e0835e6310357..2a64c5c739be2 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
@@ -598,6 +598,7 @@ class SelectionDAGBuilder {
void visitMaskedStore(const CallInst &I, bool IsCompressing = false);
void visitMaskedGather(const CallInst &I);
void visitMaskedScatter(const CallInst &I);
+ void visitMaskedSpeculativeLoad(const CallInst &I);
void visitAtomicCmpXchg(const AtomicCmpXchgInst &I);
void visitAtomicRMW(const AtomicRMWInst &I);
void visitFence(const FenceInst &I);
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 900da7645504f..2a21efd7d964c 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -581,6 +581,9 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
case ISD::GET_ACTIVE_LANE_MASK:
return "get_active_lane_mask";
+ case ISD::MASKED_SPECULATIVE_LOAD:
+ return "masked_speculative_load";
+
case ISD::PARTIAL_REDUCE_UMLA:
return "partial_reduce_umla";
case ISD::PARTIAL_REDUCE_SMLA:
diff --git a/llvm/lib/CodeGen/TargetLoweringBase.cpp b/llvm/lib/CodeGen/TargetLoweringBase.cpp
index 0549a947600dc..a3197774cf52b 100644
--- a/llvm/lib/CodeGen/TargetLoweringBase.cpp
+++ b/llvm/lib/CodeGen/TargetLoweringBase.cpp
@@ -784,6 +784,8 @@ void TargetLoweringBase::initActions() {
setIndexedMaskedStoreAction(IM, VT, Expand);
}
+ setOperationAction(ISD::MASKED_SPECULATIVE_LOAD, VT, Expand);
+
// Most backends expect to see the node which just returns the value loaded.
setOperationAction(ISD::ATOMIC_CMP_SWAP_WITH_SUCCESS, VT, Expand);
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index b7011e0ea1669..32eb9277cf794 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -1161,8 +1161,9 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
ISD::INSERT_VECTOR_ELT, ISD::EXTRACT_VECTOR_ELT,
ISD::VECREDUCE_ADD, ISD::STEP_VECTOR});
- setTargetDAGCombine(
- {ISD::MGATHER, ISD::MSCATTER, ISD::EXPERIMENTAL_VECTOR_HISTOGRAM});
+ setTargetDAGCombine({ISD::MGATHER, ISD::MSCATTER,
+ ISD::EXPERIMENTAL_VECTOR_HISTOGRAM,
+ ISD::MASKED_SPECULATIVE_LOAD});
setTargetDAGCombine(ISD::FP_EXTEND);
@@ -1929,6 +1930,7 @@ AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
MVT::v4i32, MVT::v1i64, MVT::v2i64}) {
setOperationAction(ISD::MGATHER, VT, Custom);
setOperationAction(ISD::MSCATTER, VT, Custom);
+ setOperationAction(ISD::MASKED_SPECULATIVE_LOAD, VT, Custom);
}
for (auto VT : {MVT::nxv2f16, MVT::nxv4f16, MVT::nxv8f16, MVT::nxv2f32,
@@ -6860,6 +6862,44 @@ SDValue AArch64TargetLowering::LowerMLOAD(SDValue Op, SelectionDAG &DAG) const {
return DAG.getMergeValues({Result, Load.getValue(1)}, DL);
}
+SDValue
+AArch64TargetLowering::LowerMASKED_SPECULATIVE_LOAD(SDValue Op,
+ SelectionDAG &DAG) const {
+ SDLoc DL(Op);
+ auto *SpecLoad = cast<MaskedSpeculativeLoadSDNode>(Op);
+ assert(SpecLoad && "Expected custom lowering of a masked speculative load");
+ EVT DataVT = Op->getValueType(0);
+ EVT MaskVT = Op->getValueType(1);
+
+ assert(DataVT.isScalableVT() && MaskVT.isScalableVT() &&
+ "Implement fixed-length masked speculative load");
+
+ SDValue Chain = SpecLoad->getOperand(0);
+ SDValue BasePtr = SpecLoad->getOperand(1);
+ SDValue Mask = SpecLoad->getOperand(2);
+
+ // Set FFR to all-true.
+ SDValue SetFFR =
+ DAG.getNode(ISD::INTRINSIC_VOID, DL, MVT::Other, Chain,
+ DAG.getConstant(Intrinsic::aarch64_sve_setffr, DL, MVT::i64));
+
+ // Perform first-faulting load.
+ SDVTList VTs = DAG.getVTList(DataVT, MVT::Other);
+ SDValue Ops[] = {SetFFR, Mask, BasePtr, DAG.getValueType(DataVT)};
+ SDValue FFLoad = DAG.getNode(AArch64ISD::LDFF1_MERGE_ZERO, DL, VTs, Ops);
+ SDValue FFChain = SDValue(FFLoad.getNode(), 1);
+
+ // Retrieve the FFR.
+ // FIXME: Should this be combined with the input mask here?
+ SDValue RdFF = DAG.getNode(
+ ISD::INTRINSIC_W_CHAIN, DL, {MaskVT, MVT::Other},
+ {FFChain, DAG.getConstant(Intrinsic::aarch64_sve_rdffr_z, DL, MVT::i64),
+ Mask});
+ SDValue RdFFChain = SDValue(RdFF.getNode(), 1);
+
+ return DAG.getMergeValues({FFLoad, RdFF, RdFFChain}, DL);
+}
+
// Custom lower trunc store for v4i8 vectors, since it is promoted to v4i16.
static SDValue LowerTruncateVectorStore(SDLoc DL, StoreSDNode *ST,
EVT VT, EVT MemVT,
@@ -7641,6 +7681,8 @@ SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
!Subtarget->isNeonAvailable()))
return LowerFixedLengthVectorLoadToSVE(Op, DAG);
return LowerLOAD(Op, DAG);
+ case ISD::MASKED_SPECULATIVE_LOAD:
+ return LowerMASKED_SPECULATIVE_LOAD(Op, DAG);
case ISD::ADD:
case ISD::AND:
case ISD::SUB:
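For reference, the lowering above corresponds roughly to the following sequence
of existing SVE intrinsics at the IR level. This is a hedged approximation for
illustration only; the actual lowering builds the ISD nodes directly, and the
explicit predicate-width conversions shown here are an assumption about how the
same sequence would be written in IR:

  declare void @llvm.aarch64.sve.setffr()
  declare <vscale x 4 x i32> @llvm.aarch64.sve.ldff1.nxv4i32(<vscale x 4 x i1>, ptr)
  declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1>)
  declare <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1>)
  declare <vscale x 16 x i1> @llvm.aarch64.sve.rdffr.z(<vscale x 16 x i1>)

  define { <vscale x 4 x i32>, <vscale x 4 x i1> } @lowering_sketch(ptr %p, <vscale x 4 x i1> %mask) "target-features"="+sve" {
    ; Set FFR to all-true, perform the first-faulting load, then read the FFR
    ; back predicated by the input mask to obtain the output mask.
    call void @llvm.aarch64.sve.setffr()
    %data = call <vscale x 4 x i32> @llvm.aarch64.sve.ldff1.nxv4i32(<vscale x 4 x i1> %mask, ptr %p)
    %pg = call <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv4i1(<vscale x 4 x i1> %mask)
    %ffr = call <vscale x 16 x i1> @llvm.aarch64.sve.rdffr.z(<vscale x 16 x i1> %pg)
    %ok = call <vscale x 4 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv4i1(<vscale x 16 x i1> %ffr)
    %r0 = insertvalue { <vscale x 4 x i32>, <vscale x 4 x i1> } poison, <vscale x 4 x i32> %data, 0
    %r1 = insertvalue { <vscale x 4 x i32>, <vscale x 4 x i1> } %r0, <vscale x 4 x i1> %ok, 1
    ret { <vscale x 4 x i32>, <vscale x 4 x i1> } %r1
  }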
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.h b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
index 46738365080f9..6b208d69bd8f7 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.h
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.h
@@ -603,6 +603,7 @@ class AArch64TargetLowering : public TargetLowering {
SDValue LowerMSCATTER(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerMLOAD(SDValue Op, SelectionDAG &DAG) const;
+ SDValue LowerMASKED_SPECULATIVE_LOAD(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerVECTOR_COMPRESS(SDValue Op, SelectionDAG &DAG) const;
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index b994ca74aa222..9796669263d9f 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -341,6 +341,23 @@ class AArch64TTIImpl final : public BasicTTIImplBase<AArch64TTIImpl> {
return isLegalMaskedGatherScatter(DataType);
}
+ bool isLegalMaskedSpeculativeLoad(Type *DataType, Align Alignment,
+ unsigned AddressSpace,
+ bool AllTrueMask) const override {
+ // FIXME: If the load is fully aligned to the size of the vector type and
+ // the mask is all-true, we should be able to use a regular load,
+ // unless the load is >16B in size and MTE is enabled.
+ // FIXME: Support FixedLength SVE masked loads.
+ // FIXME: Use legalization instead of hardcoding 128b?
+ StructType *RetTy = cast<StructType>(DataType);
+ Type *VecTy = RetTy->getElementType(0);
+ // TODO: Are FF/NF loads available for streaming SVE?
+ if (ST->isSVEAvailable() && VecTy->isScalableTy())
+ return is_contained<unsigned>(
+ {8, 16, 32, 64}, VecTy->getScalarType()->getScalarSizeInBits());
+ return false;
+ }
+
bool isLegalBroadcastLoad(Type *ElementTy,
ElementCount NumElements) const override {
// Return true if we can generate a `ld1r` splat load instruction.
diff --git a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
index 42d6680c3cb7d..b0e072d926f9b 100644
--- a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
+++ b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
@@ -1035,6 +1035,68 @@ static void scalarizeMaskedVectorHistogram(const DataLayout &DL, CallInst *CI,
ModifiedDT = true;
}
+static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
+ DomTreeUpdater *DTU,
+ bool &ModifiedDT) {
+ // For a target without speculative/first-faulting load support, we can't
+ // actually scalarize accesses for all lanes. However, lanes beyond the
+ // first may be considered inactive for reasons other than a fault, so for
+ // generic 'scalarization' we can just load the first lane (if the
+ // corresponding input mask bit is active), then mark all other lanes as
+ // inactive in the output mask and embed the first lane into a vector of
+ // poison.
+ Value *Ptr = CI->getArgOperand(0);
+ Value *Align = CI->getArgOperand(1);
+ Value *Mask = CI->getArgOperand(2);
+ StructType *RetTy = cast<StructType>(CI->getType());
+ VectorType *DataTy = cast<VectorType>(RetTy->getElementType(0));
+ VectorType *MaskTy = cast<VectorType>(RetTy->getElementType(1));
+ Type *ScalarTy = DataTy->getScalarType();
+
+ MaybeAlign AlignVal = cast<ConstantInt>(Align)->getMaybeAlignValue();
+
+ IRBuilder<> Builder(CI->getContext());
+ BasicBlock *IfBlock = CI->getParent();
+ Builder.SetInsertPoint(CI);
+ Builder.SetCurrentDebugLocation(CI->getDebugLoc());
+ Value *EmptyMask = Constant::getNullValue(MaskTy);
+ Value *PoisonData = PoisonValue::get(DataTy);
+
+ // FIXME: If the mask is a constant, we can skip the extract.
+ Value *FirstActive =
+ Builder.CreateExtractElement(Mask, 0ul, Twine("first.active"));
+ Instruction *ThenTerm =
+ SplitBlockAndInsertIfThen(FirstActive, CI,
+ /*Unreachable=*/false,
+ /*BranchWeights=*/nullptr, DTU);
+
+ BasicBlock *ThenBlock = ThenTerm->getParent();
+ ThenBlock->setName("speculative.load.first.lane");
+ Builder.SetInsertPoint(ThenBlock->getTerminator());
+ LoadInst *Load = Builder.CreateAlignedLoad(ScalarTy, Ptr, AlignVal);
+ Value *OneLaneData = Builder.CreateInsertElement(PoisonData, Load, 0ul);
+ Value *OneLaneMask = Builder.CreateInsertElement(
+ EmptyMask, Constant::getAllOnesValue(MaskTy->getElementType()), 0ul);
+
+ Builder.SetInsertPoint(CI);
+ PHINode *ResData = Builder.CreatePHI(DataTy, 2);
+ ResData->addIncoming(PoisonData, IfBlock);
+ ResData->addIncoming(OneLaneData, ThenBlock);
+ PHINode *ResMask = Builder.CreatePHI(MaskTy, 2);
+ ResMask->addIncoming(EmptyMask, IfBlock);
+ ResMask->addIncoming(OneLaneMask, ThenBlock);
+
+ Value *Result = PoisonValue::get(RetTy);
+ Result = Builder.CreateInsertValue(Result, ResData, 0ul);
+ Result = Builder.CreateInsertValue(Result, ResMask, 1ul);
+ if (CI->hasName())
+ Result->setName(CI->getName() + ".first.lane.only");
+ CI->getParent()->setName("speculative.result");
+ CI->replaceAllUsesWith(Result);
+ CI->eraseFromParent();
+ ModifiedDT = true;
+}
+
static bool runImpl(Function &F, const TargetTransformInfo &TTI,
DominatorTree *DT) {
std::optional<DomTreeUpdater> DTU;
@@ -1181,8 +1243,22 @@ static bool optimizeCallInst(CallInst *CI, bool &ModifiedDT,
scalarizeMaskedCompressStore(DL, HasBranchDivergence, CI, DTU,
ModifiedDT);
return true;
+ case Intrinsic::masked_speculative_load: {
+ bool AllTrueMask = false;
+ if (Constant *CMask = dyn_cast<Constant>(CI->getArgOperand(2)))
+ AllTrueMask = CMask->isAllOnesValue();
+ if (TTI.isLegalMaskedSpeculativeLoad(
+ CI->getType(),
+ cast<ConstantInt>(CI->getArgOperand(1))->getAlignValue(),
+ cast<PointerType>(CI->getArgOperand(0)->getType())
+ ->getAddressSpace(),
+ AllTrueMask))
+ return false;
+ scalarizeMaskedSpeculativeLoad(DL, CI, DTU, ModifiedDT);
+ return true;
}
}
+ }
return false;
}
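To make the generic scalarization concrete, this is roughly the IR it produces
for a fixed-length <4 x i32> speculative load (a hypothetical sketch, not taken
from the patch's tests):

  define { <4 x i32>, <4 x i1> } @scalarized_sketch(ptr %p, <4 x i1> %mask) {
  entry:
    ; Only the first lane is ever loaded, guarded by its mask bit; every other
    ; lane is reported inactive in the output mask.
    %first.active = extractelement <4 x i1> %mask, i64 0
    br i1 %first.active, label %speculative.load.first.lane, label %speculative.result

  speculative.load.first.lane:
    %ld = load i32, ptr %p, align 4
    %onelane.data = insertelement <4 x i32> poison, i32 %ld, i64 0
    %onelane.mask = insertelement <4 x i1> zeroinitializer, i1 true, i64 0
    br label %speculative.result

  speculative.result:
    %res.data = phi <4 x i32> [ poison, %entry ], [ %onelane.data, %speculative.load.first.lane ]
    %res.mask = phi <4 x i1> [ zeroinitializer, %entry ], [ %onelane.mask, %speculative.load.first.lane ]
    %res.0 = insertvalue { <4 x i32>, <4 x i1> } poison, <4 x i32> %res.data, 0
    %res.1 = insertvalue { <4 x i32>, <4 x i1> } %res.0, <4 x i1> %res.mask, 1
    ret { <4 x i32>, <4 x i1> } %res.1
  }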
diff --git a/llvm/test/CodeGen/AArch64/masked-speculative-load.ll b/llvm/test/CodeGen/AArch64/masked-speculative-load.ll
new file mode 100644
index 0000000000000..b01048276f92e
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-speculative-load.ll
@@ -0,0 +1,191 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
+define { <4 x i32>, <4 x i1> } @speculative_load_v4i32_neon(ptr %p, <4 x i1> %mask) {
+; CHECK-LABEL: speculative_load_v4i32_neon:
+; CHECK: // %bb.0:
+; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT: umov w8, v0.h[0]
+; CHECK-NEXT: tbz w8, #0, .LBB0_2
+; CHECK-NEXT: // %bb.1: // %speculative.load.first.lane
+; CHECK-NEXT: adrp x8, .LCPI0_0
+; CHECK-NEXT: ldr s0, [x0]
+; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
+; CHECK-NEXT: // kill: def $d1 killed $d1 killed $q1
+; CHECK-NEXT: ret
+; CHECK-NEXT: .LBB0_2:
+; CHECK-NEXT: movi v1.2d, #0000000000000000
+; CHECK-NEXT: // implicit-def: $q0
+; CHECK-NEXT: // kill: def $d1 killed $d1 killed $q1
+; CHECK-NEXT: ret
+ %res = call { <4 x i32>, <4 x i1> } @llvm.masked.speculative.load.v4i32.p0(ptr %p, i32 16, <4 x i1> %mask)
+ ret { <4 x i32>, <4 x i1> } %res
+}
+
+;; FIXME: If we know the input mask is all-true and the vector is fully aligned,
+;; we should be able to use a normal NEON load here.
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_fully_aligned_neon(ptr %p) {
+; CHECK-LABEL: speculative_load_v2f64_all_true_fully_aligned_neon:
+; CHECK: // %bb.0: // %speculative.load.first.lane
+; CHECK-NEXT: adrp x8, .LCPI1_0
+; CHECK-NEXT: ldr d0, [x0]
+; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI1_0]
+; CHECK-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 16, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
+
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_partially_aligned_neon(ptr %p) {
+; CHECK-LABEL: speculative_load_v2f64_all_true_partially_aligned_neon:
+; CHECK: // %bb.0: // %speculative.load.first.lane
+; CHECK-NEXT: adrp x8, .LCPI2_0
+; CHECK-NEXT: ldr d0, [x0]
+; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI2_0]
+; CHECK-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 8, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
+
+define { <vscale x 16 x i8>, <vscale x 16 x i1> } @speculative_load_nxv16i8(ptr %p, <vscale x 16 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv16i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1b { z0.b }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 16 x i8>, <vscale x 16 x i1> } @llvm.masked.speculative.load.nxv16i8.p0(ptr %p, i32 1, <vscale x 16 x i1> %mask)
+ ret { <vscale x 16 x i8>, <vscale x 16 x i1> } %res
+}
+
+define { <vscale x 8 x i16>, <vscale x 8 x i1> } @speculative_load_nxv8i16(ptr %p, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv8i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 8 x i16>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8i16.p0(ptr %p, i32 4, <vscale x 8 x i1> %mask)
+ ret { <vscale x 8 x i16>, <vscale x 8 x i1> } %res
+}
+
+define { <vscale x 4 x i32>, <vscale x 4 x i1> } @speculative_load_nxv4i32(ptr %p, <vscale x 4 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv4i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr %p, i32 4, <vscale x 4 x i1> %mask)
+ ret { <vscale x 4 x i32>, <vscale x 4 x i1> } %res
+}
+
+define { <vscale x 2 x i64>, <vscale x 2 x i1> } @speculative_load_nxv2i64(ptr %p, <vscale x 2 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv2i64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 2 x i64>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2i64.p0(ptr %p, i32 8, <vscale x 2 x i1> %mask)
+ ret { <vscale x 2 x i64>, <vscale x 2 x i1> } %res
+}
+
+define { <vscale x 8 x half>, <vscale x 8 x i1> } @speculative_load_nxv8f16(ptr %p, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv8f16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 8 x half>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8f16.p0(ptr %p, i32 4, <vscale x 8 x i1> %mask)
+ ret { <vscale x 8 x half>, <vscale x 8 x i1> } %res
+}
+
+define { <vscale x 8 x bfloat>, <vscale x 8 x i1> } @speculative_load_nxv8bf16(ptr %p, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv8bf16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1h { z0.h }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 8 x bfloat>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8bf16.p0(ptr %p, i32 4, <vscale x 8 x i1> %mask)
+ ret { <vscale x 8 x bfloat>, <vscale x 8 x i1> } %res
+}
+
+define { <vscale x 4 x float>, <vscale x 4 x i1> } @speculative_load_nxv4f32(ptr %p, <vscale x 4 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv4f32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 4 x float>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4f32.p0(ptr %p, i32 4, <vscale x 4 x i1> %mask)
+ ret { <vscale x 4 x float>, <vscale x 4 x i1> } %res
+}
+
+define { <vscale x 2 x double>, <vscale x 2 x i1> } @speculative_load_nxv2f64(ptr %p, <vscale x 2 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv2f64:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 2 x double>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2f64.p0(ptr %p, i32 8, <vscale x 2 x i1> %mask)
+ ret { <vscale x 2 x double>, <vscale x 2 x i1> } %res
+}
+
+;; Test basic legalization.
+
+define { <vscale x 2 x i32>, <vscale x 2 x i1> } @speculative_load_nxv2i32_promote(ptr %p, <vscale x 2 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv2i32_promote:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 2 x i32>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2i32.p0(ptr %p, i32 4, <vscale x 2 x i1> %mask)
+ ret { <vscale x 2 x i32>, <vscale x 2 x i1> } %res
+}
+
+;; FIXME: We can do an AArch64-specific split and use ldnf1w + brkn here.
+define { <vscale x 8 x i32>, <vscale x 8 x i1> } @speculative_load_nxv8i32_split(ptr %p, <vscale x 8 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv8i32_split:
+; CHECK: // %bb.0:
+; CHECK-NEXT: punpklo p0.h, p0.b
+; CHECK-NEXT: setffr
+; CHECK-NEXT: pfalse p1.b
+; CHECK-NEXT: ldff1w { z0.s }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: uzp1 p0.h, p0.h, p1.h
+; CHECK-NEXT: ret
+ %res = call { <vscale x 8 x i32>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8i32.p0(ptr %p, i32 4, <vscale x 8 x i1> %mask)
+ ret { <vscale x 8 x i32>, <vscale x 8 x i1> } %res
+}
+
+define { <vscale x 1 x i64>, <vscale x 1 x i1> } @speculative_load_nxv1i64_widen(ptr %p, <vscale x 1 x i1> %mask) #0 {
+; CHECK-LABEL: speculative_load_nxv1i64_widen:
+; CHECK: // %bb.0:
+; CHECK-NEXT: setffr
+; CHECK-NEXT: ldff1d { z0.d }, p0/z, [x0]
+; CHECK-NEXT: rdffr p0.b, p0/z
+; CHECK-NEXT: ret
+ %res = call { <vscale x 1 x i64>, <vscale x 1 x i1> } @llvm.masked.speculative.load.nxv1i64.p0(ptr %p, i32 8, <vscale x 1 x i1> %mask)
+ ret { <vscale x 1 x i64>, <vscale x 1 x i1> } %res
+}
+
+declare { <4 x i32>, <4 x i1> } @llvm.masked.speculative.load.v4i32.p0(ptr, i32, <4 x i1>)
+declare { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr, i32, <2 x i1>)
+declare { <vscale x 16 x i8>, <vscale x 16 x i1> } @llvm.masked.speculative.load.nxv16i8.p0(ptr, i32, <vscale x 16 x i1>)
+declare { <vscale x 8 x i16>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8i16.p0(ptr, i32, <vscale x 8 x i1>)
+declare { <vscale x 2 x i32>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2i32.p0(ptr, i32, <vscale x 2 x i1>)
+declare { <vscale x 4 x i32>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4i32.p0(ptr, i32, <vscale x 4 x i1>)
+declare { <vscale x 8 x i32>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8i32.p0(ptr, i32, <vscale x 8 x i1>)
+declare { <vscale x 1 x i64>, <vscale x 1 x i1> } @llvm.masked.speculative.load.nxv1i64.p0(ptr, i32, <vscale x 1 x i1>)
+declare { <vscale x 2 x i64>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2i64.p0(ptr, i32, <vscale x 2 x i1>)
+declare { <vscale x 8 x half>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8f16.p0(ptr, i32, <vscale x 8 x i1>)
+declare { <vscale x 8 x bfloat>, <vscale x 8 x i1> } @llvm.masked.speculative.load.nxv8bf16.p0(ptr, i32, <vscale x 8 x i1>)
+declare { <vscale x 4 x float>, <vscale x 4 x i1> } @llvm.masked.speculative.load.nxv4f32.p0(ptr, i32, <vscale x 4 x i1>)
+declare { <vscale x 2 x double>, <vscale x 2 x i1> } @llvm.masked.speculative.load.nxv2f64.p0(ptr, i32, <vscale x 2 x i1>)
+
+
+attributes #0 = { "target-features"="+sve" }