[llvm] [Vectorize] Vectorization for __builtin_prefetch (PR #66160)

via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 3 16:17:46 PST 2024


https://github.com/m-saito-fj updated https://github.com/llvm/llvm-project/pull/66160

>From 278a313e045654b80b5119aecf789420a70f6427 Mon Sep 17 00:00:00 2001
From: Moriyuki Saito <saitou.moriyuki at fujitsu.com>
Date: Tue, 12 Sep 2023 19:24:15 +0900
Subject: [PATCH] [Vectorize] Vectorization for __builtin_prefetch

Allow vectorization of loops containing __builtin_prefetch. Add the
masked_prefetch and masked_gather_prefetch intrinsics for this purpose,
and add handling in LoopVectorize to vectorize the prefetch intrinsic.
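
For reference, a minimal sketch (in C, with an arbitrary function name and
prefetch distance, illustrative only) of the kind of loop this change targets:

    void axpy(double *restrict a, const double *restrict b, double x, int n) {
      for (int i = 0; i < n; ++i) {
        /* The prefetch call previously blocked vectorization; with this patch
           it can be widened to llvm.masked.prefetch, or to
           llvm.masked.gather.prefetch when the addresses are not consecutive. */
        __builtin_prefetch(&b[i + 8], /*rw=*/0, /*locality=*/3);
        a[i] += x * b[i];
      }
    }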
---
 llvm/docs/LangRef.rst                         |  79 ++++++++
 .../llvm/Analysis/TargetTransformInfo.h       |  12 ++
 .../llvm/Analysis/TargetTransformInfoImpl.h   |   8 +
 llvm/include/llvm/CodeGen/BasicTTIImpl.h      |  17 ++
 llvm/include/llvm/IR/IRBuilder.h              |  12 ++
 llvm/include/llvm/IR/IntrinsicInst.h          | 112 +++++++++++
 llvm/include/llvm/IR/Intrinsics.td            |  15 ++
 llvm/lib/Analysis/TargetTransformInfo.cpp     |  10 +
 llvm/lib/Analysis/VectorUtils.cpp             |   1 +
 llvm/lib/IR/IRBuilder.cpp                     |  50 +++++
 .../Vectorize/LoopVectorizationLegality.cpp   |   8 +-
 .../Transforms/Vectorize/LoopVectorize.cpp    | 175 ++++++++++++++----
 llvm/lib/Transforms/Vectorize/VPlan.h         |  15 +-
 .../lib/Transforms/Vectorize/VPlanRecipes.cpp |   8 +-
 14 files changed, 481 insertions(+), 41 deletions(-)

diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index b5918e3063d868..1311606af380ca 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -23893,6 +23893,85 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
 
 Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
 
+Masked Vector Prefetch and Gather Prefetch Intrinsics
+-----------------------------------------------------
+
+LLVM provides intrinsics for predicated vector prefetch and gather prefetch operations. The predicate is specified by a mask operand, which holds one bit per vector pointer element, switching the prefetch of the associated pointer on or off. Masked vector prefetch intrinsics are designed for sequential memory accesses, and masked gather prefetch intrinsics are designed for arbitrary memory accesses.
+
+.. _int_mprefetch:
+
+'``llvm.masked.prefetch.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+       declare void @llvm.masked.prefetch.p0.v16i1(ptr <address>, i32 <element-size>, i32 <rw>, i32 <locality>, <16 x i1> <mask>)
+       declare void @llvm.masked.prefetch.p0.v8i1(ptr <address>, i32 <element-size>, i32 <rw>, i32 <locality>, <8 x i1> <mask>)
+       declare void @llvm.masked.prefetch.p0.v4i1(ptr <address>, i32 <element-size>, i32 <rw>, i32 <locality>, <4 x i1> <mask>)
+       declare void @llvm.masked.prefetch.p0.v2i1(ptr <address>, i32 <element-size>, i32 <rw>, i32 <locality>, <2 x i1> <mask>)
+
+Overview:
+"""""""""
+The '``llvm.masked.prefetch``' intrinsic is a hint to the code generator to insert a masked prefetch instruction if supported.
+Masked prefetches have no effect on the behavior of the program but can change its performance characteristics.
+
+Arguments:
+""""""""""
+address is the address to be prefetched, element-size is the byte size of the data pointed to by address, rw is the specifier determining if the fetch should be for a read (0) or write (1), and locality is a temporal locality specifier ranging from (0) - no locality, to (3) - extremely local, keep in cache. The rw and locality arguments must be constant integers.
+
+Semantics:
+""""""""""
+The '``llvm.masked.prefetch``' intrinsic is designed to prefetch a mask-selected set of consecutive addresses in a single IR operation. It is useful for targets that support vector masked prefetches and allows vectorizing predicated basic blocks on these targets. Other targets may support this intrinsic differently, for example by lowering it into a sequence of branches that guard scalar prefetch operations.
+This intrinsic does not modify the behavior of the program. In particular, prefetches cannot trap and do not produce a value. On targets that support this intrinsic, the prefetch can provide hints to the processor cache for better performance.
+
+.. _int_gather_prefetch:
+
+'``llvm.masked.gather.prefetch.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+        declare void @llvm.masked.gather.prefetch.v2p0(<2 x ptr> <ptrs>, i32 <element-size>, i32 <rw>, i32 <locality>, <2 x i1> <mask>)
+        declare void @llvm.masked.gather.prefetch.v4p0(<4 x ptr> <ptrs>, i32 <element-size>, i32 <rw>, i32 <locality>, <4 x i1> <mask>)
+	
+Overview:
+"""""""""
+The '``llvm.masked.gather.prefetch``' intrinsic is a hint to the code generator to insert a masked gather prefetch instruction if supported.
+Masked gather prefetches have no effect on the behavior of the program but can change its performance characteristics.
+
+Arguments:
+""""""""""
+ptrs is the vector of addresses to be prefetched, element-size is the byte size of the data element pointed to by ptrs, rw is the specifier determining if the fetch should be for a read (0) or write (1), and locality is a temporal locality specifier ranging from (0) - no locality, to (3) - extremely local, keep in cache. The rw and locality arguments must be constant integers.
+
+Semantics:
+""""""""""
+The '``llvm.masked.gather.prefetch``' intrinsic is designed for conditional prefetching of arbitrary memory locations in a single IR operation. It is useful for targets that support vector masked gather prefetches and allows vectorizing basic blocks with data and control divergence. Other targets may support this intrinsic differently, for example by lowering it into a sequence of scalar prefetch operations. The semantics of this operation are equivalent to a sequence of conditional scalar prefetches. The mask restricts prefetching to the selected addresses.
+
+::
+
+       call void @llvm.masked.gather.prefetch.v4p0(<4 x ptr> %ptrs, i32 8, i32 0, i32 3, <4 x i1> <i1 true, i1 true, i1 true, i1 true>)
+
+       ;; The gather prefetch with an all-true mask is equivalent to the following instruction sequence
+       %ptr0 = extractelement <4 x ptr> %ptrs, i32 0
+       %ptr1 = extractelement <4 x ptr> %ptrs, i32 1
+       %ptr2 = extractelement <4 x ptr> %ptrs, i32 2
+       %ptr3 = extractelement <4 x ptr> %ptrs, i32 3
+
+       call void @llvm.prefetch.p0(ptr %ptr0, i32 0, i32 3, i32 1)
+       call void @llvm.prefetch.p0(ptr %ptr1, i32 0, i32 3, i32 1)
+       call void @llvm.prefetch.p0(ptr %ptr2, i32 0, i32 3, i32 1)
+       call void @llvm.prefetch.p0(ptr %ptr3, i32 0, i32 3, i32 1)
+
+This intrinsic does not modify the behavior of the program. In particular, prefetches cannot trap and do not produce a value. On targets that support this intrinsic, the prefetch can provide hints to the processor cache for better performance.
+ 
 
 Memory Use Markers
 ------------------
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 048912beaba5a1..ede941b68fd21b 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -755,6 +755,8 @@ class TargetTransformInfo {
   bool isLegalMaskedStore(Type *DataType, Align Alignment) const;
   /// Return true if the target supports masked load.
   bool isLegalMaskedLoad(Type *DataType, Align Alignment) const;
+  /// Return true if the target supports masked prefetch.
+  bool isLegalMaskedPrefetch(Type *DataType, Align Alignment) const;
 
   /// Return true if the target supports nontemporal store.
   bool isLegalNTStore(Type *DataType, Align Alignment) const;
@@ -769,6 +771,8 @@ class TargetTransformInfo {
   bool isLegalMaskedScatter(Type *DataType, Align Alignment) const;
   /// Return true if the target supports masked gather.
   bool isLegalMaskedGather(Type *DataType, Align Alignment) const;
+  /// Return true if the target supports masked gather prefetch.
+  bool isLegalMaskedGatherPrefetch(Type *DataType, Align Alignment) const;
   /// Return true if the target forces scalarizing of llvm.masked.gather
   /// intrinsics.
   bool forceScalarizeMaskedGather(VectorType *Type, Align Alignment) const;
@@ -1827,12 +1831,14 @@ class TargetTransformInfo::Concept {
     getPreferredAddressingMode(const Loop *L, ScalarEvolution *SE) const = 0;
   virtual bool isLegalMaskedStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedLoad(Type *DataType, Align Alignment) = 0;
+  virtual bool isLegalMaskedPrefetch(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTStore(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalNTLoad(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalBroadcastLoad(Type *ElementTy,
                                     ElementCount NumElements) const = 0;
   virtual bool isLegalMaskedScatter(Type *DataType, Align Alignment) = 0;
   virtual bool isLegalMaskedGather(Type *DataType, Align Alignment) = 0;
+  virtual bool isLegalMaskedGatherPrefetch(Type *DataType, Align Alignment) = 0;
   virtual bool forceScalarizeMaskedGather(VectorType *DataType,
                                           Align Alignment) = 0;
   virtual bool forceScalarizeMaskedScatter(VectorType *DataType,
@@ -2300,6 +2306,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   bool isLegalMaskedLoad(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedLoad(DataType, Alignment);
   }
+  bool isLegalMaskedPrefetch(Type *DataType, Align Alignment) override {
+    return Impl.isLegalMaskedPrefetch(DataType, Alignment);
+  }
   bool isLegalNTStore(Type *DataType, Align Alignment) override {
     return Impl.isLegalNTStore(DataType, Alignment);
   }
@@ -2316,6 +2325,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   bool isLegalMaskedGather(Type *DataType, Align Alignment) override {
     return Impl.isLegalMaskedGather(DataType, Alignment);
   }
+  bool isLegalMaskedGatherPrefetch(Type *DataType, Align Alignment) override {
+    return Impl.isLegalMaskedGatherPrefetch(DataType, Alignment);
+  }
   bool forceScalarizeMaskedGather(VectorType *DataType,
                                   Align Alignment) override {
     return Impl.forceScalarizeMaskedGather(DataType, Alignment);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 7ad3ce512a3552..d8cd1729bfa54b 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -260,6 +260,10 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  bool isLegalMaskedPrefetch(Type *DataType, Align Alignment) const {
+    return false;
+  }
+
   bool isLegalNTStore(Type *DataType, Align Alignment) const {
     // By default, assume nontemporal memory stores are available for stores
     // that are aligned and have a size that is a power of 2.
@@ -286,6 +290,10 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  bool isLegalMaskedGatherPrefetch(Type *DataType, Align Alignment) const {
+    return false;
+  }
+
   bool forceScalarizeMaskedGather(VectorType *DataType, Align Alignment) const {
     return false;
   }
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index 5e7bdcdf72a49f..d61c0067188b14 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1595,6 +1595,16 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
       return thisT()->getGatherScatterOpCost(Instruction::Load, RetTy, Args[0],
                                              VarMask, Alignment, CostKind, I);
     }
+    case Intrinsic::masked_gather_prefetch: {
+      const Value *Mask = Args[4];
+      bool VarMask = !isa<Constant>(Mask);
+      Align Alignment = cast<ConstantInt>(Args[1])->getAlignValue();
+      auto *MaskVT = cast<VectorType>(Mask->getType());
+      auto *PseudoDataTy = MaskVT->getWithNewBitWidth(Alignment.value() * 8);
+      return thisT()->getGatherScatterOpCost(Instruction::Call, PseudoDataTy,
+                                             Args[0], VarMask, Alignment,
+                                             CostKind, I);
+    }
     case Intrinsic::experimental_stepvector: {
       if (isa<ScalableVectorType>(RetTy))
         return BaseT::getIntrinsicInstrCost(ICA, CostKind);
@@ -1980,6 +1990,13 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
       return thisT()->getMaskedMemoryOpCost(Instruction::Load, Ty, TyAlign, 0,
                                             CostKind);
     }
+    case Intrinsic::masked_prefetch: {
+      auto *MaskVT = cast<VectorType>(ICA.getArgTypes()[4]);
+      Type *PseudoTy = MaskVT->getWithNewBitWidth(32);
+      Align TyAlign = thisT()->DL.getABITypeAlign(PseudoTy);
+      return thisT()->getMaskedMemoryOpCost(Instruction::Call, PseudoTy,
+                                            TyAlign, 0, CostKind);
+    }
     case Intrinsic::vector_reduce_add:
       return thisT()->getArithmeticReductionCost(Instruction::Add, VecOpTy,
                                                  std::nullopt, CostKind);
diff --git a/llvm/include/llvm/IR/IRBuilder.h b/llvm/include/llvm/IR/IRBuilder.h
index 8863ca8eba47ef..fdaa3c3d0c9b43 100644
--- a/llvm/include/llvm/IR/IRBuilder.h
+++ b/llvm/include/llvm/IR/IRBuilder.h
@@ -820,6 +820,11 @@ class IRBuilderBase {
   CallInst *CreateMaskedStore(Value *Val, Value *Ptr, Align Alignment,
                               Value *Mask);
 
+  /// Create a call to Masked Prefetch intrinsic
+  CallInst *CreateMaskedPrefetch(Value *Ptr, Value *ElemSize, Value *Mask,
+                                 Value *RW = nullptr, Value *Locality = nullptr,
+                                 const Twine &Name = "");
+
   /// Create a call to Masked Gather intrinsic
   CallInst *CreateMaskedGather(Type *Ty, Value *Ptrs, Align Alignment,
                                Value *Mask = nullptr, Value *PassThru = nullptr,
@@ -829,6 +834,13 @@ class IRBuilderBase {
   CallInst *CreateMaskedScatter(Value *Val, Value *Ptrs, Align Alignment,
                                 Value *Mask = nullptr);
 
+  /// Create a call to Masked Gather Prefetch intrinsic
+  CallInst *CreateMaskedGatherPrefetch(Value *Ptrs, Value *ElemSize,
+                                       Value *Mask = nullptr,
+                                       Value *RW = nullptr,
+                                       Value *Locality = nullptr,
+                                       const Twine &Name = "");
+
   /// Create a call to Masked Expand Load intrinsic
   CallInst *CreateMaskedExpandLoad(Type *Ty, Value *Ptr, Value *Mask = nullptr,
                                    Value *PassThru = nullptr,
diff --git a/llvm/include/llvm/IR/IntrinsicInst.h b/llvm/include/llvm/IR/IntrinsicInst.h
index b8d578d0fee082..9f8deb6dde057f 100644
--- a/llvm/include/llvm/IR/IntrinsicInst.h
+++ b/llvm/include/llvm/IR/IntrinsicInst.h
@@ -1365,6 +1365,118 @@ class AnyMemCpyInst : public AnyMemTransferInst {
   }
 };
 
+/// This class represents the prefetch intrinsic
+/// i.e. llvm.prefetch
+class PrefetchInst : public IntrinsicInst {
+public:
+  static bool classof(const IntrinsicInst *I) {
+    return I->getIntrinsicID() == Intrinsic::prefetch;
+  }
+  static bool classof(const Value *V) {
+    return isa<IntrinsicInst>(V) && classof(cast<IntrinsicInst>(V));
+  }
+
+  Value *getPointerOperand() { return getOperand(0); }
+  const Value *getPointerOperand() const { return getOperand(0); }
+  static unsigned getPointerOperandIndex() { return 0U; }
+  Type *getPointerOperandType() const { return getPointerOperand()->getType(); }
+};
+
+/// A helper function that returns the pointer operand of a prefetch
+/// instruction. Returns nullptr if it is not a prefetch.
+inline const Value *getPrefetchPointerOperand(const Value *V) {
+  if (auto *Prefetch = dyn_cast<PrefetchInst>(V))
+    return Prefetch->getPointerOperand();
+  return nullptr;
+}
+inline Value *getPrefetchPointerOperand(Value *V) {
+  return const_cast<Value *>(
+      getPrefetchPointerOperand(static_cast<const Value *>(V)));
+}
+
+/// A helper function that returns the address space of the pointer operand of
+/// prefetch instruction.
+inline unsigned getPrefetchAddressSpace(Value *I) {
+  assert(isa<PrefetchInst>(I) && "Expected prefetch instruction");
+  auto *PtrTy = dyn_cast<PrefetchInst>(I)->getPointerOperandType();
+  return dyn_cast<PointerType>(PtrTy)->getAddressSpace();
+}
+
+/// A helper function that returns the pseudo type of a prefetch instruction.
+inline Type *getPrefetchPseudoType(Value *I) {
+  assert(isa<PrefetchInst>(I) && "Expected Prefetch instruction");
+  auto *Prefetch = dyn_cast<PrefetchInst>(I);
+
+  // Get type for the following pattern
+  // ex) %1 = add nuw nsw i64 %indvars.iv, 8
+  //     %arrayidx = getelementptr inbounds double, ptr %b, i64 %1
+  //     tail call void @llvm.prefetch.p0(ptr nonnull %arrayidx, i32 0, i32 3,
+  //     i32 1)
+  auto *GEP = dyn_cast<GetElementPtrInst>(Prefetch->getPointerOperand());
+  if (GEP) {
+    auto *ElemTy = GEP->getSourceElementType();
+    if (isa<ArrayType>(ElemTy) || isa<StructType>(ElemTy))
+      return Type::getInt64Ty(I->getContext());
+    return ElemTy;
+  }
+
+  // Get type for the following pattern
+  // ex) %a = alloca [100 x double], align 8
+  //     tail call void @llvm.prefetch.p0(ptr nonnull %a, i32 0, i32 3, i32 1)
+  auto *Alloca = dyn_cast<AllocaInst>(Prefetch->getPointerOperand());
+  if (Alloca) {
+    auto *ElemTy = Alloca->getAllocatedType()->getArrayElementType();
+    if (isa<ArrayType>(ElemTy) || isa<StructType>(ElemTy))
+      return Type::getInt64Ty(I->getContext());
+    return ElemTy;
+  }
+
+  return Type::getInt64Ty(I->getContext());
+}
+
+/// A helper function that returns the pseudo-alignment of prefetch instruction.
+inline Align getPrefetchPseudoAlignment(Value *I) {
+  assert(isa<PrefetchInst>(I) && "Expected Prefetch instruction");
+  auto *Ty = getPrefetchPseudoType(I);
+  return Ty ? Align(Ty->getScalarSizeInBits() >> 3) : Align(1ULL);
+}
+
+/// A helper function that returns the alignment of load/store/prefetch
+/// instruction.
+inline Align getLdStPfAlignment(Value *I) {
+  if (isa<PrefetchInst>(I))
+    return getPrefetchPseudoAlignment(I);
+  return getLoadStoreAlignment(I);
+}
+
+/// A helper function that returns the pointer operand of a load/store/prefetch
+/// instruction. Returns nullptr if it is not a load, store, or prefetch.
+inline const Value *getLdStPfPointerOperand(const Value *I) {
+  if (isa<PrefetchInst>(I))
+    return getPrefetchPointerOperand(I);
+  return getLoadStorePointerOperand(I);
+}
+inline Value *getLdStPfPointerOperand(Value *V) {
+  return const_cast<Value *>(
+      getLdStPfPointerOperand(static_cast<const Value *>(V)));
+}
+
+/// A helper function that returns the address space of the pointer operand of
+/// load/store/prefetch instruction.
+inline unsigned getLdStPfAddressSpace(Value *I) {
+  if (isa<PrefetchInst>(I))
+    return getPrefetchAddressSpace(I);
+  return getLoadStoreAddressSpace(I);
+}
+
+/// A helper function that returns the type of a load/store/prefetch
+/// instruction.
+inline Type *getLdStPfType(Value *I) {
+  if (isa<PrefetchInst>(I))
+    return getPrefetchPseudoType(I);
+  return getLoadStoreType(I);
+}
+
 /// This class represents any memmove intrinsic
 /// i.e. llvm.element.unordered.atomic.memmove
 ///  and llvm.memmove
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index b54c697296b20a..139c1bad7e9596 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2258,6 +2258,21 @@ def int_masked_compressstore:
             [IntrWriteMem, IntrArgMemOnly, IntrWillReturn,
              NoCapture<ArgIndex<1>>]>;
 
+def int_masked_prefetch:
+  DefaultAttrsIntrinsic<[],
+            [llvm_anyptr_ty,
+             llvm_i32_ty, llvm_i32_ty, llvm_i32_ty, llvm_anyvector_ty],
+            [IntrInaccessibleMemOrArgMemOnly, IntrWillReturn,
+             ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>]>;
+
+def int_masked_gather_prefetch:
+  DefaultAttrsIntrinsic<[],
+            [llvm_anyvector_ty,
+             llvm_i32_ty, llvm_i32_ty, llvm_i32_ty,
+             LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+            [IntrInaccessibleMemOrArgMemOnly, IntrWillReturn,
+             ImmArg<ArgIndex<1>>, ImmArg<ArgIndex<2>>]>;
+
 // Test whether a pointer is associated with a type metadata identifier.
 def int_type_test : DefaultAttrsIntrinsic<[llvm_i1_ty], [llvm_ptr_ty, llvm_metadata_ty],
                               [IntrNoMem, IntrWillReturn, IntrSpeculatable]>;
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 67246afa23147a..2051cfc983e147 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -448,6 +448,11 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType,
   return TTIImpl->isLegalMaskedLoad(DataType, Alignment);
 }
 
+bool TargetTransformInfo::isLegalMaskedPrefetch(Type *DataType,
+                                                Align Alignment) const {
+  return TTIImpl->isLegalMaskedPrefetch(DataType, Alignment);
+}
+
 bool TargetTransformInfo::isLegalNTStore(Type *DataType,
                                          Align Alignment) const {
   return TTIImpl->isLegalNTStore(DataType, Alignment);
@@ -496,6 +501,11 @@ bool TargetTransformInfo::isLegalMaskedExpandLoad(Type *DataType) const {
   return TTIImpl->isLegalMaskedExpandLoad(DataType);
 }
 
+bool TargetTransformInfo::isLegalMaskedGatherPrefetch(Type *DataType,
+                                                      Align Alignment) const {
+  return TTIImpl->isLegalMaskedGatherPrefetch(DataType, Alignment);
+}
+
 bool TargetTransformInfo::enableOrderedReductions() const {
   return TTIImpl->enableOrderedReductions();
 }
diff --git a/llvm/lib/Analysis/VectorUtils.cpp b/llvm/lib/Analysis/VectorUtils.cpp
index 5b57f0a25cec81..f4dfdd55c47b11 100644
--- a/llvm/lib/Analysis/VectorUtils.cpp
+++ b/llvm/lib/Analysis/VectorUtils.cpp
@@ -95,6 +95,7 @@ bool llvm::isTriviallyVectorizable(Intrinsic::ID ID) {
   case Intrinsic::fptoui_sat:
   case Intrinsic::lrint:
   case Intrinsic::llrint:
+  case Intrinsic::prefetch:
     return true;
   default:
     return false;
diff --git a/llvm/lib/IR/IRBuilder.cpp b/llvm/lib/IR/IRBuilder.cpp
index b09b80f95871a1..1a539e68743212 100644
--- a/llvm/lib/IR/IRBuilder.cpp
+++ b/llvm/lib/IR/IRBuilder.cpp
@@ -606,6 +606,27 @@ CallInst *IRBuilderBase::CreateMaskedStore(Value *Val, Value *Ptr,
   return CreateMaskedIntrinsic(Intrinsic::masked_store, Ops, OverloadedTypes);
 }
 
+/// Create a call to a Masked Prefetch intrinsic.
+/// \p Ptr      - base pointer for the prefetch
+/// \p ElemSize - element size for memory address generation
+/// \p Mask     - vector of booleans which indicates what vector lanes should
+///               be accessed in memory
+/// \p RW       - Read or Write
+/// \p Locality - Cache Level
+/// \p Name     - name of the result variable
+CallInst *IRBuilderBase::CreateMaskedPrefetch(Value *Ptr, Value *ElemSize,
+                                              Value *Mask, Value *RW,
+                                              Value *Locality,
+                                              const Twine &Name) {
+  auto *PtrTy = cast<PointerType>(Ptr->getType());
+
+  assert(Mask && "Mask should not be null");
+  Type *OverloadedTypes[] = {PtrTy, Mask->getType()};
+  Value *Ops[] = {Ptr, ElemSize, RW, Locality, Mask};
+  return CreateMaskedIntrinsic(Intrinsic::masked_prefetch, Ops, OverloadedTypes,
+                               Name);
+}
+
 /// Create a call to a Masked intrinsic, with given intrinsic Id,
 /// an array of operands - Ops, and an array of overloaded types -
 /// OverloadedTypes.
@@ -712,6 +733,35 @@ CallInst *IRBuilderBase::CreateMaskedCompressStore(Value *Val, Value *Ptr,
                                OverloadedTypes);
 }
 
+/// Create a call to a Masked Gather Prefetch intrinsic.
+/// \p Ptrs     - vector of pointers for prefetch
+/// \p ElemSize - element size for memory address generation
+/// \p Mask     - vector of booleans which indicates what vector lanes should
+///               be accessed in memory
+/// \p RW       - Read or Write
+/// \p Locality - Cache Level
+/// \p Name     - name of the result variable
+CallInst *IRBuilderBase::CreateMaskedGatherPrefetch(Value *Ptrs,
+                                                    Value *ElemSize,
+                                                    Value *Mask, Value *RW,
+                                                    Value *Locality,
+                                                    const Twine &Name) {
+  auto *PtrsTy = cast<VectorType>(Ptrs->getType());
+  ElementCount NumElts = PtrsTy->getElementCount();
+
+  if (!Mask)
+    Mask = Constant::getAllOnesValue(
+        VectorType::get(Type::getInt1Ty(Context), NumElts));
+
+  Type *OverloadedTypes[] = {PtrsTy};
+  Value *Ops[] = {Ptrs, ElemSize, RW, Locality, Mask};
+
+  // We specify only one type when we create this intrinsic. Types of other
+  // arguments are derived from this type.
+  return CreateMaskedIntrinsic(Intrinsic::masked_gather_prefetch, Ops,
+                               OverloadedTypes, Name);
+}
+
 template <typename T0>
 static std::vector<Value *>
 getStatepointArgs(IRBuilderBase &B, uint64_t ID, uint32_t NumPatchBytes,
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 37a356c43e29a4..991e9853017287 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -600,7 +600,7 @@ bool LoopVectorizationLegality::isUniform(Value *V, ElementCount VF) const {
 
 bool LoopVectorizationLegality::isUniformMemOp(Instruction &I,
                                                ElementCount VF) const {
-  Value *Ptr = getLoadStorePointerOperand(&I);
+  Value *Ptr = getLdStPfPointerOperand(&I);
   if (!Ptr)
     return false;
   // Note: There's nothing inherent which prevents predicated loads and
@@ -1289,6 +1289,12 @@ bool LoopVectorizationLegality::blockCanBePredicated(
       continue;
     }
 
+    // Prefetches are handled via masking.
+    if (auto *PF = dyn_cast<PrefetchInst>(&I)) {
+      MaskedOp.insert(PF);
+      continue;
+    }
+
     if (I.mayReadFromMemory() || I.mayWriteToMemory() || I.mayThrow())
       return false;
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f5f04615eedee4..cc9bfa74a674e8 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -1519,19 +1519,29 @@ class LoopVectorizationCostModel {
            TTI.isLegalMaskedLoad(DataType, Alignment);
   }
 
+  /// Returns true if the target machine supports masked prefetch operation
+  /// for the given \p DataType and kind of access to \p Ptr.
+  bool isLegalMaskedPrefetch(Type *DataType, Value *Ptr,
+                             Align Alignment) const {
+    return Legal->isConsecutivePtr(DataType, Ptr) &&
+           TTI.isLegalMaskedPrefetch(DataType, Alignment);
+  }
+
   /// Returns true if the target machine can represent \p V as a masked gather
   /// or scatter operation.
   bool isLegalGatherOrScatter(Value *V, ElementCount VF) {
     bool LI = isa<LoadInst>(V);
     bool SI = isa<StoreInst>(V);
-    if (!LI && !SI)
+    bool PF = isa<PrefetchInst>(V);
+    if (!LI && !SI && !PF)
       return false;
-    auto *Ty = getLoadStoreType(V);
-    Align Align = getLoadStoreAlignment(V);
+    auto *Ty = getLdStPfType(V);
+    Align Align = getLdStPfAlignment(V);
     if (VF.isVector())
       Ty = VectorType::get(Ty, VF);
     return (LI && TTI.isLegalMaskedGather(Ty, Align)) ||
-           (SI && TTI.isLegalMaskedScatter(Ty, Align));
+           (SI && TTI.isLegalMaskedScatter(Ty, Align)) ||
+           (PF && TTI.isLegalMaskedPrefetch(Ty, Align));
   }
 
   /// Returns true if the target machine supports all of the reduction
@@ -4103,11 +4113,21 @@ bool LoopVectorizationCostModel::isScalarWithPredication(
   switch(I->getOpcode()) {
   default:
     return true;
-  case Instruction::Call:
+  case Instruction::Call: {
     if (VF.isScalar())
       return true;
-    return CallWideningDecisions.at(std::make_pair(cast<CallInst>(I), VF))
-               .Kind == CM_Scalarize;
+    if (!isa<PrefetchInst>(I))
+      return CallWideningDecisions.at(std::make_pair(cast<CallInst>(I), VF))
+         .Kind == CM_Scalarize;
+    auto *Ptr = getPrefetchPointerOperand(I);
+    auto *Ty = getPrefetchPseudoType(I);
+    Type *VTy = Ty;
+    if (VF.isVector())
+      VTy = VectorType::get(Ty, VF);
+    const Align Alignment = getPrefetchPseudoAlignment(I);
+    return !(isLegalMaskedPrefetch(Ty, Ptr, Alignment) ||
+             TTI.isLegalMaskedGatherPrefetch(VTy, Alignment));
+  }
   case Instruction::Load:
   case Instruction::Store: {
     auto *Ptr = getLoadStorePointerOperand(I);
@@ -4314,10 +4334,11 @@ bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(
 bool LoopVectorizationCostModel::memoryInstructionCanBeWidened(
     Instruction *I, ElementCount VF) {
   // Get and ensure we have a valid memory instruction.
-  assert((isa<LoadInst, StoreInst>(I)) && "Invalid memory instruction");
+  assert((isa<LoadInst, StoreInst, PrefetchInst>(I)) &&
+         "Invalid memory instruction");
 
-  auto *Ptr = getLoadStorePointerOperand(I);
-  auto *ScalarTy = getLoadStoreType(I);
+  auto *Ptr = getLdStPfPointerOperand(I);
+  auto *ScalarTy = getLdStPfType(I);
 
   // In order to be widened, the pointer should be consecutive, first of all.
   if (!Legal->isConsecutivePtr(ScalarTy, Ptr))
@@ -4334,6 +4355,14 @@ bool LoopVectorizationCostModel::memoryInstructionCanBeWidened(
   if (hasIrregularType(ScalarTy, DL))
     return false;
 
+  // If the instruction is a prefetch, check if it is supported by the target
+  // machine.
+  if (isa<PrefetchInst>(I)) {
+    auto *Ptr = getPrefetchPointerOperand(I);
+    auto *Ty = getPrefetchPseudoType(I);
+    const Align Alignment = getPrefetchPseudoAlignment(I);
+    return isLegalMaskedPrefetch(Ty, Ptr, Alignment);
+  }
   return true;
 }
 
@@ -6181,11 +6210,11 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   if (VF.isScalable())
     return InstructionCost::getInvalid();
 
-  Type *ValTy = getLoadStoreType(I);
+  Type *ValTy = getLdStPfType(I);
   auto SE = PSE.getSE();
 
-  unsigned AS = getLoadStoreAddressSpace(I);
-  Value *Ptr = getLoadStorePointerOperand(I);
+  unsigned AS = getLdStPfAddressSpace(I);
+  Value *Ptr = getLdStPfPointerOperand(I);
   Type *PtrTy = ToVectorTy(Ptr->getType(), VF);
   // NOTE: PtrTy is a vector to signal `TTI::getAddressComputationCost`
   //       that it is being called from this specific place.
@@ -6201,7 +6230,7 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
   // Don't pass *I here, since it is scalar but will actually be part of a
   // vectorized loop where the user of it is a vectorized instruction.
   TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
-  const Align Alignment = getLoadStoreAlignment(I);
+  const Align Alignment = getLdStPfAlignment(I);
   Cost += VF.getKnownMinValue() * TTI.getMemoryOpCost(I->getOpcode(),
                                                       ValTy->getScalarType(),
                                                       Alignment, AS, CostKind);
@@ -6236,16 +6265,16 @@ LoopVectorizationCostModel::getMemInstScalarizationCost(Instruction *I,
 InstructionCost
 LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
                                                     ElementCount VF) {
-  Type *ValTy = getLoadStoreType(I);
+  Type *ValTy = getLdStPfType(I);
   auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
-  Value *Ptr = getLoadStorePointerOperand(I);
-  unsigned AS = getLoadStoreAddressSpace(I);
+  Value *Ptr = getLdStPfPointerOperand(I);
+  unsigned AS = getLdStPfAddressSpace(I);
   int ConsecutiveStride = Legal->isConsecutivePtr(ValTy, Ptr);
   enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
 
   assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) &&
          "Stride should be 1 or -1 for consecutive memory access");
-  const Align Alignment = getLoadStoreAlignment(I);
+  const Align Alignment = getLdStPfAlignment(I);
   InstructionCost Cost = 0;
   if (Legal->isMaskRequired(I)) {
     Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,
@@ -6268,11 +6297,16 @@ LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,
                                                 ElementCount VF) {
   assert(Legal->isUniformMemOp(*I, VF));
 
-  Type *ValTy = getLoadStoreType(I);
+  Type *ValTy = getLdStPfType(I);
   auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
-  const Align Alignment = getLoadStoreAlignment(I);
-  unsigned AS = getLoadStoreAddressSpace(I);
+  const Align Alignment = getLdStPfAlignment(I);
+  unsigned AS = getLdStPfAddressSpace(I);
   enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  if (isa<PrefetchInst>(I)) {
+    return TTI.getAddressComputationCost(ValTy) +
+           TTI.getMemoryOpCost(Instruction::Call, ValTy, Alignment, AS,
+                               CostKind);
+  }
   if (isa<LoadInst>(I)) {
     return TTI.getAddressComputationCost(ValTy) +
            TTI.getMemoryOpCost(Instruction::Load, ValTy, Alignment, AS,
@@ -6294,10 +6328,10 @@ LoopVectorizationCostModel::getUniformMemOpCost(Instruction *I,
 InstructionCost
 LoopVectorizationCostModel::getGatherScatterCost(Instruction *I,
                                                  ElementCount VF) {
-  Type *ValTy = getLoadStoreType(I);
+  Type *ValTy = getLdStPfType(I);
   auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
-  const Align Alignment = getLoadStoreAlignment(I);
-  const Value *Ptr = getLoadStorePointerOperand(I);
+  const Align Alignment = getLdStPfAlignment(I);
+  const Value *Ptr = getLdStPfPointerOperand(I);
 
   return TTI.getAddressComputationCost(VectorTy) +
          TTI.getGatherScatterOpCost(
@@ -6524,9 +6558,9 @@ LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I,
   // Calculate scalar cost only. Vectorization cost should be ready at this
   // moment.
   if (VF.isScalar()) {
-    Type *ValTy = getLoadStoreType(I);
-    const Align Alignment = getLoadStoreAlignment(I);
-    unsigned AS = getLoadStoreAddressSpace(I);
+    Type *ValTy = getLdStPfType(I);
+    const Align Alignment = getLdStPfAlignment(I);
+    unsigned AS = getLdStPfAddressSpace(I);
 
     TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(I->getOperand(0));
     return TTI.getAddressComputationCost(ValTy) +
@@ -6627,7 +6661,7 @@ void LoopVectorizationCostModel::setCostBasedWideningDecision(ElementCount VF) {
   for (BasicBlock *BB : TheLoop->blocks()) {
     // For each instruction in the old loop.
     for (Instruction &I : *BB) {
-      Value *Ptr =  getLoadStorePointerOperand(&I);
+      Value *Ptr = getLdStPfPointerOperand(&I);
       if (!Ptr)
         continue;
 
@@ -6687,7 +6721,7 @@ void LoopVectorizationCostModel::setCostBasedWideningDecision(ElementCount VF) {
       if (memoryInstructionCanBeWidened(&I, VF)) {
         InstructionCost Cost = getConsecutiveMemOpCost(&I, VF);
         int ConsecutiveStride = Legal->isConsecutivePtr(
-            getLoadStoreType(&I), getLoadStorePointerOperand(&I));
+            getLdStPfType(&I), getLdStPfPointerOperand(&I));
         assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) &&
                "Expected consecutive stride.");
         InstWidening Decision =
@@ -7285,8 +7319,23 @@ LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
 
     return TTI.getCastInstrCost(Opcode, VectorTy, SrcVecTy, CCH, CostKind, I);
   }
-  case Instruction::Call:
+  case Instruction::Call: {
+    if (isa<PrefetchInst>(I)) {
+      ElementCount Width = VF;
+      if (Width.isVector()) {
+        InstWidening Decision = getWideningDecision(I, Width);
+        assert(Decision != CM_Unknown &&
+               "CM decision should be taken at this point");
+        if (getWideningCost(I, VF) == InstructionCost::getInvalid())
+          return InstructionCost::getInvalid();
+        if (Decision == CM_Scalarize)
+          Width = ElementCount::getFixed(1);
+      }
+      VectorTy = ToVectorTy(getLdStPfType(I), Width);
+      return getMemoryInstructionCost(I, VF);
+    }
     return getVectorCallCost(cast<CallInst>(I), VF);
+  }
   case Instruction::ExtractValue:
     return TTI.getInstructionCost(I, TTI::TCK_RecipThroughput);
   case Instruction::Alloca:
@@ -8143,7 +8192,7 @@ VPRecipeBase *VPRecipeBuilder::tryToWidenMemory(Instruction *I,
                                                 ArrayRef<VPValue *> Operands,
                                                 VFRange &Range,
                                                 VPlanPtr &Plan) {
-  assert((isa<LoadInst>(I) || isa<StoreInst>(I)) &&
+  assert((isa<LoadInst>(I) || isa<StoreInst>(I) || isa<PrefetchInst>(I)) &&
          "Must be called with either a load or store");
 
   auto willWiden = [&](ElementCount VF) -> bool {
@@ -8185,6 +8234,10 @@ VPRecipeBase *VPRecipeBuilder::tryToWidenMemory(Instruction *I,
     return new VPWidenMemoryInstructionRecipe(*Load, Ptr, Mask, Consecutive,
                                               Reverse);
 
+  if (PrefetchInst *Prefetch = dyn_cast<PrefetchInst>(I))
+    return new VPWidenMemoryInstructionRecipe(*Prefetch, Operands[0], Mask,
+                                              Consecutive, Reverse);
+
   StoreInst *Store = cast<StoreInst>(I);
   return new VPWidenMemoryInstructionRecipe(*Store, Ptr, Operands[0], Mask,
                                             Consecutive, Reverse);
@@ -8590,10 +8643,12 @@ VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
           [&](ElementCount VF) { return VF.isScalar(); }, Range))
     return nullptr;
 
-  if (auto *CI = dyn_cast<CallInst>(Instr))
+  if (isa<CallInst>(Instr) && !isa<PrefetchInst>(Instr)) {
+    auto *CI = dyn_cast<CallInst>(Instr);
     return toVPRecipeResult(tryToWidenCall(CI, Operands, Range, Plan));
+  }
 
-  if (isa<LoadInst>(Instr) || isa<StoreInst>(Instr))
+  if (isa<LoadInst>(Instr) || isa<StoreInst>(Instr) || isa<PrefetchInst>(Instr))
     return toVPRecipeResult(tryToWidenMemory(Instr, Operands, Range, Plan));
 
   if (!shouldWiden(Instr, Range))
@@ -9422,7 +9477,7 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
   if (IsUniform) {
     // If the recipe is uniform across all parts (instead of just per VF), only
     // generate a single instance.
-    if ((isa<LoadInst>(UI) || isa<StoreInst>(UI)) &&
+    if ((isa<LoadInst>(UI) || isa<StoreInst>(UI) || isa<PrefetchInst>(UI)) &&
         all_of(operands(), [](VPValue *Op) {
           return Op->isDefinedOutsideVectorRegions();
         })) {
@@ -9452,6 +9507,16 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
     return;
   }
 
+  // A prefetch of a uniform address only needs the last copy of the
+  // prefetch.
+  if (isa<PrefetchInst>(UI) &&
+      vputils::isUniformAfterVectorization(getOperand(0))) {
+    auto Lane = VPLane::getLastLaneForVF(State.VF);
+    State.ILV->scalarizeInstruction(UI, this, VPIteration(State.UF - 1, Lane),
+                                    State);
+    return;
+  }
+
   // Generate scalar instances for all VF lanes of all UF parts.
   assert(!State.VF.isScalable() && "Can't scalarize a scalable vector");
   const unsigned EndLane = State.VF.getKnownMinValue();
@@ -9466,15 +9531,17 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   // Attempt to issue a wide load.
   LoadInst *LI = dyn_cast<LoadInst>(&Ingredient);
   StoreInst *SI = dyn_cast<StoreInst>(&Ingredient);
+  PrefetchInst *PF = dyn_cast<PrefetchInst>(&Ingredient);
 
-  assert((LI || SI) && "Invalid Load/Store instruction");
+  assert((LI || SI || PF) && "Invalid Load/Store/Prefetch instruction");
   assert((!SI || StoredValue) && "No stored value provided for widened store");
   assert((!LI || !StoredValue) && "Stored value provided for widened load");
+  assert((!PF || !StoredValue) && "Stored value provided for widened prefetch");
 
-  Type *ScalarDataTy = getLoadStoreType(&Ingredient);
+  Type *ScalarDataTy = getLdStPfType(&Ingredient);
 
   auto *DataTy = VectorType::get(ScalarDataTy, State.VF);
-  const Align Alignment = getLoadStoreAlignment(&Ingredient);
+  const Align Alignment = getLdStPfAlignment(&Ingredient);
   bool CreateGatherScatter = !isConsecutive();
 
   auto &Builder = State.Builder;
@@ -9523,6 +9590,40 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     return;
   }
 
+  if (PF) {
+    State.setDebugLocFrom(PF->getDebugLoc());
+
+    Type *ESizeTy = Type::getInt32Ty(PF->getContext());
+    int32_t ESize = ScalarDataTy->getScalarSizeInBits() >> 3;
+    Value *ElemSize = ConstantInt::get(ESizeTy, ESize);
+    Value *RW = PF->getArgOperand(1);
+    Value *Locality = PF->getArgOperand(2);
+
+    for (unsigned Part = 0; Part < State.UF; ++Part) {
+      Instruction *NewPF = nullptr;
+      if (CreateGatherScatter) {
+        Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
+        Value *VectorGep = State.get(getAddr(), Part);
+        NewPF = Builder.CreateMaskedGatherPrefetch(VectorGep, ElemSize,
+                                                   MaskPart, RW, Locality);
+      } else {
+        auto *VecPtr =
+            CreateVecPtr(Part, State.get(getAddr(), VPIteration(0, 0)));
+        if (isMaskRequired)
+          NewPF = Builder.CreateMaskedPrefetch(
+              VecPtr, ElemSize, BlockInMaskParts[Part], RW, Locality);
+        else {
+          auto *MaskPart = Constant::getAllOnesValue(
+              VectorType::get(Type::getInt1Ty(DataTy->getContext()), DataTy));
+          NewPF = Builder.CreateMaskedPrefetch(VecPtr, ElemSize, MaskPart, RW,
+                                               Locality);
+        }
+      }
+      State.addMetadata(NewPF, PF);
+    }
+    return;
+  }
+
   // Handle loads.
   assert(LI && "Must have a load instruction");
   State.setDebugLocFrom(LI->getDebugLoc());
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 7d33baac52c9e6..3f96fe7610d990 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -2028,7 +2028,9 @@ class VPWidenMemoryInstructionRecipe : public VPRecipeBase {
   }
 
   bool isMasked() const {
-    return isStore() ? getNumOperands() == 3 : getNumOperands() == 2;
+    return isPrefetch() ? getNumOperands() == 5
+           : isStore()  ? getNumOperands() == 3
+                        : getNumOperands() == 2;
   }
 
 public:
@@ -2050,6 +2052,14 @@ class VPWidenMemoryInstructionRecipe : public VPRecipeBase {
     setMask(Mask);
   }
 
+  VPWidenMemoryInstructionRecipe(PrefetchInst &Prefetch, VPValue *Addr,
+                                 VPValue *Mask, bool Consecutive, bool Reverse)
+      : VPRecipeBase(VPDef::VPWidenMemoryInstructionSC, {Addr}),
+        Ingredient(Prefetch), Consecutive(Consecutive), Reverse(Reverse) {
+    assert((Consecutive || !Reverse) && "Reverse implies consecutive");
+    setMask(Mask);
+  }
+
   VP_CLASSOF_IMPL(VPDef::VPWidenMemoryInstructionSC)
 
   /// Return the address accessed by this recipe.
@@ -2067,6 +2077,9 @@ class VPWidenMemoryInstructionRecipe : public VPRecipeBase {
   /// Returns true if this recipe is a store.
   bool isStore() const { return isa<StoreInst>(Ingredient); }
 
+  /// Returns true if this recipe is a prefetch.
+  bool isPrefetch() const { return isa<PrefetchInst>(Ingredient); }
+
   /// Return the address accessed by this recipe.
   VPValue *getStoredValue() const {
     assert(isStore() && "Stored value only available for store instructions");
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 76961629aeceb7..a5763a1277c057 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -150,13 +150,17 @@ bool VPRecipeBase::mayHaveSideEffects() const {
   }
   case VPInterleaveSC:
     return mayWriteToMemory();
-  case VPWidenMemoryInstructionSC:
+  case VPWidenMemoryInstructionSC: {
+    auto *R = cast<VPWidenMemoryInstructionRecipe>(this);
+    if (isa<PrefetchInst>(R->getIngredient()))
+      return true;
     assert(cast<VPWidenMemoryInstructionRecipe>(this)
                    ->getIngredient()
                    .mayHaveSideEffects() == mayWriteToMemory() &&
            "mayHaveSideffects result for ingredient differs from this "
            "implementation");
     return mayWriteToMemory();
+  }
   case VPReplicateSC: {
     auto *R = cast<VPReplicateRecipe>(this);
     return R->getUnderlyingInstr()->mayHaveSideEffects();
@@ -1472,7 +1476,7 @@ void VPWidenMemoryInstructionRecipe::print(raw_ostream &O, const Twine &Indent,
                                            VPSlotTracker &SlotTracker) const {
   O << Indent << "WIDEN ";
 
-  if (!isStore()) {
+  if (!isStore() && !isPrefetch()) {
     getVPSingleValue()->printAsOperand(O, SlotTracker);
     O << " = ";
   }



More information about the llvm-commits mailing list