[llvm] [LoadStoreVectorizer] Fill gaps in load/store chains to enable vectorization (PR #159388)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 17 08:46:35 PDT 2025
llvmbot wrote:
@llvm/pr-subscribers-backend-nvptx
Author: Drew Kersnar (dakersnar)
Changes:
This change introduces Gap Filling, an optimization that fills holes in otherwise contiguous load/store chains to enable vectorization. It also introduces Chain Extending, which extends the end of a chain so its length rounds up to the next power of 2.
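To make that concrete, here is a rough sketch (hand-written IR, not taken from the patch's tests; offsets and names are made up) of the load case: three i32 loads with a one-element hole become a single `<4 x i32>` load once the gap is filled.

```llvm
define i32 @gap_fill_example(ptr %base) {
  ; i32 loads at offsets 0, 4, and 12: the 4-byte hole at offset 8 blocks
  ; vectorization today.
  %p1 = getelementptr inbounds i8, ptr %base, i64 4
  %p3 = getelementptr inbounds i8, ptr %base, i64 12
  %v0 = load i32, ptr %base, align 16
  %v1 = load i32, ptr %p1, align 4
  %v3 = load i32, ptr %p3, align 4
  ; With gap filling, the LSV inserts an extra load for offset 8 and the
  ; chain is emitted roughly as:
  ;   %vec = load <4 x i32>, ptr %base, align 16
  ;   %v0  = extractelement <4 x i32> %vec, i32 0
  ;   %v1  = extractelement <4 x i32> %vec, i32 1
  ;   %v3  = extractelement <4 x i32> %vec, i32 3
  ; The filled lane (index 2) is simply never used.
  %s0 = add i32 %v0, %v1
  %s1 = add i32 %s0, %v3
  ret i32 %s1
}
```

Chain extending covers the related case where, say, three contiguous i32 loads (offsets 0, 4, 8) are rounded up to a `<4 x i32>` load with the fourth lane unused.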
This was originally motivated by the NVPTX target, but I have tried to generalize it so it applies to any target that uses the LSV. One way I did so was by introducing a new TTI API, "isLegalToWidenLoads", which lets targets opt in to these optimizations. I'm more than willing to make adjustments to improve how target-agnostic this change is. I fully expect there are some issues and encourage feedback on how to improve things.
For stores, which, unlike loads, cannot be filled or extended without consequence, we only perform the optimization when we can generate a legal llvm.masked.store intrinsic, masking off the additional elements. Determining legality for stores is a little tricky on the NVPTX side, because these intrinsics are only supported for 256-bit vectors. See the other PR I opened for the implementation of the NVPTX lowering of masked store intrinsics, which includes NVPTX TTI changes that return true for isLegalMaskedStore under certain conditions: https://github.com/llvm/llvm-project/pull/159387. This change depends on that backend change, but I expect this one will require more discussion, so I am putting them both up at the same time. The backend change will be merged first, assuming both are approved.
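As an illustration of the store side (a hand-written sketch, not taken from the patch's tests; function names and constants are made up), padding seven contiguous i32 stores up to a 256-bit vector and masking off the extra lane could look roughly like this:

```llvm
declare void @llvm.masked.store.v8i32.p0(<8 x i32>, ptr, i32, <8 x i1>)

define void @store_pad_example(ptr %base, <7 x i32> %vals) {
  ; Seven contiguous i32 stores are padded with one extra (poison) element to
  ; form a 256-bit <8 x i32> vector; the final lane is disabled in the mask,
  ; so no byte beyond the original seven stores is written.
  %vec = shufflevector <7 x i32> %vals, <7 x i32> poison,
         <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  call void @llvm.masked.store.v8i32.p0(<8 x i32> %vec, ptr %base, i32 32,
         <8 x i1> <i1 true, i1 true, i1 true, i1 true,
                   i1 true, i1 true, i1 true, i1 false>)
  ret void
}
```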
---
Patch is 95.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/159388.diff
15 Files Affected:
- (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+6)
- (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
- (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
- (modified) llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h (+2)
- (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+383-52)
- (modified) llvm/test/CodeGen/NVPTX/LoadStoreVectorizer.ll (+21-19)
- (modified) llvm/test/CodeGen/NVPTX/param-vectorize-device.ll (+2-4)
- (modified) llvm/test/CodeGen/NVPTX/variadics-backend.ll (+1-1)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/extend-chain.ll (+81)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-cleanup.ll (+37)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-invariant.ll (+83)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill-vectors.ll (+186)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/gap-fill.ll (+194)
- (added) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/masked-store.ll (+541)
- (modified) llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i8.ll (+1-2)
``````````diff
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 41ff54f0781a2..f8f134c833ea2 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -817,6 +817,12 @@ class TargetTransformInfo {
LLVM_ABI bool isLegalMaskedLoad(Type *DataType, Align Alignment,
unsigned AddressSpace) const;
+ /// Return true if it is legal to widen loads beyond their current width,
+ /// assuming the result is still well-aligned. For example, converting a load
+ /// i32 to a load i64, or vectorizing three contiguous load i32s into a load
+ /// <4 x i32>.
+ LLVM_ABI bool isLegalToWidenLoads() const;
+
/// Return true if the target supports nontemporal store.
LLVM_ABI bool isLegalNTStore(Type *DataType, Align Alignment) const;
/// Return true if the target supports nontemporal load.
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 566e1cf51631a..55bd4bd709589 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -318,6 +318,8 @@ class TargetTransformInfoImplBase {
return false;
}
+ virtual bool isLegalToWidenLoads() const { return false; }
+
virtual bool isLegalNTStore(Type *DataType, Align Alignment) const {
// By default, assume nontemporal memory stores are available for stores
// that are aligned and have a size that is a power of 2.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 09b50c5270e57..89cda79558057 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -476,6 +476,10 @@ bool TargetTransformInfo::isLegalMaskedLoad(Type *DataType, Align Alignment,
return TTIImpl->isLegalMaskedLoad(DataType, Alignment, AddressSpace);
}
+bool TargetTransformInfo::isLegalToWidenLoads() const {
+ return TTIImpl->isLegalToWidenLoads();
+}
+
bool TargetTransformInfo::isLegalNTStore(Type *DataType,
Align Alignment) const {
return TTIImpl->isLegalNTStore(DataType, Alignment);
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
index b32d931bd3074..d56cff1ce3695 100644
--- a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
+++ b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -72,6 +72,8 @@ class NVPTXTTIImpl final : public BasicTTIImplBase<NVPTXTTIImpl> {
return isLegalToVectorizeLoadChain(ChainSizeInBytes, Alignment, AddrSpace);
}
+ bool isLegalToWidenLoads() const override { return true; };
+
// NVPTX has infinite registers of all kinds, but the actual machine doesn't.
// We conservatively return 1 here which is just enough to enable the
// vectorizers but disables heuristics based on the number of registers.
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 7b5137b0185ab..04f4e92826a52 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -119,6 +119,29 @@ using namespace llvm;
#define DEBUG_TYPE "load-store-vectorizer"
+cl::opt<bool>
+ ExtendLoads("vect-extend-loads", cl::Hidden,
+ cl::desc("Load more elements if the target VF is higher "
+ "than the chain length."),
+ cl::init(true));
+
+cl::opt<bool> ExtendStores(
+ "vect-extend-stores", cl::Hidden,
+ cl::desc("Store more elements if the target VF is higher "
+ "than the chain length and we have access to masked stores."),
+ cl::init(true));
+
+cl::opt<bool> FillLoadGaps(
+ "vect-fill-load-gaps", cl::Hidden,
+ cl::desc("Should Loads be introduced in gaps to enable vectorization."),
+ cl::init(true));
+
+cl::opt<bool>
+ FillStoreGaps("vect-fill-store-gaps", cl::Hidden,
+ cl::desc("Should Stores be introduced in gaps to enable "
+ "vectorization into masked stores."),
+ cl::init(true));
+
STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
@@ -246,12 +269,16 @@ class Vectorizer {
const DataLayout &DL;
IRBuilder<> Builder;
- // We could erase instrs right after vectorizing them, but that can mess up
- // our BB iterators, and also can make the equivalence class keys point to
- // freed memory. This is fixable, but it's simpler just to wait until we're
- // done with the BB and erase all at once.
+ /// We could erase instrs right after vectorizing them, but that can mess up
+ /// our BB iterators, and also can make the equivalence class keys point to
+ /// freed memory. This is fixable, but it's simpler just to wait until we're
+ /// done with the BB and erase all at once.
SmallVector<Instruction *, 128> ToErase;
+ /// We insert load/store instructions and GEPs to fill gaps and extend chains
+ /// to enable vectorization. Keep track and delete them later.
+ DenseSet<Instruction *> ExtraElements;
+
public:
Vectorizer(Function &F, AliasAnalysis &AA, AssumptionCache &AC,
DominatorTree &DT, ScalarEvolution &SE, TargetTransformInfo &TTI)
@@ -344,6 +371,28 @@ class Vectorizer {
/// Postcondition: For all i, ret[i][0].second == 0, because the first instr
/// in the chain is the leader, and an instr touches distance 0 from itself.
std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+ /// Is a load/store with this alignment allowed by TTI and at least as fast
+ /// as an unvectorized load/store.
+ bool accessIsAllowedAndFast(unsigned SizeBytes, unsigned AS, Align Alignment,
+ unsigned VecElemBits) const;
+
+ /// Before attempting to fill gaps, check if the chain is a candidate for
+ /// a masked store, to save compile time if it is not possible for the address
+ /// space and element type.
+ bool shouldAttemptMaskedStore(const ArrayRef<ChainElem> C) const;
+
+ /// Create a new GEP and a new Load/Store instruction such that the GEP
+ /// is pointing at PrevElem + Offset. In the case of stores, store poison.
+ /// Extra elements will either be combined into a vector/masked store or
+ /// deleted before the end of the pass.
+ ChainElem createExtraElementAfter(const ChainElem &PrevElem, APInt Offset,
+ StringRef Prefix,
+ Align Alignment = Align(1));
+
+ /// Delete dead GEPs and extra Load/Store instructions created by
+ /// createExtraElementAfter
+ void deleteExtraElements();
};
class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -457,12 +506,21 @@ bool Vectorizer::run() {
Changed |= runOnPseudoBB(*It, *std::next(It));
for (Instruction *I : ToErase) {
+ // These will get deleted in deleteExtraElements.
+ // This is because ExtraElements will include both extra elements
+ // that *were* vectorized and extra elements that *were not*
+ // vectorized. ToErase will only include extra elements that *were*
+ // vectorized, so in order to avoid double deletion we skip them here and
+ // handle them in deleteExtraElements.
+ if (ExtraElements.contains(I))
+ continue;
auto *PtrOperand = getLoadStorePointerOperand(I);
if (I->use_empty())
I->eraseFromParent();
RecursivelyDeleteTriviallyDeadInstructions(PtrOperand);
}
ToErase.clear();
+ deleteExtraElements();
}
return Changed;
@@ -623,6 +681,29 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
dumpChain(C);
});
+ // If the chain is not contiguous, we try to fill the gap with "extra"
+ // elements to artificially make it contiguous, to try to enable
+ // vectorization.
+ // - Filling gaps in loads is always ok if the target supports widening loads.
+ // - For stores, we only fill gaps if there is a potentially legal masked
+ // store for the target. If later on, we don't end up with a chain that
+ // could be vectorized into a legal masked store, the chains with extra
+ // elements will be filtered out in splitChainByAlignment.
+ bool TryFillGaps = isa<LoadInst>(C[0].Inst)
+ ? (FillLoadGaps && TTI.isLegalToWidenLoads())
+ : (FillStoreGaps && shouldAttemptMaskedStore(C));
+
+ unsigned ASPtrBits =
+ DL.getIndexSizeInBits(getLoadStoreAddressSpace(C[0].Inst));
+
+ // Compute the alignment of the leader of the chain (which every stored offset
+ // is based on) using the current first element of the chain. This is
+ // conservative, we may be able to derive better alignment by iterating over
+ // the chain and finding the leader.
+ Align LeaderOfChainAlign =
+ commonAlignment(getLoadStoreAlignment(C[0].Inst),
+ C[0].OffsetFromLeader.abs().getLimitedValue());
+
std::vector<Chain> Ret;
Ret.push_back({C.front()});
@@ -633,7 +714,8 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
unsigned SzBits = DL.getTypeSizeInBits(getLoadStoreType(&*Prev.Inst));
assert(SzBits % 8 == 0 && "Non-byte sizes should have been filtered out by "
"collectEquivalenceClass");
- APInt PrevReadEnd = Prev.OffsetFromLeader + SzBits / 8;
+ APInt PrevSzBytes = APInt(ASPtrBits, SzBits / 8);
+ APInt PrevReadEnd = Prev.OffsetFromLeader + PrevSzBytes;
// Add this instruction to the end of the current chain, or start a new one.
bool AreContiguous = It->OffsetFromLeader == PrevReadEnd;
@@ -642,10 +724,54 @@ std::vector<Chain> Vectorizer::splitChainByContiguity(Chain &C) {
<< *Prev.Inst << " (ends at offset " << PrevReadEnd
<< ") -> " << *It->Inst << " (starts at offset "
<< It->OffsetFromLeader << ")\n");
- if (AreContiguous)
+
+ if (AreContiguous) {
CurChain.push_back(*It);
- else
- Ret.push_back({*It});
+ continue;
+ }
+
+ // For now, we aren't filling gaps between load/stores of different sizes.
+ // Additionally, as a conservative heuristic, we only fill gaps of 1-2
+ // elements. Generating loads/stores with too many unused bytes has a side
+ // effect of increasing register pressure (on NVIDIA targets at least),
+ // which could cancel out the benefits of reducing number of load/stores.
+ if (TryFillGaps &&
+ SzBits == DL.getTypeSizeInBits(getLoadStoreType(It->Inst))) {
+ APInt OffsetOfGapStart = Prev.OffsetFromLeader + PrevSzBytes;
+ APInt GapSzBytes = It->OffsetFromLeader - OffsetOfGapStart;
+ if (GapSzBytes == PrevSzBytes) {
+ // There is a single gap between Prev and Curr, create one extra element
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ CurChain.push_back(NewElem);
+ CurChain.push_back(*It);
+ continue;
+ }
+ // There are two gaps between Prev and Curr, only create two extra
+ // elements if Prev is the first element in a sequence of four.
+ // This has the highest chance of resulting in a beneficial vectorization.
+ if ((GapSzBytes == 2 * PrevSzBytes) && (CurChain.size() % 4 == 1)) {
+ ChainElem NewElem1 = createExtraElementAfter(
+ Prev, PrevSzBytes, "GapFill",
+ commonAlignment(LeaderOfChainAlign,
+ OffsetOfGapStart.abs().getLimitedValue()));
+ ChainElem NewElem2 = createExtraElementAfter(
+ NewElem1, PrevSzBytes, "GapFill",
+ commonAlignment(
+ LeaderOfChainAlign,
+ (OffsetOfGapStart + PrevSzBytes).abs().getLimitedValue()));
+ CurChain.push_back(NewElem1);
+ CurChain.push_back(NewElem2);
+ CurChain.push_back(*It);
+ continue;
+ }
+ }
+
+ // The chain is not contiguous and cannot be made contiguous with gap
+ // filling, so we need to start a new chain.
+ Ret.push_back({*It});
}
// Filter out length-1 chains, these are uninteresting.
@@ -721,6 +847,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
+ // For compile time reasons, we cache whether or not the superset
+ // of all candidate chains contains any extra stores from earlier gap
+ // filling.
+ bool CandidateChainsMayContainExtraStores =
+ !IsLoadChain && any_of(C, [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
std::vector<Chain> Ret;
for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
// Find candidate chains of size not greater than the largest vector reg.
@@ -769,41 +903,6 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
- // Is a load/store with this alignment allowed by TTI and at least as fast
- // as an unvectorized load/store?
- //
- // TTI and F are passed as explicit captures to WAR an MSVC misparse (??).
- auto IsAllowedAndFast = [&, SizeBytes = SizeBytes, &TTI = TTI,
- &F = F](Align Alignment) {
- if (Alignment.value() % SizeBytes == 0)
- return true;
- unsigned VectorizedSpeed = 0;
- bool AllowsMisaligned = TTI.allowsMisalignedMemoryAccesses(
- F.getContext(), SizeBytes * 8, AS, Alignment, &VectorizedSpeed);
- if (!AllowsMisaligned) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " is misaligned, and therefore can't be vectorized.\n");
- return false;
- }
-
- unsigned ElementwiseSpeed = 0;
- (TTI).allowsMisalignedMemoryAccesses((F).getContext(), VecElemBits, AS,
- Alignment, &ElementwiseSpeed);
- if (VectorizedSpeed < ElementwiseSpeed) {
- LLVM_DEBUG(dbgs()
- << "LSV: Access of " << SizeBytes << "B in addrspace "
- << AS << " with alignment " << Alignment.value()
- << " has relative speed " << VectorizedSpeed
- << ", which is lower than the elementwise speed of "
- << ElementwiseSpeed
- << ". Therefore this access won't be vectorized.\n");
- return false;
- }
- return true;
- };
-
// If we're loading/storing from an alloca, align it if possible.
//
// FIXME: We eagerly upgrade the alignment, regardless of whether TTI
@@ -818,8 +917,7 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
isa<AllocaInst>(PtrOperand->stripPointerCasts());
Align Alignment = getLoadStoreAlignment(C[CBegin].Inst);
Align PrefAlign = Align(StackAdjustedAlignment);
- if (IsAllocaAccess && Alignment.value() % SizeBytes != 0 &&
- IsAllowedAndFast(PrefAlign)) {
+ if (IsAllocaAccess && Alignment.value() % SizeBytes != 0) {
Align NewAlign = getOrEnforceKnownAlignment(
PtrOperand, PrefAlign, DL, C[CBegin].Inst, nullptr, &DT);
if (NewAlign >= Alignment) {
@@ -831,7 +929,59 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
}
}
- if (!IsAllowedAndFast(Alignment)) {
+ Chain ExtendingLoadsStores;
+ bool ExtendChain = IsLoadChain
+ ? ExtendLoads
+ : ExtendStores;
+ if (ExtendChain && NumVecElems < TargetVF && NumVecElems % 2 != 0 &&
+ VecElemBits >= 8) {
+ // TargetVF may be a lot higher than NumVecElems,
+ // so only extend to the next power of 2.
+ assert(VecElemBits % 8 == 0);
+ unsigned VecElemBytes = VecElemBits / 8;
+ unsigned NewNumVecElems = PowerOf2Ceil(NumVecElems);
+ unsigned NewSizeBytes = VecElemBytes * NewNumVecElems;
+
+ assert(NewNumVecElems <= TargetVF);
+
+ LLVM_DEBUG(dbgs() << "LSV: attempting to extend chain of "
+ << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores") << " to "
+ << NewNumVecElems << " elements\n");
+ // Do not artificially increase the chain if it becomes misaligned,
+ // otherwise we may unnecessarily split the chain when the target actually
+ // supports non-pow2 VF.
+ if (accessIsAllowedAndFast(NewSizeBytes, AS, Alignment, VecElemBits) &&
+ ((IsLoadChain ? TTI.isLegalToWidenLoads()
+ : TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NewNumVecElems),
+ Alignment, AS, /*IsMaskConstant=*/true)))) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: extending " << (IsLoadChain ? "load" : "store")
+ << " chain of " << NumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << SizeBytes << " to "
+ << NewNumVecElems << " "
+ << (IsLoadChain ? "loads" : "stores")
+ << " with total byte size of " << NewSizeBytes
+ << ", TargetVF=" << TargetVF << " \n");
+
+ unsigned ASPtrBits = DL.getIndexSizeInBits(AS);
+ ChainElem Prev = C[CEnd];
+ for (unsigned i = 0; i < (NewNumVecElems - NumVecElems); i++) {
+ ChainElem NewElem = createExtraElementAfter(
+ Prev, APInt(ASPtrBits, VecElemBytes), "Extend");
+ ExtendingLoadsStores.push_back(NewElem);
+ Prev = ExtendingLoadsStores.back();
+ }
+
+ // Update the size and number of elements for upcoming checks.
+ SizeBytes = NewSizeBytes;
+ NumVecElems = NewNumVecElems;
+ }
+ }
+
+ if (!accessIsAllowedAndFast(SizeBytes, AS, Alignment, VecElemBits)) {
LLVM_DEBUG(
dbgs() << "LSV: splitChainByAlignment discarding candidate chain "
"because its alignment is not AllowedAndFast: "
@@ -849,10 +999,41 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
continue;
}
+ if (CandidateChainsMayContainExtraStores) {
+ // The legality of adding extra stores to ExtendingLoadsStores has
+ // already been checked, but if the candidate chain contains extra
+ // stores from an earlier optimization, confirm legality now.
+ // This filter is essential because, when filling gaps in
+ // splitChainByContinuity, we queried the API to check that (for a given
+ // element type and address space) there *may* be a legal masked store
+ // we can try to create. Now, we need to check if the actual chain we
+ // ended up with is legal to turn into a masked store.
+ // This is relevant for NVPTX targets, for example, where a masked store
+ // is only legal if we have ended up with a 256-bit vector.
+ bool CandidateChainContainsExtraStores = llvm::any_of(
+ ArrayRef<ChainElem>(C).slice(CBegin, CEnd - CBegin + 1),
+ [this](const ChainElem &E) {
+ return ExtraElements.contains(E.Inst);
+ });
+
+ if (CandidateChainContainsExtraStores &&
+ !TTI.isLegalMaskedStore(
+ FixedVectorType::get(VecElemTy, NumVecElems), Alignment, AS,
+ /*IsMaskConstant=*/true)) {
+ LLVM_DEBUG(dbgs()
+ << "LSV: splitChainByAlignment discarding candidate chain "
+ "because it contains extra stores that we cannot "
+ "legally vectorize into a masked store \n");
+ continue;
+ }
+ }
+
// Hooray, we can vect...
[truncated]
``````````
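For reference while reading the splitChainByContiguity hunk above: a two-missing-element hole is only filled when the preceding element would start a group of four. A hand-written sketch of that case (not from the patch's tests; offsets are made up):

```llvm
define i32 @two_element_gap(ptr %base) {
  ; i32 loads at offsets 0 and 12 leave an 8-byte hole (two missing elements).
  ; Because the load at offset 0 would be the first element of a group of
  ; four, both holes are filled and the chain can become one <4 x i32> load
  ; in which lanes 1 and 2 are unused.
  %p3 = getelementptr inbounds i8, ptr %base, i64 12
  %v0 = load i32, ptr %base, align 16
  %v3 = load i32, ptr %p3, align 4
  %sum = add i32 %v0, %v3
  ret i32 %sum
}
```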
https://github.com/llvm/llvm-project/pull/159388