[llvm] [SLP]Reduce number of alternate instruction, where possible (PR #123360)
via llvm-commits
llvm-commits at lists.llvm.org
Fri Jan 17 07:58:52 PST 2025
llvmbot wrote:
@llvm/pr-subscribers-llvm-transforms
Author: Alexey Bataev (alexey-bataev)
Changes:
This patch tries to remove wide alternate operations.
Currently, the SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32
transforms to
%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are simply unused (shuffle indices 8-15 select lanes
from the second input vector, so each wide op contributes only four live
lanes). This leads to increased register pressure and potentially doubles the
number of operations.
The patch introduces a SplitVectorize mode, which splits the operations by
opcode and instead produces something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
This improves performance by reducing the number of operations, and it also
enables other improvements, such as better graph reordering.
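To make the masks concrete: in a shufflevector, indices below the width of the
first input select from it and the remaining indices select from the second,
so the old wide mask keeps only half the lanes of each <8 x i32> op, while the
split mask consumes every lane of the two <4 x i32> ops. A small standalone
sketch (illustrative only, not code from the patch) that computes and prints
both masks:
```
#include <cstdio>
#include <vector>

int main() {
  const unsigned NumLanes = 8;

  // Old scheme: two wide <8 x i32> ops. Lane I takes the add result (lane I
  // of %v1) when I is even and the sub result (lane NumLanes + I of %v2)
  // when I is odd, so half of each wide vector is dead.
  std::vector<int> WideMask;
  for (unsigned I = 0; I < NumLanes; ++I)
    WideMask.push_back(I % 2 == 0 ? int(I) : int(NumLanes + I));

  // New SplitVectorize scheme: a narrow <4 x i32> add (shuffle lanes 0..3)
  // and a narrow <4 x i32> sub (shuffle lanes 4..7); interleaving them
  // restores the original lane order and every computed lane is used.
  std::vector<int> SplitMask;
  for (unsigned I = 0; I < NumLanes / 2; ++I) {
    SplitMask.push_back(int(I));                // add result
    SplitMask.push_back(int(NumLanes / 2 + I)); // sub result
  }

  for (int M : WideMask)
    std::printf("%d ", M); // 0 9 2 11 4 13 6 15
  std::printf("\n");
  for (int M : SplitMask)
    std::printf("%d ", M); // 0 4 1 5 2 6 3 7
  std::printf("\n");
  return 0;
}
```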
-O3+LTO, AVX512
Metric: size..text
Program                                                                                   results      results0     diff
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 277800.00 280536.00 1.0%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 81802.00 82426.00 0.8%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 790552.00 790952.00 0.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 383795.00 383987.00 0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 2075541.00 2076501.00 0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 2075541.00 2076501.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 312702.00 312766.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 2049374.00 2049358.00 -0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1091836.00 1091772.00 -0.0%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 852339.00 852211.00 -0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 190651.00 190523.00 -0.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 44203.00 44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test 12997.00 12981.00 -0.1%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 668971.00 658427.00 -1.6%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 668971.00 658427.00 -1.6%
Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar to x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly more compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar to x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but they
appear to be more performant. E.g., for the function x264_pixel_satd_8x8,
llvm-mca reports better throughput: 84 for the current version vs. 59 for
the new version.
-O3+LTO, march=rva32u64
CINT2017rate/525.x264_r - similar to x86, extra code in the pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like an issue
with shuffle costs)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations
-O3+LTO, mcpu=sifive-p470
Metric: size..text
Program                                                                                   results      results0     diff
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 587336.00 587668.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 643308.00 643614.00 0.0%
test-suite :: MultiSource/Applications/d/make_dparser.test 79678.00 79710.00 0.0%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 277322.00 277420.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 933660.00 933682.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 148038.00 148024.00 -0.0%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 283036.00 283008.00 -0.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 4776.00 4772.00 -0.1%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 540582.00 511772.00 -5.3%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 540582.00 511772.00 -5.3%
CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and shuffles,
saving registers; similar to x86, extra code in pixel_hadamard_ac vectorized,
idct4x4dc not vectorized (an issue with some TTI costs)
---
Patch is 243.92 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123360.diff
21 Files Affected:
- (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+540-49)
- (modified) llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll (+102-76)
- (modified) llvm/test/Transforms/SLPVectorizer/AArch64/gather-with-minbith-user.ll (+5-1)
- (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+96-64)
- (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+5-8)
- (modified) llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll (+232-616)
- (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+2-4)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll (+54-20)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll (+54-20)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll (+64-16)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll (+64-16)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll (+86-20)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll (+86-20)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/buildvector-schedule-for-subvector.ll (+3-1)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/long-full-reg-stores.ll (+3-3)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll (+16-13)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll (+5-5)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll (+2-8)
- (modified) llvm/test/Transforms/SLPVectorizer/X86/splat-score-adjustment.ll (+9-13)
- (modified) llvm/test/Transforms/SLPVectorizer/addsub.ll (+4-8)
- (modified) llvm/test/Transforms/SLPVectorizer/resized-alt-shuffle-after-minbw.ll (+18-17)
``````````diff
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index b0b8f8249d657b..59063d6b4c9bc4 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -1461,6 +1461,7 @@ class BoUpSLP {
VectorizableTree.clear();
ScalarToTreeEntry.clear();
MultiNodeScalars.clear();
+ ScalarsInSplitNodes.clear();
MustGather.clear();
NonScheduledFirst.clear();
EntryToLastInstruction.clear();
@@ -3196,12 +3197,30 @@ class BoUpSLP {
/// \returns Common mask for reorder indices and reused scalars.
SmallVector<int> getCommonMask() const {
+ if (State == TreeEntry::SplitVectorize)
+ return {};
SmallVector<int> Mask;
inversePermutation(ReorderIndices, Mask);
::addMask(Mask, ReuseShuffleIndices);
return Mask;
}
+ /// \returns The mask for split nodes.
+ SmallVector<int> getSplitMask() const {
+ assert(State == TreeEntry::SplitVectorize && !ReorderIndices.empty() &&
+ "Expected only split vectorize node.");
+ SmallVector<int> Mask(getVectorFactor(), PoisonMaskElem);
+ unsigned CommonVF = std::max<unsigned>(
+ CombinedEntriesWithIndices.back().second,
+ Scalars.size() - CombinedEntriesWithIndices.back().second);
+ for (auto [Idx, I] : enumerate(ReorderIndices))
+ Mask[I] =
+ Idx + (Idx >= CombinedEntriesWithIndices.back().second
+ ? CommonVF - CombinedEntriesWithIndices.back().second
+ : 0);
+ return Mask;
+ }
+
/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {
auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
@@ -3293,6 +3312,8 @@ class BoUpSLP {
///< complex node like select/cmp to minmax, mul/add to
///< fma, etc. Must be used for the following nodes in
///< the pattern, not the very first one.
+ SplitVectorize, ///< Splits the node into 2 subnodes, vectorizes them
+ ///< independently and then combines back.
};
EntryState State;
@@ -3324,7 +3345,7 @@ class BoUpSLP {
/// The index of this treeEntry in VectorizableTree.
unsigned Idx = 0;
- /// For gather/buildvector/alt opcode (TODO) nodes, which are combined from
+ /// For gather/buildvector/alt opcode nodes, which are combined from
/// other nodes as a series of insertvector instructions.
SmallVector<std::pair<unsigned, unsigned>, 2> CombinedEntriesWithIndices;
@@ -3471,8 +3492,9 @@ class BoUpSLP {
SmallVectorImpl<Value *> *AltScalars = nullptr) const;
/// Return true if this is a non-power-of-2 node.
- bool isNonPowOf2Vec() const {
- bool IsNonPowerOf2 = !has_single_bit(Scalars.size());
+ bool isNonPowOf2Vec(const TargetTransformInfo &TTI) const {
+ bool IsNonPowerOf2 = !hasFullVectorsOrPowerOf2(
+ TTI, getValueType(Scalars.front()), Scalars.size());
return IsNonPowerOf2;
}
@@ -3530,6 +3552,9 @@ class BoUpSLP {
case CombinedVectorize:
dbgs() << "CombinedVectorize\n";
break;
+ case SplitVectorize:
+ dbgs() << "SplitVectorize\n";
+ break;
}
dbgs() << "MainOp: ";
if (MainOp)
@@ -3611,8 +3636,10 @@ class BoUpSLP {
const EdgeInfo &UserTreeIdx,
ArrayRef<int> ReuseShuffleIndices = {},
ArrayRef<unsigned> ReorderIndices = {}) {
- assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
- (Bundle && EntryState != TreeEntry::NeedToGather)) &&
+ assert(((!Bundle && (EntryState == TreeEntry::NeedToGather ||
+ EntryState == TreeEntry::SplitVectorize)) ||
+ (Bundle && EntryState != TreeEntry::NeedToGather &&
+ EntryState != TreeEntry::SplitVectorize)) &&
"Need to vectorize gather entry?");
// Gathered loads still gathered? Do not create entry, use the original one.
if (GatheredLoadsEntriesFirst.has_value() &&
@@ -3646,12 +3673,29 @@ class BoUpSLP {
return VL[Idx];
});
InstructionsState S = getSameOpcode(Last->Scalars, *TLI);
- if (S)
+ if (S) {
Last->setOperations(S);
+ } else if (EntryState == TreeEntry::SplitVectorize) {
+ auto *MainOp =
+ cast<Instruction>(*find_if(Last->Scalars, IsaPred<Instruction>));
+ auto *AltOp = cast<Instruction>(*find_if(Last->Scalars, [=](Value *V) {
+ auto *I = dyn_cast<Instruction>(V);
+ return I && I->getOpcode() != MainOp->getOpcode();
+ }));
+ Last->setOperations(InstructionsState(MainOp, AltOp));
+ for (Value *V : VL) {
+ auto *I = dyn_cast<Instruction>(V);
+ if (!I)
+ continue;
+ ScalarsInSplitNodes.try_emplace(I, Last);
+ }
+ }
Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
}
- if (!Last->isGather()) {
+ if (!Last->isGather() && Last->State != TreeEntry::SplitVectorize) {
for (Value *V : VL) {
+ if (isa<PoisonValue>(V))
+ continue;
const TreeEntry *TE = getTreeEntry(V);
assert((!TE || TE == Last || doesNotNeedToBeScheduled(V)) &&
"Scalar already in tree!");
@@ -3679,7 +3723,7 @@ class BoUpSLP {
}
}
assert(!BundleMember && "Bundle and VL out of sync");
- } else {
+ } else if (Last->isGather()) {
// Build a map for gathered scalars to the nodes where they are used.
bool AllConstsOrCasts = true;
for (Value *V : VL)
@@ -3745,6 +3789,9 @@ class BoUpSLP {
/// nodes.
SmallDenseMap<Value *, SmallVector<TreeEntry *>> MultiNodeScalars;
+ /// Scalars, used in split vectorize nodes.
+ SmallDenseMap<Value *, TreeEntry *> ScalarsInSplitNodes;
+
/// Maps a value to the proposed vectorizable size.
SmallDenseMap<Value *, unsigned> InstrElementSize;
@@ -5648,12 +5695,14 @@ BoUpSLP::getReorderingData(const TreeEntry &TE, bool TopToBottom) {
}) &&
(TE.ReorderIndices.empty() || isReverseOrder(TE.ReorderIndices)))
return std::nullopt;
- if ((TE.State == TreeEntry::Vectorize ||
- TE.State == TreeEntry::StridedVectorize) &&
- (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
- (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp())))) {
- assert(!TE.isAltShuffle() && "Alternate instructions are only supported by "
- "BinaryOperator and CastInst.");
+ if (TE.State == TreeEntry::SplitVectorize ||
+ ((TE.State == TreeEntry::Vectorize ||
+ TE.State == TreeEntry::StridedVectorize) &&
+ (isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE.getMainOp()) ||
+ (TopToBottom && isa<StoreInst, InsertElementInst>(TE.getMainOp()))))) {
+ assert((TE.State == TreeEntry::SplitVectorize || !TE.isAltShuffle()) &&
+ "Alternate instructions are only supported by "
+ "BinaryOperator and CastInst.");
return TE.ReorderIndices;
}
if (TE.State == TreeEntry::Vectorize && TE.getOpcode() == Instruction::PHI) {
@@ -5938,7 +5987,7 @@ void BoUpSLP::reorderTopToBottom() {
// Patterns like [fadd,fsub] can be combined into a single instruction in
// x86. Reordering them into [fsub,fadd] blocks this pattern. So we need
// to take into account their order when looking for the most used order.
- if (TE->isAltShuffle()) {
+ if (TE->isAltShuffle() && TE->State != TreeEntry::SplitVectorize) {
VectorType *VecTy =
getWidenedType(TE->Scalars[0]->getType(), TE->Scalars.size());
unsigned Opcode0 = TE->getOpcode();
@@ -5976,7 +6025,8 @@ void BoUpSLP::reorderTopToBottom() {
}
VFToOrderedEntries[TE->getVectorFactor()].insert(TE.get());
if (!(TE->State == TreeEntry::Vectorize ||
- TE->State == TreeEntry::StridedVectorize) ||
+ TE->State == TreeEntry::StridedVectorize ||
+ TE->State == TreeEntry::SplitVectorize) ||
!TE->ReuseShuffleIndices.empty())
GathersToOrders.try_emplace(TE.get(), *CurrentOrder);
if (TE->State == TreeEntry::Vectorize &&
@@ -5985,6 +6035,30 @@ void BoUpSLP::reorderTopToBottom() {
}
});
+ auto UpdateSplitUserNode = [&](TreeEntry *UserTE, unsigned Idx,
+ ArrayRef<int> Mask, ArrayRef<int> MaskOrder) {
+ assert(UserTE->State == TreeEntry::SplitVectorize &&
+ "Expected split user node.");
+ SmallVector<int> NewMask(UserTE->getVectorFactor());
+ SmallVector<int> NewMaskOrder(UserTE->getVectorFactor());
+ std::iota(NewMask.begin(), NewMask.end(), 0);
+ std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+ if (Idx == 0) {
+ copy(Mask, NewMask.begin());
+ copy(MaskOrder, NewMaskOrder.begin());
+ } else {
+ assert(Idx == 1 && "Expected either 0 or 1 index.");
+ unsigned Offset = UserTE->CombinedEntriesWithIndices.back().second;
+ for (unsigned I : seq<unsigned>(Mask.size())) {
+ NewMask[I + Offset] = Mask[I] + Offset;
+ NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+ }
+ }
+ reorderScalars(UserTE->Scalars, NewMask);
+ reorderOrder(UserTE->ReorderIndices, NewMaskOrder, /*BottomOrder=*/true);
+ if (isIdentityOrder(UserTE->ReorderIndices))
+ UserTE->ReorderIndices.clear();
+ };
// Reorder the graph nodes according to their vectorization factor.
for (unsigned VF = VectorizableTree.front()->getVectorFactor();
!VFToOrderedEntries.empty() && VF > 1; VF -= 2 - (VF & 1U)) {
@@ -6007,7 +6081,8 @@ void BoUpSLP::reorderTopToBottom() {
for (const TreeEntry *OpTE : OrderedEntries) {
// No need to reorder this nodes, still need to extend and to use shuffle,
// just need to merge reordering shuffle and the reuse shuffle.
- if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE))
+ if (!OpTE->ReuseShuffleIndices.empty() && !GathersToOrders.count(OpTE) &&
+ OpTE->State != TreeEntry::SplitVectorize)
continue;
// Count number of orders uses.
const auto &Order = [OpTE, &GathersToOrders, &AltShufflesToOrders,
@@ -6114,6 +6189,8 @@ void BoUpSLP::reorderTopToBottom() {
// Just do the reordering for the nodes with the given VF.
if (TE->Scalars.size() != VF) {
if (TE->ReuseShuffleIndices.size() == VF) {
+ assert(TE->State != TreeEntry::SplitVectorize &&
+ "Split vectorized not expected.");
// Need to reorder the reuses masks of the operands with smaller VF to
// be able to find the match between the graph nodes and scalar
// operands of the given node during vectorization/cost estimation.
@@ -6121,7 +6198,8 @@ void BoUpSLP::reorderTopToBottom() {
[VF, &TE](const EdgeInfo &EI) {
return EI.UserTE->Scalars.size() == VF ||
EI.UserTE->Scalars.size() ==
- TE->Scalars.size();
+ TE->Scalars.size() ||
+ EI.UserTE->State == TreeEntry::SplitVectorize;
}) &&
"All users must be of VF size.");
if (SLPReVec) {
@@ -6144,19 +6222,29 @@ void BoUpSLP::reorderTopToBottom() {
// Update ordering of the operands with the smaller VF than the given
// one.
reorderNodeWithReuses(*TE, Mask);
+ // Update orders in user split vectorize nodes.
+ for (EdgeInfo &EI : TE->UserTreeIndices) {
+ if (EI.UserTE->State != TreeEntry::SplitVectorize)
+ continue;
+ UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+ }
}
continue;
}
- if ((TE->State == TreeEntry::Vectorize ||
- TE->State == TreeEntry::StridedVectorize) &&
- (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
- InsertElementInst>(TE->getMainOp()) ||
- (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp())))) {
- assert(!TE->isAltShuffle() &&
- "Alternate instructions are only supported by BinaryOperator "
- "and CastInst.");
- // Build correct orders for extract{element,value}, loads and
- // stores.
+ if ((TE->State == TreeEntry::SplitVectorize &&
+ TE->ReuseShuffleIndices.empty()) ||
+ ((TE->State == TreeEntry::Vectorize ||
+ TE->State == TreeEntry::StridedVectorize) &&
+ (isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
+ InsertElementInst>(TE->getMainOp()) ||
+ (SLPReVec && isa<ShuffleVectorInst>(TE->getMainOp()))))) {
+ assert(
+ (!TE->isAltShuffle() || (TE->State == TreeEntry::SplitVectorize &&
+ TE->ReuseShuffleIndices.empty())) &&
+ "Alternate instructions are only supported by BinaryOperator "
+ "and CastInst.");
+ // Build correct orders for extract{element,value}, loads,
+ // stores and alternate (split) nodes.
reorderOrder(TE->ReorderIndices, Mask);
if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
TE->reorderOperands(Mask);
@@ -6177,6 +6265,12 @@ void BoUpSLP::reorderTopToBottom() {
addMask(NewReuses, TE->ReuseShuffleIndices);
TE->ReuseShuffleIndices.swap(NewReuses);
}
+ // Update orders in user split vectorize nodes.
+ for (EdgeInfo &EI : TE->UserTreeIndices) {
+ if (EI.UserTE->State != TreeEntry::SplitVectorize)
+ continue;
+ UpdateSplitUserNode(EI.UserTE, EI.EdgeIdx, Mask, MaskOrder);
+ }
}
}
}
@@ -6189,7 +6283,8 @@ bool BoUpSLP::canReorderOperands(
if (any_of(Edges, [I](const std::pair<unsigned, TreeEntry *> &OpData) {
return OpData.first == I &&
(OpData.second->State == TreeEntry::Vectorize ||
- OpData.second->State == TreeEntry::StridedVectorize);
+ OpData.second->State == TreeEntry::StridedVectorize ||
+ OpData.second->State == TreeEntry::SplitVectorize);
}))
continue;
if (TreeEntry *TE = getVectorizedOperand(UserTE, I)) {
@@ -6207,6 +6302,7 @@ bool BoUpSLP::canReorderOperands(
// node, just reorder reuses mask.
if (TE->State != TreeEntry::Vectorize &&
TE->State != TreeEntry::StridedVectorize &&
+ TE->State != TreeEntry::SplitVectorize &&
TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty())
GatherOps.push_back(TE);
continue;
@@ -6216,6 +6312,7 @@ bool BoUpSLP::canReorderOperands(
[&Gather, UserTE, I](TreeEntry *TE) {
assert(TE->State != TreeEntry::Vectorize &&
TE->State != TreeEntry::StridedVectorize &&
+ TE->State != TreeEntry::SplitVectorize &&
"Only non-vectorized nodes are expected.");
if (any_of(TE->UserTreeIndices,
[UserTE, I](const EdgeInfo &EI) {
@@ -6245,13 +6342,15 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
SmallVector<TreeEntry *> NonVectorized;
for (const std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
if (TE->State != TreeEntry::Vectorize &&
- TE->State != TreeEntry::StridedVectorize)
+ TE->State != TreeEntry::StridedVectorize &&
+ TE->State != TreeEntry::SplitVectorize)
NonVectorized.push_back(TE.get());
if (std::optional<OrdersType> CurrentOrder =
getReorderingData(*TE, /*TopToBottom=*/false)) {
OrderedEntries.insert(TE.get());
if (!(TE->State == TreeEntry::Vectorize ||
- TE->State == TreeEntry::StridedVectorize) ||
+ TE->State == TreeEntry::StridedVectorize ||
+ TE->State == TreeEntry::SplitVectorize) ||
!TE->ReuseShuffleIndices.empty())
GathersToOrders.insert(TE.get());
}
@@ -6270,6 +6369,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
for (TreeEntry *TE : OrderedEntries) {
if (!(TE->State == TreeEntry::Vectorize ||
TE->State == TreeEntry::StridedVectorize ||
+ TE->State == TreeEntry::SplitVectorize ||
(TE->isGather() && GathersToOrders.contains(TE))) ||
TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
!all_of(drop_begin(TE->UserTreeIndices),
@@ -6295,6 +6395,51 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
return Data1.first->Idx > Data2.first->Idx;
});
for (auto &Data : UsersVec) {
+ if (Data.first->State == TreeEntry::SplitVectorize) {
+ assert(
+ Data.second.size() <= 2 &&
+ "Expected not greater than 2 operands for split vectorize node.");
+ if (any_of(Data.second, [](const auto &Op) {
+ return Op.second->UserTreeIndices.size() != 1;
+ }))
+ continue;
+ // Update orders in user split vectorize nodes.
+ for (const auto &P : Data.first->CombinedEntriesWithIndices) {
+ TreeEntry &OpTE = *VectorizableTree[P.first].get();
+ if (OpTE.isGather() || OpTE.ReorderIndices.empty())
+ continue;
+ SmallVector<int> Mask;
+ inversePermutation(OpTE.ReorderIndices, Mask);
+ SmallVector<int> MaskOrder(OpTE.ReorderIndices.size(),
+ PoisonMaskElem);
+ unsigned E = OpTE.ReorderIndices.size();
+ transform(OpTE.ReorderIndices, MaskOrder.begin(), [E](unsigned I) {
+ return I < E ? static_cast<int>(I) : PoisonMaskElem;
+ });
+ SmallVector<int> NewMask(Data.first->getVectorFactor());
+ SmallVector<int> NewMaskOrder(Data.first->getVectorFactor());
+ std::iota(NewMask.begin(), NewMask.end(), 0);
+ std::iota(NewMaskOrder.begin(), NewMaskOrder.end(), 0);
+ if (P.second == 0) {
+ copy(Mask, NewMask.begin());
+ copy(MaskOrder, NewMaskOrder.begin());
+ } else {
+ unsigned Offset = P.second;
+ for (unsigned I : seq<unsigned>(Mask.size())) {
+ NewMask[I + Offset] = Mask[I] + Offset;
+ NewMaskOrder[I + Offset] = MaskOrder[I] + Offset;
+ }
+ }
+ reorderScalars(Data.first->Scalars, NewMask);
+ reorderOrder(Data.first->ReorderIndices, NewMaskOrder,
+ /*BottomOrder=*/true);
+ if (isIdentityOrder(Data.first->ReorderIndices))
+ Data.first->ReorderIndices.clear();
+ // Clear ordering of the operand.
+ OpTE.ReorderIndices.clear();
+ }
+ continue;
+ }
// Check that operands are used only in the User node.
SmallVector<TreeEntry *> GatherOps;
if (!canReorderOperands(Data.first, Data.second, NonVectorized,
@@ -6451,6 +6596,7 @@ void BoUpSLP::reorderBottomToTop(bool IgnoreReorder) {
// Gathers are processed separately.
if (TE->State != TreeEntry::Vectorize &&
TE->State != TreeEntry::StridedVectorize &&
+ TE->State != TreeEntry::SplitVectorize &&
(TE->State != TreeEntry::ScatterVectorize ||
TE->ReorderIndices.empty()))
continue;
@@ -6521,7 +6667,7 @@ void BoUpSLP::buildExternalUses(
TreeEntry *Entry = TEPtr.get();
// No need to handle users of gathered values.
- if (Entry->isGather())
+ if (Entry->isGather() || Entry->State == TreeEntry::SplitVectorize)
continue;
// For each lane:
@@ -8227,6 +8373,142 @@ void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
return;
}
+ // Tries to build split node.
+ auto TrySplitNode = [&, &TTI = *TTI](unsigned SmallNodeSize,
+ const InstructionsState &LocalState) {
+ if (VL.size() <= SmallNodeSize)
+ return false;
+
+ // Any value is used in ...
[truncated]
``````````
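Conceptually, the new TrySplitNode path partitions a two-opcode bundle into
two homogeneous sub-bundles and records the lane order needed to reassemble
the original vector. A simplified sketch of the idea (hypothetical names and
types, not the patch's actual data structures):
```
#include <utility>
#include <vector>

enum class Op { Add, Sub };
struct Scalar {
  Op Opcode;
  int Id;
};

// Partition a mixed add/sub bundle into two homogeneous halves and compute,
// for each position in the concatenated result, the original lane it came
// from (roughly what CombinedEntriesWithIndices + ReorderIndices encode).
std::pair<std::vector<Scalar>, std::vector<unsigned>>
splitByOpcode(const std::vector<Scalar> &VL) {
  std::vector<Scalar> MainOps, AltOps;
  std::vector<unsigned> MainLanes, AltLanes;
  for (unsigned I = 0; I < VL.size(); ++I) {
    if (VL[I].Opcode == Op::Add) {
      MainOps.push_back(VL[I]);
      MainLanes.push_back(I);
    } else {
      AltOps.push_back(VL[I]);
      AltLanes.push_back(I);
    }
  }
  // All main-opcode scalars first, then the alternates: each half can now
  // be vectorized with a single narrow opcode.
  std::vector<Scalar> Split(MainOps);
  Split.insert(Split.end(), AltOps.begin(), AltOps.end());
  std::vector<unsigned> Order(MainLanes);
  Order.insert(Order.end(), AltLanes.begin(), AltLanes.end());
  return {Split, Order};
}
```
For the alternating add/sub bundle from the summary above, this yields
Order = {0, 2, 4, 6, 1, 3, 5, 7}: positions 0-3 of the concatenation come
from the even lanes (the adds) and positions 4-7 from the odd lanes (the
subs), which is exactly what the final <0, 4, 1, 5, 2, 6, 3, 7> interleaving
shuffle undoes.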
https://github.com/llvm/llvm-project/pull/123360
More information about the llvm-commits mailing list