[llvm] [vectorization] More flexibility for VFxIC (PR #138709)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Jun 18 02:18:54 PDT 2025
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-llvm-transforms
Author: Serval MARTINOT-LAGARDE (Serval6)
<details>
<summary>Changes</summary>
The vectorization factor (VF) and interleave count (IC) are key to obtaining good performance.
As of today, VF and IC can be controlled:
(1) At the compilation-unit level through compiler flags, `-mllvm -force-vector-width=8 -mllvm -force-vector-interleave=2`
(2) At the loop level through pragmas, `#pragma clang loop vectorize_width(2) interleave_count(2)`
Approach (1) is only used in a "test mode": if the VF is not valid, it can either crash or fall back to a scalar version.
Approach (2) is effective but can only be deployed at the loop level, based on user expertise. Finding the best (VF x IC) combination for a given loop is architecture/machine dependent and must be redone for every new architecture/machine. This is both error-prone and time-consuming. Also, developers only do this fine-tuning for hotspots, leaving room for improvement in other loops.
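For reference, approach (2) looks like this in source code. This is a minimal sketch: the function name `addInPlace` and the loop body are illustrative assumptions, and only the pragma itself comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise a[i] += b[i]. The pragma asks Clang's loop vectorizer for
// VF=2 and IC=2 on this one loop only; compilers that do not know the
// pragma ignore it (possibly with a warning).
void addInPlace(std::vector<float> &a, const std::vector<float> &b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
#pragma clang loop vectorize_width(2) interleave_count(2)
  for (std::size_t i = 0; i < n; ++i)
    a[i] += b[i];
}
```

The hint is attached to a single loop, so every loop to be tuned needs its own annotation and its own exploration of (VF x IC).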
When applications are flat (e.g., weather forecast code), this approach is no longer viable since no real hotspot exists.
Based on that, we developed (see the just-accepted paper at ISC'25 [^1]) a new technique that makes it easier to automate testing various (VF x IC) combinations at the compilation-unit level.
Two new compilation flags are proposed:
- `clang -mllvm -change-vectorize-to-custom-IC=X`: sets the custom interleave count
- `clang -mllvm -change-vectorize-to-custom-VF=Y`: sets the custom vectorization factor
For each loop in the compilation unit, the vectorizer uses these custom factors if vectorization is enabled and the custom factors are both valid and profitable. By convention, setting a factor to 0 keeps the default value chosen by the LoopVectorizer.
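The per-loop decision described above can be sketched as a small standalone model. This is not the LLVM code: `Factors` and `pickFactors` are hypothetical names, plain `unsigned` values stand in for LLVM's `ElementCount` and `VectorizationFactor`, and a single membership test in a list of profitable VFs stands in for both the validity and the profitability checks.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-in for the (VF, IC) pair chosen for a loop.
struct Factors {
  unsigned VF;
  unsigned IC;
};

// Returns the factors to use for one loop.
// Default       : what the vectorizer chose on its own.
// CustomVF/IC   : values of the two flags; 0 means "keep the default".
// ProfitableVFs : VFs the cost model considers valid and profitable.
Factors pickFactors(Factors Default, unsigned CustomVF, unsigned CustomIC,
                    const std::vector<unsigned> &ProfitableVFs) {
  if (CustomVF == 0 && CustomIC == 0)
    return Default; // Neither flag set: nothing to override.

  // A custom VF of 0 means the default VF is the candidate.
  const unsigned Candidate = CustomVF == 0 ? Default.VF : CustomVF;
  const bool Profitable =
      std::find(ProfitableVFs.begin(), ProfitableVFs.end(), Candidate) !=
      ProfitableVFs.end();
  if (!Profitable)
    return Default; // Candidate rejected: keep the vectorizer's choice.

  Factors Result{Candidate, Default.IC};
  if (CustomIC > 0)
    Result.IC = CustomIC; // The custom IC applies only with an accepted VF.
  return Result;
}
```

In this model, `pickFactors({4, 2}, 8, 0, {2, 4, 8})` yields VF 8 with the default IC 2, while a request for VF 16 against the same list leaves the default `{4, 2}` untouched.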
This approach assumes that a good (VF x IC) combination exists at the application level: this is a reasonable assumption for code that always manipulates the same data type (the typical case for HPC applications).
In the paper, experiments are reported on a subset of LORE [^2] and several mini-apps. They show both the maturity of the approach and the potential gain from this exhaustive exploration. The table below shows the potential gain on various benchmarks (with respect to `mcpu-generic-vf0-ic0`):
| Bench | Neoverse-N1 | Neoverse-V1 | Neoverse-V2 |
| ------------------- | ----------- | ----------- | ----------- |
| LORE[^2] | -6.17% | -22.34% | -15.97% |
| MiniBUDE[^3] | -4% | -18% | -16% |
| CloverLeaf[^4] | -0.4% | -2% | -3% |
| Hydro[^5] | -128% | -2% | -0.1% |
| LULESH[^6] | -15% | -1% | -5% |
| Dwarf-p-cloudsc[^7] | -2% | -11% | -12% |
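The speedup in this table is computed with the formula `speedup = 1 - ref_time/mut_time`, where `ref_time` is the time of the `mcpu-generic-vf0-ic0` run and `mut_time` is the time of the best mutation. With this formula, a negative value means the best mutation ran faster than the reference.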
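As a worked instance of the formula, the helper below computes the metric from two timings. The function name `speedup` and the timings in the comments are illustrative assumptions, not measurements from the paper.

```cpp
// Metric used in the table: 1 - ref_time / mut_time.
// refTime: run time of the reference build (mcpu-generic-vf0-ic0).
// mutTime: run time of the best (VF x IC) mutation.
// Both times must be in the same unit, and mutTime must be non-zero.
double speedup(double refTime, double mutTime) {
  return 1.0 - refTime / mutTime;
}

// Illustrative values only:
//   speedup(10.0, 8.0)  == -0.25  (mutation faster, reported as -25%)
//   speedup(10.0, 10.0) ==  0.0   (no change)
```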
[^1]: https://ieeexplore.ieee.org/document/11018308
[^2]: https://vectorization.computer/
[^3]: https://github.com/UoB-HPC/miniBUDE
[^4]: https://github.com/UK-MAC/CloverLeaf
[^5]: https://github.com/HydroBench/Hydro
[^6]: https://github.com/LLNL/LULESH
[^7]: https://github.com/ecmwf-ifs/dwarf-p-cloudsc
---
Full diff: https://github.com/llvm/llvm-project/pull/138709.diff
2 Files Affected:
- (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+10)
- (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+39-1)
``````````diff
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index 8f6a73d0a2dd8..430dd2277c973 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -444,6 +444,16 @@ class LoopVectorizationPlanner {
/// all profitable VFs in ProfitableVFs.
VectorizationFactor computeBestVF();
+ /// Search in \p ProfitableVFs if the selected \p VF exists.
+ std::optional<VectorizationFactor> getProfitableVF(ElementCount VF) const {
+ if (!hasPlanWithVF(VF))
+ return std::nullopt;
+ for (const auto &P : ProfitableVFs)
+ if (P.Width == VF)
+ return P;
+ return std::nullopt;
+ }
+
/// Generate the IR code for the vectorized loop captured in VPlan \p BestPlan
/// according to the best selected \p VF and \p UF.
///
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index a28cda9fe62b3..50b7a509a7d20 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -400,6 +400,13 @@ static cl::opt<bool> EnableEarlyExitVectorization(
cl::desc(
"Enable vectorization of early exit loops with uncountable exits."));
+static cl::opt<unsigned>
+ ChangeVectToCustomVF("change-vectorize-to-custom-VF", cl::init(0),
+ cl::Hidden, cl::desc("Change vectorize to custom VF"));
+static cl::opt<unsigned>
+ ChangeVectToCustomIC("change-vectorize-to-custom-IC", cl::init(0),
+ cl::Hidden, cl::desc("Change vectorize to custom IC"));
+
// Likelyhood of bypassing the vectorized loop because assumptions about SCEV
// variables not overflowing do not hold. See `emitSCEVChecks`.
static constexpr uint32_t SCEVCheckBypassWeights[] = {1, 127};
@@ -7365,7 +7372,18 @@ void LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC) {
CM.collectValuesToIgnore();
CM.collectElementTypesForWidening();
- FixedScalableVFPair MaxFactors = CM.computeMaxVF(UserVF, UserIC);
+ // Change User(VF/IC) to ChangeVect(VF/IC) if defined and User(VF/IC) is 0,
+ // used in computeMaxVF
+ auto CV2CVF = UserVF;
+ auto CV2CIC = UserIC;
+ if (ChangeVectToCustomVF > 0 && UserVF.isZero())
+ CV2CVF =
+ ElementCount::get(ChangeVectToCustomVF < 2 ? 2 : ChangeVectToCustomVF,
+ UserVF.isScalable());
+ if (ChangeVectToCustomIC > 0 && UserIC == 0)
+ CV2CIC = ChangeVectToCustomIC;
+
+ FixedScalableVFPair MaxFactors = CM.computeMaxVF(CV2CVF, CV2CIC);
if (!MaxFactors) // Cases that should not to be vectorized nor interleaved.
return;
@@ -11121,6 +11139,26 @@ bool LoopVectorizePass::processLoop(Loop *L) {
bool DisableRuntimeUnroll = false;
MDNode *OrigLoopID = L->getLoopID();
{
+ if (ChangeVectToCustomIC != 0 || ChangeVectToCustomVF != 0) {
+ LLVM_DEBUG(dbgs() << "LV: ChangeVectorizedToCustom (CustomVF:"
+ << ChangeVectToCustomVF
+ << ", CustomIC:" << ChangeVectToCustomIC
+ << ", UserVF:" << UserVF << ", UserIC:" << UserIC
+ << ") on (VF:" << VF.Width << ", IC:" << IC << "): ");
+ auto CVF = ElementCount::get(ChangeVectToCustomVF, VF.Width.isScalable());
+ if (ChangeVectToCustomVF == 0)
+ CVF = VF.Width;
+ std::optional<VectorizationFactor> MaybeVF = LVP.getProfitableVF(CVF);
+ if (MaybeVF) {
+ VF = *MaybeVF;
+ if (ChangeVectToCustomIC > 0)
+ IC = ChangeVectToCustomIC;
+ LLVM_DEBUG(dbgs() << "APPLIED (VF:" << VF.Width << ", IC:" << IC
+ << ")\n");
+ } else
+ LLVM_DEBUG(dbgs() << "INVALID\n");
+ }
+
using namespace ore;
if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");
``````````
</details>
https://github.com/llvm/llvm-project/pull/138709