[llvm-branch-commits] [BOLT] Impute missing trace fall-through (PR #145258)
via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Sun Jun 22 22:07:02 PDT 2025
llvmbot wrote:
@llvm/pr-subscribers-bolt
Author: Amir Ayupov (aaupov)
<details>
<summary>Changes</summary>
This PR adds *imputation of missing fall-throughs for top-of-stack (TOS)
entries*, addressing fall-through undercounting for branches that appear
at the top of the branch stack more often than they should.
# Branch stack sampling
Branch stack sampling (LBR, BRBE) provides a stack of (branch, target)
entries for the last N taken branches:
- Intel SKL+: 32,
- AMD BRS/LBRv2: 16,
- ARM BRBE: 8/16/32/64 depending on the implementation,
- arbitrary for trace-synthesized branch stacks (ARM ETM, Intel PT).
# DataAggregator representation/traces
The internal representation of a unit of the branch profile is a
_trace_, a triple of branch source, branch target/fall-through start,
and fall-through end/next branch:
https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L116-L118
The branch source and branch target match the branch stack entry.
The fall-through end is the branch start of the next stack entry.
Naturally, the top-of-stack (TOS) entry doesn't have a fall-through end.
Internally, this case is represented with the magic value `BR_ONLY` in
the `To` field:
https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L111
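For illustration, here is a minimal, self-contained sketch of how a branch
stack turns into traces (the simplified types and the newest-first entry
order are assumptions for the example, not BOLT's actual code):

```cpp
#include <cstdint>
#include <vector>

struct Entry { uint64_t From, To; };          // one (branch, target) stack entry
struct Trace { uint64_t Branch, From, To; };  // branch source, fall-through start/end

constexpr uint64_t BR_ONLY = UINT64_MAX;      // stand-in for "no fall-through end"

// Stack[0] is assumed to be the most recent taken branch (top of stack).
std::vector<Trace> makeTraces(const std::vector<Entry> &Stack) {
  std::vector<Trace> Traces;
  for (size_t I = 0; I < Stack.size(); ++I) {
    // The fall-through after entry I ends at the source of the next taken
    // branch in execution order, i.e. the more recent entry I-1. The TOS
    // entry (I == 0) has no such entry, so its fall-through end is unknown.
    uint64_t FallthroughEnd = I == 0 ? BR_ONLY : Stack[I - 1].From;
    Traces.push_back({Stack[I].From, Stack[I].To, FallthroughEnd});
  }
  return Traces;
}
```

The TOS trace keeps `BR_ONLY` as its fall-through end; that is exactly the
case the imputation in this PR targets.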
# Traces and fall-throughs
Traces combine a taken branch and the following fall-through into one record.
This brings efficiencies throughout profile pre-aggregation and profile
parsing and, most critically for BOLT, enables distinguishing
*call continuation* fall-throughs from the rest.
This information is used to set the weight of the call to call-continuation
fall-through CFG edge when the call and its continuation land in separate
basic blocks. The effect is non-trivial: it provides a measurable CPU win
over separate branch and non-differentiated fall-through records.
# Motivation: top-of-stack biased branches
Internal studies indicate that Intel LBR sampling is not completely
fair.
Intuitively, for any branch in the program, the probability of being
captured in a branch stack should be proportional to its execution
frequency, and, if captured, it should appear in any given entry with an
equal probability of ~1/N, where N is the size of the branch stack.
However, the observed distribution of the fraction of times a branch
appears at the top of the stack does not follow a binomial distribution.
Relaxing the assumption of a constant 1/N probability and modeling the
per-branch probability as a beta distribution yields a distribution that
more closely resembles the observed one.
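One way to formalize this (a sketch of the reasoning; the PR does not spell
out the exact model used in the internal studies): if a branch is captured in
$k$ samples and lands on the top of the stack independently with a fixed
probability $p = 1/N$, the number of its TOS appearances $T$ should follow

$$T \sim \mathrm{Binomial}\!\left(k, \tfrac{1}{N}\right).$$

Letting the per-branch probability vary instead, $p \sim \mathrm{Beta}(\alpha,
\beta)$, gives the overdispersed beta-binomial
$T \sim \mathrm{BetaBinomial}(k, \alpha, \beta)$, which has a heavier tail of
branches with unusually high TOS rates and tracks the observed distribution
more closely.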
This means that for branches that are biased to be at the top of the
stack, the fall-through path is undercounted in the profile.
# Implementation
TOS entries are easily identifiable by the `BR_ONLY` value in the `Trace.To`
field. To get an expected fall-through length for such traces:
- group traces by the first two fields (`Branch`, `From`),
- using the traces that do have a fall-through end, compute the *weighted
  average fall-through length*,
- use that length for the trace without a fall-through end.
Since traces are sorted by their three fields, the grouping happens
naturally: valid fall-through traces come first within a group, and the
`BR_ONLY` trace comes last. This allows computing the weighted average in a
single pass over the traces.
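Concretely, for a group of traces sharing (`Branch`, `From`), the imputed
fall-through length is (the notation is mine, not from the patch)

$$\hat{L} = \frac{\sum_i (\mathrm{To}_i - \mathrm{From}_i)\, c_i}{\sum_i c_i},$$

where the sum runs over the traces in the group that have a valid fall-through
end and $c_i$ is the taken count of trace $i$ (in the implementation, traces
ending in external code contribute their count but not their length). The
`BR_ONLY` trace then gets its fall-through end set to `From` plus $\hat{L}$,
rounded down by the integer division.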
Care is taken in two special cases:
- The fall-through start is an unconditional jump (unconditional branch,
  call, or return). In this case, the fall-through length is set to zero,
  which still allows extending it later to cover the call to
  call-continuation fall-through CFG edge.
- Traces in external code (DSO, JIT) are skipped.
# Effect
For a large binary, this change improves the profile quality metrics:
- no significant increase in mismatching traces (the primary raw profile
  quality metric),
- increased profile density: 599.4 -> 620.6 (density is computed over the
  fall-through ranges),
- reduced call graph (CG) flow imbalance: 11.56% -> 8.28%,
- CFG flow imbalance:
- weighted: 3.63% -> 3.34%,
- worst: 17.13% -> 22.05% (the only regression).
---
Full diff: https://github.com/llvm/llvm-project/pull/145258.diff
3 Files Affected:
- (modified) bolt/include/bolt/Core/MCPlusBuilder.h (+6)
- (modified) bolt/include/bolt/Profile/DataAggregator.h (+4)
- (modified) bolt/lib/Profile/DataAggregator.cpp (+62)
``````````diff
diff --git a/bolt/include/bolt/Core/MCPlusBuilder.h b/bolt/include/bolt/Core/MCPlusBuilder.h
index 804100db80793..50c5f5bcb9303 100644
--- a/bolt/include/bolt/Core/MCPlusBuilder.h
+++ b/bolt/include/bolt/Core/MCPlusBuilder.h
@@ -430,6 +430,12 @@ class MCPlusBuilder {
return Analysis->isIndirectBranch(Inst);
}
+ bool IsUnconditionalJump(const MCInst &Inst) const {
+ const MCInstrDesc &Desc = Info->get(Inst.getOpcode());
+ // barrier captures returns and unconditional branches
+ return Desc.isCall() || Desc.isBarrier();
+ }
+
/// Returns true if the instruction is memory indirect call or jump
virtual bool isBranchOnMem(const MCInst &Inst) const {
llvm_unreachable("not implemented");
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 98e4bba872846..00c6f56520fdc 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -499,6 +499,10 @@ class DataAggregator : public DataReader {
/// If \p FileBuildID has no match, then issue an error and exit.
void processFileBuildID(StringRef FileBuildID);
+ /// Infer missing fall-throughs for branch-only traces (LBR top-of-stack
+ /// entries).
+ void imputeFallThroughs();
+
/// Debugging dump methods
void dump() const;
void dump(const PerfBranchSample &Sample) const;
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 88229bb31a2ad..2fcc2561cc212 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -77,6 +77,11 @@ FilterPID("pid",
cl::Optional,
cl::cat(AggregatorCategory));
+static cl::opt<bool> ImputeTraceFallthrough(
+ "impute-trace-fall-through",
+ cl::desc("impute missing fall-throughs for branch-only traces"),
+ cl::Optional, cl::cat(AggregatorCategory));
+
static cl::opt<bool>
IgnoreBuildID("ignore-build-id",
cl::desc("continue even if build-ids in input binary and perf.data mismatch"),
@@ -529,6 +534,60 @@ void DataAggregator::parsePerfData(BinaryContext &BC) {
deleteTempFiles();
}
+void DataAggregator::imputeFallThroughs() {
+ if (Traces.empty())
+ return;
+
+ std::pair PrevBranch(Trace::EXTERNAL, Trace::EXTERNAL);
+ uint64_t AggregateCount = 0;
+ uint64_t AggregateFallthroughSize = 0;
+ uint64_t InferredTraces = 0;
+
+ // Helper map with whether the instruction is a call/ret/unconditional branch
+ std::unordered_map<uint64_t, bool> IsUncondJumpMap;
+ auto checkUncondJump = [&](const uint64_t Addr) {
+ auto isUncondJump = [&](auto MI) {
+ return MI && BC->MIB->IsUnconditionalJump(*MI);
+ };
+ auto It = IsUncondJumpMap.find(Addr);
+ if (It != IsUncondJumpMap.end())
+ return It->second;
+ BinaryFunction *Func = getBinaryFunctionContainingAddress(Addr);
+ if (!Func)
+ return false;
+ const uint64_t Offset = Addr - Func->getAddress();
+ if (Func->hasInstructions()
+ ? isUncondJump(Func->getInstructionAtOffset(Offset))
+ : isUncondJump(Func->disassembleInstructionAtOffset(Offset))) {
+ IsUncondJumpMap.emplace(Addr, true);
+ return true;
+ }
+ return false;
+ };
+
+ for (auto &[Trace, Info] : Traces) {
+ if (Trace.From == Trace::EXTERNAL)
+ continue;
+ std::pair CurrentBranch(Trace.Branch, Trace.From);
+ if (Trace.To == Trace::BR_ONLY) {
+ uint64_t InferredBytes = PrevBranch == CurrentBranch
+ ? AggregateFallthroughSize / AggregateCount
+ : !checkUncondJump(Trace.From);
+ Trace.To = Trace.From + InferredBytes;
+ outs() << "Inferred " << Trace << " " << InferredBytes << '\n';
+ ++InferredTraces;
+ } else {
+ if (CurrentBranch != PrevBranch)
+ AggregateCount = AggregateFallthroughSize = 0;
+ if (Trace.To != Trace::EXTERNAL)
+ AggregateFallthroughSize += (Trace.To - Trace.From) * Info.TakenCount;
+ AggregateCount += Info.TakenCount;
+ }
+ PrevBranch = CurrentBranch;
+ }
+ outs() << "Inferred " << InferredTraces << " traces\n";
+}
+
Error DataAggregator::preprocessProfile(BinaryContext &BC) {
this->BC = &BC;
@@ -541,6 +600,9 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
// Sort parsed traces for faster processing.
llvm::sort(Traces, llvm::less_first());
+ if (opts::ImputeTraceFallthrough)
+ imputeFallThroughs();
+
if (opts::HeatmapMode) {
if (std::error_code EC = printLBRHeatMap())
return errorCodeToError(EC);
``````````
</details>
https://github.com/llvm/llvm-project/pull/145258
More information about the llvm-branch-commits
mailing list