[llvm-branch-commits] [BOLT] Impute missing trace fall-through (PR #145258)

via llvm-branch-commits llvm-branch-commits at lists.llvm.org
Sun Jun 22 22:07:02 PDT 2025


llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-bolt

Author: Amir Ayupov (aaupov)

<details>
<summary>Changes</summary>

This PR adds *imputation of missing fall-throughs for top-of-stack (TOS)
trace entries*, addressing fall-through undercounting for branches that
appear at the top of the branch stack more often than they should.

# Branch stack sampling
Branch stack sampling (LBR, BRBE) provides a stack of (branch, target)
entries for the last N taken branches:
- Intel SKL+: 32,
- AMD BRS/LBRv2: 16,
- ARM BRBE: 8/16/32/64 depending on the implementation,
- arbitrary for trace-synthesized branch stacks (ARM ETM, Intel PT).
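For illustration only (addresses are made up, not from the PR), a single
sample with N = 4 might look like this, most recent entry first:

```
entry 0 (TOS): 0x401188 -> 0x401400   <- no later branch, so no fall-through end
entry 1:       0x4013f4 -> 0x401150
entry 2:       0x400f80 -> 0x4013c0
entry 3:       0x4013b0 -> 0x400f50
```

Each entry is a taken branch; the code executed between entry i+1's target
and entry i's branch source is the fall-through range attributed to entry
i+1.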

# DataAggregator representation/traces
The internal representation of a unit of the branch profile is a
_trace_, a triple of branch source, branch target/fall-through start,
and fall-through end/next branch:
https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L116-L118

Branch source and branch target match the branch stack entry.
Fall-through end is the branch start for the next stack entry.
Naturally, the top-of-stack (TOS) entry doesn't have a fall-through end.
Internally, this case is represented with the magic value `BR_ONLY` in the
`To` field:
https://github.com/llvm/llvm-project/blob/89c61449e60703e42c5f274ed41a21f3bc386cf0/bolt/include/bolt/Profile/DataAggregator.h#L111
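Paraphrased as a sketch (field names follow the diff below; see the linked
header for the authoritative definition, sentinel encodings omitted here):

```cpp
#include <cstdint>

// Hedged sketch of the trace record described above, not the verbatim header.
struct Trace {
  static const uint64_t EXTERNAL; // sentinel: address outside the profiled binary
  static const uint64_t BR_ONLY;  // sentinel: no fall-through end (TOS entry)

  uint64_t Branch; // taken-branch source address
  uint64_t From;   // branch target == fall-through start
  uint64_t To;     // fall-through end == source of the next taken branch,
                   // or BR_ONLY for a top-of-stack entry
};
```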

# Traces and fall-throughs
Traces combine taken branches and fall-throughs into one record.
This improves efficiency throughout profile pre-aggregation and profile
parsing, and, most critically for BOLT, it enables distinguishing
*call continuation* fall-throughs from the rest.

This information is used to set the weight of the call-to-continuation
fall-through CFG edge when the call and its continuation are in separate
basic blocks. The effect is non-trivial, providing a measurable CPU win
over separate branches and non-differentiated fall-throughs.
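As a hypothetical illustration (addresses and instructions are made up):

```
BB_A:  0x1000  call foo   ; block A ends at the call
BB_B:  0x1005  mov ...    ; call continuation starts block B
```

A trace whose `Branch` is the return inside `foo` and whose fall-through
starts at `From` = 0x1005 tells BOLT the count belongs to the A -> B
call-continuation edge; a standalone fall-through record starting at 0x1005
would not reveal how that address was reached.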

# Motivation: top-of-stack biased branches
Internal studies indicate that Intel LBR sampling is not completely
fair.

Intuitively, for any branch in the program, the probability of being
captured in a branch stack should be proportional to its execution
frequency, and if captured, it should appear in any given entry with
roughly equal probability ~1/N, where N is the size of the branch stack.
However, the observed distribution of the percentage of times an entry
appears at the top of the stack does not follow a binomial distribution.

Relaxing the assumption of a constant 1/N probability and instead modeling
the probability as a beta distribution yields a distribution that more
closely resembles the observed one.
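One way to state this modeling assumption (our notation, not from the PR):
if a branch is captured $k$ times and lands at the top of the stack $X$ of
those times, then

$$
X \mid p \sim \mathrm{Binomial}(k, p), \qquad p \sim \mathrm{Beta}(\alpha, \beta)
\;\Rightarrow\; X \sim \mathrm{BetaBinomial}(k, \alpha, \beta),
$$

where a constant $p = 1/N$ recovers the binomial case above.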

This means that for branches that are biased to be at the top of the
stack, the fall-through path is undercounted in the profile.

# Implementation
TOS entries are easily identifiable with a `BR_ONLY` value in `Trace.To`
field. To get an expected fall-through length for such traces:
- group traces by the first two fields (`Branch`, `From`),
- using traces that do have a fall-through end, compute the *weighted average
  fall-through length*,
- use that for the trace without a fall-through end.

Since traces are sorted by their three fields, the grouping happens
naturally: valid fall-through traces come first within a group, and the
`BR_ONLY` trace comes last. This allows computing the weighted average in a
single pass over the traces.
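As a made-up worked example: if a group with `Branch` = 0x100, `From` = 0x110
contains fall-through traces with `To` = 0x140 (count 30, length 48 bytes)
and `To` = 0x120 (count 10, length 16 bytes), the weighted average length is
(48*30 + 16*10) / (30 + 10) = 40 bytes, so the group's `BR_ONLY` trace gets
`To` = 0x110 + 40 = 0x138.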

Care is taken in two special cases:
- Fall-through start is an unconditional jump (unconditional branch,
  call, or return). In this case, the fall-through length is set to
  zero. This still allows extending it to cover the call-to-continuation
  fall-through CFG edge.
- Traces in external code (DSO, JIT) are skipped.

# Effect
For a large binary, this change improves the profile quality metrics:
- no significant increase in mismatching traces (the primary raw profile
  quality metric),
- increased profile density: 599.4 -> 620.6 (density is computed using
  the fall-through ranges),
- improved CG flow imbalance: 11.56% -> 8.28%,
- CFG flow imbalance:
  - weighted: 3.63% -> 3.34%,
  - worst: 17.13% -> 22.05% (the only regression).

---
Full diff: https://github.com/llvm/llvm-project/pull/145258.diff


3 Files Affected:

- (modified) bolt/include/bolt/Core/MCPlusBuilder.h (+6) 
- (modified) bolt/include/bolt/Profile/DataAggregator.h (+4) 
- (modified) bolt/lib/Profile/DataAggregator.cpp (+62) 


``````````diff
diff --git a/bolt/include/bolt/Core/MCPlusBuilder.h b/bolt/include/bolt/Core/MCPlusBuilder.h
index 804100db80793..50c5f5bcb9303 100644
--- a/bolt/include/bolt/Core/MCPlusBuilder.h
+++ b/bolt/include/bolt/Core/MCPlusBuilder.h
@@ -430,6 +430,12 @@ class MCPlusBuilder {
     return Analysis->isIndirectBranch(Inst);
   }
 
+  bool IsUnconditionalJump(const MCInst &Inst) const {
+    const MCInstrDesc &Desc = Info->get(Inst.getOpcode());
+    // barrier captures returns and unconditional branches
+    return Desc.isCall() || Desc.isBarrier();
+  }
+
   /// Returns true if the instruction is memory indirect call or jump
   virtual bool isBranchOnMem(const MCInst &Inst) const {
     llvm_unreachable("not implemented");
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 98e4bba872846..00c6f56520fdc 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -499,6 +499,10 @@ class DataAggregator : public DataReader {
   /// If \p FileBuildID has no match, then issue an error and exit.
   void processFileBuildID(StringRef FileBuildID);
 
+  /// Infer missing fall-throughs for branch-only traces (LBR top-of-stack
+  /// entries).
+  void imputeFallThroughs();
+
   /// Debugging dump methods
   void dump() const;
   void dump(const PerfBranchSample &Sample) const;
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 88229bb31a2ad..2fcc2561cc212 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -77,6 +77,11 @@ FilterPID("pid",
   cl::Optional,
   cl::cat(AggregatorCategory));
 
+static cl::opt<bool> ImputeTraceFallthrough(
+    "impute-trace-fall-through",
+    cl::desc("impute missing fall-throughs for branch-only traces"),
+    cl::Optional, cl::cat(AggregatorCategory));
+
 static cl::opt<bool>
 IgnoreBuildID("ignore-build-id",
   cl::desc("continue even if build-ids in input binary and perf.data mismatch"),
@@ -529,6 +534,60 @@ void DataAggregator::parsePerfData(BinaryContext &BC) {
   deleteTempFiles();
 }
 
+void DataAggregator::imputeFallThroughs() {
+  if (Traces.empty())
+    return;
+
+  std::pair PrevBranch(Trace::EXTERNAL, Trace::EXTERNAL);
+  uint64_t AggregateCount = 0;
+  uint64_t AggregateFallthroughSize = 0;
+  uint64_t InferredTraces = 0;
+
+  // Helper map with whether the instruction is a call/ret/unconditional branch
+  std::unordered_map<uint64_t, bool> IsUncondJumpMap;
+  auto checkUncondJump = [&](const uint64_t Addr) {
+    auto isUncondJump = [&](auto MI) {
+      return MI && BC->MIB->IsUnconditionalJump(*MI);
+    };
+    auto It = IsUncondJumpMap.find(Addr);
+    if (It != IsUncondJumpMap.end())
+      return It->second;
+    BinaryFunction *Func = getBinaryFunctionContainingAddress(Addr);
+    if (!Func)
+      return false;
+    const uint64_t Offset = Addr - Func->getAddress();
+    if (Func->hasInstructions()
+            ? isUncondJump(Func->getInstructionAtOffset(Offset))
+            : isUncondJump(Func->disassembleInstructionAtOffset(Offset))) {
+      IsUncondJumpMap.emplace(Addr, true);
+      return true;
+    }
+    return false;
+  };
+
+  for (auto &[Trace, Info] : Traces) {
+    if (Trace.From == Trace::EXTERNAL)
+      continue;
+    std::pair CurrentBranch(Trace.Branch, Trace.From);
+    if (Trace.To == Trace::BR_ONLY) {
+      uint64_t InferredBytes = PrevBranch == CurrentBranch
+                                   ? AggregateFallthroughSize / AggregateCount
+                                   : !checkUncondJump(Trace.From);
+      Trace.To = Trace.From + InferredBytes;
+      outs() << "Inferred " << Trace << " " << InferredBytes << '\n';
+      ++InferredTraces;
+    } else {
+      if (CurrentBranch != PrevBranch)
+        AggregateCount = AggregateFallthroughSize = 0;
+      if (Trace.To != Trace::EXTERNAL)
+        AggregateFallthroughSize += (Trace.To - Trace.From) * Info.TakenCount;
+      AggregateCount += Info.TakenCount;
+    }
+    PrevBranch = CurrentBranch;
+  }
+  outs() << "Inferred " << InferredTraces << " traces\n";
+}
+
 Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   this->BC = &BC;
 
@@ -541,6 +600,9 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   // Sort parsed traces for faster processing.
   llvm::sort(Traces, llvm::less_first());
 
+  if (opts::ImputeTraceFallthrough)
+    imputeFallThroughs();
+
   if (opts::HeatmapMode) {
     if (std::error_code EC = printLBRHeatMap())
       return errorCodeToError(EC);

``````````
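The new behavior is opt-in via the `-impute-trace-fall-through` flag added
above. For example, a typical aggregation invocation might look like the
following (illustrative; exact options depend on your setup):

```
perf2bolt -p perf.data -o profile.fdata --impute-trace-fall-through <binary>
```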

</details>


https://github.com/llvm/llvm-project/pull/145258

