[PATCH] D152840: [ELF] A new code layout algorithm for function reordering [3a/3]

Sun Sep 17 16:48:39 PDT 2023

MaskRay added a comment.

Thanks for the update.

https://maskray.me/blog/2021-08-08-toolchain-testing#the-test-checks-at-the-wrong-layer We need some unittests in llvm/. I created D159527 <https://reviews.llvm.org/D159527>

================
Comment at: lld/ELF/CMakeLists.txt:75
   Support
+  TransformUtils
   TargetParser
----------------
move after TargetParser for an alphabetical order.

================
Comment at: lld/ELF/CallGraphSort.cpp:9
 ///
-/// Implementation of Call-Chain Clustering from: Optimizing Function Placement
-/// for Large-Scale Data-Center Applications
-/// https://research.fb.com/wp-content/uploads/2017/01/cgo2017-hfsort-final1.pdf
-///
-/// The goal of this algorithm is to improve runtime performance of the final
-/// executable by arranging code sections such that page table and i-cache
-/// misses are minimized.
-///
-/// Definitions:
-/// * Cluster
-///   * An ordered list of input sections which are laid out as a unit. At the
-///     beginning of the algorithm each input section has its own cluster and
-///     the weight of the cluster is the sum of the weight of all incoming
-///     edges.
-/// * Call-Chain Clustering (C³) Heuristic
-///   * Defines when and how clusters are combined. Pick the highest weighted
-///     input section then add it to its most likely predecessor if it wouldn't
-///     penalize it too much.
-/// * Density
-///   * The weight of the cluster divided by the size of the cluster. This is a
-///     proxy for the amount of execution time spent per byte of the cluster.
-///
-/// It does so given a call graph profile by the following:
-/// * Build a weighted call graph from the call graph profile
-/// * Sort input sections by weight
-/// * For each input section starting with the highest weight
-///   * Find its most likely predecessor cluster
-///   * Check if the combined cluster would be too large, or would have too low
-///     a density.
-///   * If not, then combine the clusters.
-/// * Sort non-empty clusters by density
+/// The file is responsible for sorting sections using the profile data by
+/// placing frequently executed code sections together. The goal of the
----------------
using LLVM call graph profile data ...

================
Comment at: lld/ELF/CallGraphSort.cpp:12
+/// placement is to improve the runtime performance of the final executable by
+/// arranging code sections such that page table and i-cache misses are reduced.
 ///
----------------
TLB misses and i-cache misses

================
Comment at: lld/ELF/CallGraphSort.cpp:277
+// Fill in necessary data for Cache-Directed Sort.
+static void buildCallGraph(std::vector<uint64_t> &funcSizes,
+                           std::vector<uint64_t> &funcCounts,
----------------
lld code prefers `SmallVector<X, 0>` for a smaller size and usually better performance.

D159526 will change these to use ArrayRef, enabling `SmallVector<X, 0>` at call sites.

================
Comment at: lld/ELF/CallGraphSort.cpp:279
+                           std::vector<uint64_t> &funcCounts,
+                           std::vector<std::pair<CallT, uint64_t>> &callCounts,
+                           std::vector<uint64_t> &callOffsets,
----------------
the nested pair feels unnatural to me. I updated D159526 to change the element type to a `std::tuple`

================
Comment at: lld/ELF/CallGraphSort.cpp:317
+    callCounts.push_back(std::make_pair(std::make_pair(from, to), weight));
+    callOffsets.push_back((funcSizes[from] + 1) / 2);
+    funcCounts[to] += weight;
----------------
This needs a comment: `// Assume that the jump is at the middle of the input section. The profile data does not contain jump offsets.`

================
Comment at: lld/ELF/CallGraphSort.cpp:332
+  std::vector<const InputSectionBase *> sections;
+  buildCallGraph(funcSizes, funcCounts, callCounts, callOffsets, sections);
+
----------------
buildCallGraph is called once and the conventional style just remove the extra function. 

================
Comment at: lld/ELF/CallGraphSort.h:17

+llvm::DenseMap<const InputSectionBase *, int> applyCDSort();
+
----------------
This does not need to be a public API. `Apply` seems less appropriate in the context. I created D159526 to rename this to `compute*`

================
Comment at: lld/ELF/Options.td:340

+defm use_cdsort: BB<"use-cdsort",
+    "Use cache-directed-sort to reorder functions in the binary (default)",
----------------
I feel that this abbreviated option name isn't intuitive. 

A similar `--call-graph-profile-sort` intentionally picks a long but descriptive name. I think `--use-cache-directed-sort` will be better.

I suggest we make `--no-use-cache-directed-sort` the default and then change the default in another patch.

================
Comment at: lld/ELF/Options.td:341
+defm use_cdsort: BB<"use-cdsort",
+    "Use cache-directed-sort to reorder functions in the binary (default)",
+    "Do not use cache-directed-sort to reorder functions in the binary">;
----------------
`Reorder input sections with cache-directed sort.`

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152840/new/

https://reviews.llvm.org/D152840