[PATCH] D152834: A new code layout algorithm for function reordering [2/3]

Sergey Pupyrev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 13 13:31:24 PDT 2023


spupyrev created this revision.
Herald added a subscriber: hiraditya.
Herald added a project: All.
spupyrev edited the summary of this revision.
spupyrev added reviewers: wenlei, hoy, wlei.
spupyrev added subscribers: Amir, maksfb.
spupyrev published this revision for review.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

We are bringing a new algorithm for function layout (reordering) based on the
call graph extracted from profile data. The algorithm is an improvement on top
of a known heuristic, C^3. It tries to co-locate hot functions and functions
that are frequently executed together in the resulting ordering. Unlike C^3, it
explores a larger search space and has an objective closely tied to the
performance of the instruction and i-TLB caches; hence the name, CDS =
Cache-Directed Sort. The algorithm can be used at the linking or post-linking
(e.g., BOLT) stage.

The algorithm shares some similarities with C^3 and with an approach for basic
block reordering (ext-tsp). It works with chains (ordered lists) of functions.
Initially, every chain consists of a single function. On every iteration, we
pick the pair of chains whose merging yields the biggest increase in the
objective, which is a weighted combination of frequency-based and
distance-based locality. That is, we try to co-locate hot functions (so they
can share cache lines) as well as functions that are frequently executed
together. The merging process stops when only one chain is left, or when
merging no longer improves the objective. In the latter case, the remaining
chains are sorted by density in decreasing order.
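The greedy scheme above can be sketched as follows. This is a simplified toy
model, not the actual CodeLayout.cpp implementation: the names
(cache_directed_sort, merge_gain, etc.) are illustrative, and the toy gain only
counts call weight between two chains, whereas the real objective models
i-cache and i-TLB behavior.

```python
from itertools import combinations

def cache_directed_sort(funcs, sizes, samples, calls):
    """funcs: function ids; sizes/samples: per-function dicts;
    calls: {(caller, callee): weight} from the profiled call graph."""
    # Initially, every chain is a single function.
    chains = [[f] for f in funcs]

    def merge_gain(a, b):
        # Toy objective: total call weight between the two chains
        # (a stand-in for the weighted frequency/distance locality score).
        return sum(w for (s, d), w in calls.items()
                   if (s in a and d in b) or (s in b and d in a))

    # On every iteration, merge the pair of chains with the biggest gain.
    while len(chains) > 1:
        a, b = max(combinations(chains, 2), key=lambda p: merge_gain(*p))
        if merge_gain(a, b) <= 0:
            break  # merging no longer improves the objective
        chains.remove(b)
        a.extend(b)

    # Remaining chains are sorted by density (samples / size), decreasing.
    def density(chain):
        return sum(samples[f] for f in chain) / sum(sizes[f] for f in chain)
    chains.sort(key=density, reverse=True)
    return [f for chain in chains for f in chain]
```

For example, with calls main->foo and foo->bar, the sketch first merges main
with foo, then appends bar, and finally places a never-called cold function
last by density.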

**Complexity**
We regularly apply the algorithm to large data-center binaries containing 10K+
(hot) functions, and it takes only a few seconds. For some extreme cases with
100K-1M nodes, the runtime is within minutes.

**Perf-impact**
We extensively tested the implementation on a benchmark of isolated binaries
and production services. The impact is measurable for "larger" binaries that
are front-end bound: the CPU-time improvement (on top of C^3) is in the range
of [0% .. 1%], resulting from a reduced i-TLB miss rate (by up to 20%) and a
reduced i-cache miss rate (by up to 5%).


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D152834

Files:
  llvm/include/llvm/Transforms/Utils/CodeLayout.h
  llvm/lib/Transforms/Utils/CodeLayout.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D152834.530969.patch
Type: text/x-patch
Size: 18060 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20230613/dfd258a8/attachment.bin>
