[PATCH] D152834: A new code layout algorithm for function reordering [2/3]

Fri Jun 23 12:05:33 PDT 2023

spupyrev added a comment.

Thanks for teaching me how to measure the impact of instruction caches. While re-running the experiments with the new events, I realized that my earlier report was not using C^3 as the baseline. Instead, the numbers were measured on top of an improved code layout (referred to as hfsort+) used by BOLT, which is not relevant here; I apologize for the confusion.
Below are the details of the latest run on the same clang benchmarks, with and without huge pages. In addition to comparing the new algorithm (CDS) to C^3, I'm also including the numbers relative to the "input" ordering that comes from the compiler. Here I'm building the binary with LTO and AutoFDO, but I observe similar numbers when using instrumentation counts or other sampling-based profiling approaches (e.g., CSSPGO).
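
For anyone who wants to collect the same counters, here is a minimal Python sketch that builds a `perf stat` command line. It assumes the measurements are taken with Linux `perf` on an Intel CPU that exposes these events; the event names are the ones listed in the tables, while the helper name, the repeat count, and the benchmark command are hypothetical placeholders and not the setup actually used for this run.

```python
import shlex

# Event names are copied from the tables in this comment; the helper name,
# the repeat count and the benchmark command below are hypothetical.
EVENTS = [
    "frontend_retired.l1i_miss",
    "icache_64b.iftag_stall",
    "frontend_retired.itlb_miss",
    "task-clock",
]


def perf_stat_argv(benchmark_cmd, events=EVENTS, repeats=10):
    """Build a `perf stat` argv that counts `events` over `repeats` runs."""
    return [
        "perf", "stat",
        "-e", ",".join(events),   # comma-separated event list
        "-r", str(repeats),       # repeat the run and report the variation
        "--", *benchmark_cmd,     # everything after `--` is the workload
    ]


# Placeholder workload: substitute the real benchmark invocation.
argv = perf_stat_argv(["./clang", "-c", "input.cpp"])
print(shlex.join(argv))
```

The command is only printed here, not executed; passing `argv` to `subprocess.run` would start the measurement on a machine where `perf` and these events are available.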

//No hugepages://

|                            | CDS       | C^3 (delta, %)               | input (delta, %)             |
| benchmark1                 |
| frontend_retired.l1i_miss  | 69351242  | 70805714 (**1.99 ± 0.14**)   | 73665990 (**5.88 ± 0.11**)   |
| icache_64b.iftag_stall     | 377880876 | 440763009 (**14.67 ± 0.32**) | 615372537 (**38.18 ± 0.41**) |
| frontend_retired.itlb_miss | 4917651   | 5823311 (**15.58 ± 0.42**)   | 8996999 (**45.14 ± 0.36**)   |
| task-clock                 | 5348      | 5393 (**0.72 ± 0.33**)       | 5431 (**1.47 ± 0.24**)       |
| benchmark2                 |
| frontend_retired.l1i_miss  | 61681268  | 63522660 (**2.92 ± 0.27**)   | 64953229 (**5.01 ± 0.17**)   |
| icache_64b.iftag_stall     | 325869634 | 377176494 (**13.49 ± 0.43**) | 495816988 (**34.07 ± 0.38**) |
| frontend_retired.itlb_miss | 3814520   | 4502050 (**15.33 ± 0.66**)   | 7121213 (**47.04 ± 0.27**)   |
| task-clock                 | 8311      | 8338 (**0.32 ± 0.17**)       | 8363 (**0.62 ± 0.23**)       |

//With hugepages://

|                            | CDS       | C^3 (delta, %)               | input (delta, %)             |
| benchmark1                 |
| frontend_retired.l1i_miss  | 67951983  | 68463724 (**0.75 ± 0.16**)   | 70894569 (**4.25 ± 1.32**)   |
| icache_64b.iftag_stall     | 132528699 | 151086801 (**12.26 ± 0.26**) | 179977955 (**26.24 ± 0.84**) |
| frontend_retired.itlb_miss | 255677    | 322445 (**20.03 ± 0.41**)    | 524749 (**51.21 ± 1.04**)    |
| task-clock                 | 5287      | 5314 (**0.51 ± 0.30**)       | 5349 (**1.27 ± 0.28**)       |
| benchmark2                 |
| frontend_retired.l1i_miss  | 59593334  | 60726917 (**1.85 ± 0.34**)   | 60064530 (**0.97 ± 0.30**)   |
| icache_64b.iftag_stall     | 130194520 | 133511012 (**2.67 ± 1.03**)  | 146815089 (**11.41 ± 1.05**) |
| frontend_retired.itlb_miss | 207543    | 259727 (**20.07 ± 2.38**)    | 416369 (**50.05 ± 1.09**)    |
| task-clock                 | 8238      | 8266 (0.35 ± 0.38)           | 8276 (**0.45 ± 0.26**)       |
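
To sanity-check the delta columns, here is a minimal Python sketch that recomputes a relative reduction from the raw counts. The formula, `(baseline - cds) / baseline`, is an assumption: the comment does not spell out how the deltas or their ± intervals were derived. A single point estimate from the reported counts lands close to, but not exactly on, the reported values (roughly 14.3 versus the reported 14.67 for `icache_64b.iftag_stall` against C^3 without huge pages), which would be consistent with the reported deltas being aggregated over several runs.

```python
def reduction_pct(cds: int, baseline: int) -> float:
    """Percentage by which `cds` is lower than `baseline` (positive = CDS is better).

    Assumed formula; the source does not state how its deltas were computed.
    """
    return 100.0 * (baseline - cds) / baseline


# (cds, c^3, input) counts for benchmark1, copied from the two tables above.
BENCHMARK1 = {
    "no hugepages": {
        "icache_64b.iftag_stall": (377880876, 440763009, 615372537),
        "frontend_retired.itlb_miss": (4917651, 5823311, 8996999),
    },
    "hugepages": {
        "icache_64b.iftag_stall": (132528699, 151086801, 179977955),
        "frontend_retired.itlb_miss": (255677, 322445, 524749),
    },
}

for config, events in BENCHMARK1.items():
    for event, (cds, c3, inp) in events.items():
        print(
            f"{config:>12}  {event:<26}  "
            f"vs c^3: {reduction_pct(cds, c3):5.2f}%  "
            f"vs input: {reduction_pct(cds, inp):5.2f}%"
        )
```

Printing both configurations side by side makes it easy to see how the relative gain over each baseline changes once huge pages are enabled.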

Besides these numbers, I can only share one data point from a large production service, where CDS outperforms C^3 by around 0.25%-0.3% CPU (with huge pages and many other optimizations turned on). That said, I'd be careful about generalizing these wins to other binaries/benchmarks, as the result depends on a lot of factors.

Of course, using huge pages diminishes the impact of function reordering; yet it can still provide benefits.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D152834