[PATCH] D152834: A new code layout algorithm for function reordering [2/3]

Wed Jun 21 16:00:08 PDT 2023

davidxl added a comment.

In D152834#4439348 <https://reviews.llvm.org/D152834#4439348>, @spupyrev wrote:

> Here are my measurements on the clang binary (release_14), taken while compiling two large C++ files (benchmark1 and benchmark2). Negative values are improvements; bold values are statistically significant.
>
> 1. Using the algorithm in LLD (on top of D152840 <https://reviews.llvm.org/D152840>), with AutoFDO and //without// huge pages
>
> |               | base               | test               | delta(%)         |
> | benchmark1    |
> | task-clock    | 6440.07 ± 16.94    | 6373.95 ± 14.70    | **-1.01 ± 0.33** |
> | icache-misses | 218340412 ± 361175 | 210448621 ± 387645 | **-3.61 ± 0.16** |
> | itlb-misses   | 45609238 ± 129225  | 42629503 ± 46353   | **-6.38 ± 0.16** |
> | benchmark2    |
> | task-clock    | 9509.05 ± 21.74    | 9443.90 ± 30.05    | **-0.73 ± 0.29** |
> | icache-misses | 174525893 ± 294760 | 166852633 ± 253654 | **-4.38 ± 0.16** |
> | itlb-misses   | 36756578 ± 90162   | 35175447 ± 91296   | **-4.29 ± 0.29** |
>
> 2. Using the algorithm in BOLT (on top of D153039 <https://reviews.llvm.org/D153039>), with AutoFDO and //without// huge pages
>
> |               | base              | test              | delta(%)         |
> | benchmark1    |
> | task-clock    | 5398.95 ± 18.12   | 5366.50 ± 12.47   | **-0.52 ± 0.39** |
> | icache-misses | 86342101 ± 258598 | 85715814 ± 152068 | **-0.63 ± 0.19** |
> | itlb-misses   | 16891309 ± 40480  | 15543677 ± 57307  | **-7.99 ± 0.41** |
> | benchmark2    |
> | task-clock    | 8307.96 ± 15.84   | 8316.72 ± 15.50   | 0.10 ± 0.20      |
> | icache-misses | 67742141 ± 470515 | 65716219 ± 198470 | **-2.82 ± 0.20** |
> | itlb-misses   | 13591076 ± 99998  | 12462672 ± 70100  | **-8.18 ± 0.54** |
>
> 3. Using the algorithm in BOLT (on top of D153039 <https://reviews.llvm.org/D153039>), with AutoFDO and //with// huge pages
>
> |               | base              | test              | delta(%)          |
> | benchmark1    |
> | task-clock    | 5329.71 ± 38.16   | 5333.77 ± 17.21   | 0.31 ± 0.49       |
> | icache-misses | 89754736 ± 93088  | 90480531 ± 236996 | **0.69 ± 0.22**   |
> | itlb-misses   | 2279266 ± 15032   | 1973922 ± 13429   | **-13.45 ± 0.86** |
> | benchmark2    |
> | task-clock    | 8241.64 ± 16.92   | 8252.00 ± 13.55   | 0.15 ± 0.21       |
> | icache-misses | 69470543 ± 141858 | 68224372 ± 177928 | **-1.79 ± 0.32**  |
> | itlb-misses   | 1902566 ± 36542   | 2070742 ± 19558   | **9.27 ± 1.60**   |

The timing data with huge pages look as expected. The icache-miss and ITLB-miss data are puzzling, though: benchmark1 sees a slight icache-miss increase (0.69%) and a large ITLB-miss reduction (-13.45%), while benchmark2 shows the opposite (icache misses -1.79%, ITLB misses +9.27%). The slight increase in icache misses could be the result of additional conflict misses caused by the use of huge pages, but the increase in ITLB misses for benchmark2 is surprising.
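
For reference, the opposite directions in the huge-pages run can be sanity-checked straight from the reported means. The snippet below is only a rough sketch: `relative_delta_pct`, `direction`, and the `huge_pages` dictionary are illustrative names, not part of the patch, and the ratio of means computed here need not match the delta(%) column exactly, since it is not stated how that column was derived (it may have been computed per run).

```python
def relative_delta_pct(base: float, test: float) -> float:
    """Percent change of test relative to base; negative means fewer events."""
    return (test - base) / base * 100.0


def direction(delta: float) -> str:
    """Label the sign of a percent change."""
    return "increase" if delta > 0 else "decrease"


# Means (base, test) copied from the quoted table for configuration 3
# (BOLT, AutoFDO, with huge pages). The ± terms are ignored in this sketch.
huge_pages = {
    ("benchmark1", "icache-misses"): (89754736, 90480531),
    ("benchmark1", "itlb-misses"): (2279266, 1973922),
    ("benchmark2", "icache-misses"): (69470543, 68224372),
    ("benchmark2", "itlb-misses"): (1902566, 2070742),
}

for (bench, metric), (base, test) in huge_pages.items():
    delta = relative_delta_pct(base, test)
    print(f"{bench:<11} {metric:<14} {delta:+7.2f}%  ({direction(delta)})")
```

The signs agree with the table: benchmark1 trades a small icache-miss increase for an ITLB-miss decrease, and benchmark2 does the reverse.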

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152834/new/

https://reviews.llvm.org/D152834