[PATCH] D85368: [llvm][CodeGen] Machine Function Splitter

Fri Aug 7 18:38:57 PDT 2020

snehasish added a comment.

> Could you share the details of the machine as well?

Sure, these were measured on a Lenovo P920 workstation with an Intel Skylake-based Xeon(R) Gold 6154 CPU.

> The improvements are well within noise.

For SPEC, the reported intrate improvement numbers are an average across 5 iterations. Note that SPEC binaries are tiny, so splitting may only improve code locality in some cases.

> While itlb reduction looks quite impressive, it doesn't seem to translate quite well to the runtime improvement.

It stands to reason that removing the itlb bottleneck will expose the next one :) We could dig deeper by looking into how the top-down profile changes with and without splitting.

> Did we see consistent >2% improvement with multiple runs? Please share the numbers.

We see consistent 2%+ improvements over FDO optimized binaries. The numbers reported are averaged across 10 runs. Here is the data for one such experiment, where 500 invocations of clang were executed and the overall end-to-end user time was measured. For completeness, I have included the data for a hot-cold-split optimized binary as well. Note that this particular experiment does not use ThinLTO for any of the builds, since I had some trouble running the hot-cold-split pass with ThinLTO enabled.

  |----------------------------------|----------------------|----------------|-----------|
  |                                  | User time in seconds ($ time run-commands.sh)     |
  |----------------------------------|----------------------|----------------|-----------|
  | Run #                            | FDO baseline         | Hot cold split | MFS       |
  |                                1 |               484.65 |          479.2 |    466.93 |
  |                                2 |                483.4 |         478.28 |    470.25 |
  |                                3 |               485.57 |         479.15 |    470.36 |
  |                                4 |               480.37 |         480.34 |    469.85 |
  |                                5 |               482.97 |         478.18 |    471.93 |
  |                                6 |               484.06 |         479.74 |    473.27 |
  |                                7 |               482.67 |         477.42 |    472.56 |
  |                                8 |               483.53 |         476.99 |    474.58 |
  |                                9 |               486.43 |         480.76 |    473.92 |
  |                               10 |               489.94 |         480.11 |    471.42 |
  |----------------------------------|----------------------|----------------|-----------|
  | 2 Tail Paired T-Test vs Baseline |                      |      0.0000636 | 0.0000006 |
  |----------------------------------|----------------------|----------------|-----------|
  | Average                          |              484.359 |        479.017 |   471.507 |
  |----------------------------------|----------------------|----------------|-----------|
  | % Change                         |                      |           1.10 |      2.65 |
  |----------------------------------|----------------------|----------------|-----------|
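
For anyone who wants to double-check the summary rows, here is a minimal sketch (not part of the patch) that recomputes the averages and the % change from the per-run times in the table. The helper names are made up for illustration, and it assumes the runs are paired by run number, as in the table. It stops at the paired t statistic; turning that into a p-value needs the t distribution's CDF (for example scipy.stats.ttest_rel), which this stdlib-only sketch leaves out.

```python
import math
import statistics

# User times in seconds, copied from the table above (runs 1 to 10).
fdo = [484.65, 483.4, 485.57, 480.37, 482.97, 484.06, 482.67, 483.53, 486.43, 489.94]
hcs = [479.2, 478.28, 479.15, 480.34, 478.18, 479.74, 477.42, 476.99, 480.76, 480.11]
mfs = [466.93, 470.25, 470.36, 469.85, 471.93, 473.27, 472.56, 474.58, 473.92, 471.42]


def percent_improvement(baseline, candidate):
    """Relative reduction in mean user time versus the baseline, in percent."""
    base_mean = statistics.mean(baseline)
    return 100.0 * (base_mean - statistics.mean(candidate)) / base_mean


def paired_t_statistic(baseline, candidate):
    """t statistic of a paired test on per-run differences (df = n - 1)."""
    diffs = [b - c for b, c in zip(baseline, candidate)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


for name, times in (("Hot cold split", hcs), ("MFS", mfs)):
    print(f"{name}: mean={statistics.mean(times):.3f}s "
          f"improvement={percent_improvement(fdo, times):.2f}% "
          f"t={paired_t_statistic(fdo, times):.2f}")
```

The % Change row is the reduction in mean user time relative to the FDO baseline, so a larger value means a faster binary.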

Here is the data for TLB and icache. Each event was collected independently along with instructions to ensure no multiplexing. The variance reported by perf was less than 1% for each event (often less than 0.5%).

  |-----------|--------------------------------------------|--------------------------------------------------------|
  |           | $ perf stat -r 3 -e frontend_retired.${EVENT}:u,instructions:u -- run-commands.sh                   |
  |-----------|--------------------------------------------|--------------------------------------------------------|
  |           | Machine Function Splitter                  | FDO Baseline                                           |
  |-----------|--------------------------------------------|--------------------------------------------------------|
  | EVENT     | Misses        | Instructions      | MPKI   | Misses         | Instructions      | MPKI   | % Change |
  | itlb_miss | 1,411,325,040 | 1,618,495,692,919 | 0.8720 |  2,066,003,373 | 1,618,097,715,534 | 1.2768 |    31.70 |
  | stlb_miss |   131,949,440 | 1,618,466,757,079 | 0.0815 |    195,471,938 | 1,618,061,281,016 | 0.1208 |    32.51 |
  | l1i_miss  | 9,678,255,804 | 1,618,479,987,914 | 5.9798 | 10,698,143,090 | 1,618,081,273,918 | 6.6116 |     9.56 |
  | l2_miss   |   434,287,963 | 1,618,443,723,597 | 0.2683 |    542,869,835 | 1,618,081,904,973 | 0.3355 |    20.02 |
  |-----------|--------------------------------------------|--------------------------------------------------------|
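
The MPKI and % Change columns follow directly from the raw counters: MPKI is misses per 1000 retired instructions, and % Change is the reduction in MPKI relative to the FDO baseline. A small sketch of that arithmetic, using the counter values from the table (the function and variable names are only illustrative):

```python
# (misses, instructions) pairs copied from the perf table above.
# Key order per event: Machine Function Splitter first, then FDO baseline.
counters = {
    "itlb_miss": ((1_411_325_040, 1_618_495_692_919), (2_066_003_373, 1_618_097_715_534)),
    "stlb_miss": ((131_949_440, 1_618_466_757_079), (195_471_938, 1_618_061_281_016)),
    "l1i_miss": ((9_678_255_804, 1_618_479_987_914), (10_698_143_090, 1_618_081_273_918)),
    "l2_miss": ((434_287_963, 1_618_443_723_597), (542_869_835, 1_618_081_904_973)),
}


def mpki(misses, instructions):
    """Misses per thousand instructions."""
    return misses / instructions * 1000.0


def percent_reduction(baseline, candidate):
    """Reduction of candidate relative to baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


for event, (mfs_counts, fdo_counts) in counters.items():
    mfs_mpki = mpki(*mfs_counts)
    fdo_mpki = mpki(*fdo_counts)
    print(f"{event}: MFS={mfs_mpki:.4f} FDO={fdo_mpki:.4f} "
          f"reduction={percent_reduction(fdo_mpki, mfs_mpki):.2f}%")
```

Small differences in the last digit against the table are expected, since the table's % Change appears to be derived from the rounded MPKI values.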

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D85368