[all-commits] [llvm/llvm-project] 54e52a: [X86] Reduce znver3/4 LoopMicroOpBufferSize to pra...

Thu May 16 06:44:22 PDT 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 54e52aa5ebe68de122a3fe6064e0abef97f6b8e0
      https://github.com/llvm/llvm-project/commit/54e52aa5ebe68de122a3fe6064e0abef97f6b8e0
  Author: Simon Pilgrim <llvm-dev at redking.me.uk>
  Date:   2024-05-16 (Thu, 16 May 2024)

  Changed paths:
    M llvm/lib/Target/X86/X86ScheduleZnver3.td
    M llvm/lib/Target/X86/X86ScheduleZnver4.td
    M llvm/test/Transforms/LoopUnroll/X86/znver3.ll

  Log Message:
  -----------
  [X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340)

The znver3/4 scheduler models have previously associated the LoopMicroOpBufferSize with the maximum size of their op caches, and when this led to quadratic complexity issues this were reduced to a value of 512 uops, based mainly on compilation time and not its effectiveness on runtime performance.

>From a runtime performance POV, a large LoopMicroOpBufferSize leads to a higher number of loop unrolls, meaning the cpu has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we make use of the faster op cache rate (8/9 ops/cy).

This patch proposes we instead cap the size of the LoopMicroOpBufferSize based off the maximum rate from the op cache (znver3 = 8op/cy, znver4 = 9op/cy) and the branch misprediction penalty from the opcache (~12cy) as a estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but does try to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler vs the size of a similar naming buffer in the cpu.

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications