[all-commits] [llvm/llvm-project] 54e52a: [X86] Reduce znver3/4 LoopMicroOpBufferSize to pra...
Simon Pilgrim via All-commits
all-commits at lists.llvm.org
Thu May 16 06:44:22 PDT 2024
Branch: refs/heads/main
Home: https://github.com/llvm/llvm-project
Commit: 54e52aa5ebe68de122a3fe6064e0abef97f6b8e0
https://github.com/llvm/llvm-project/commit/54e52aa5ebe68de122a3fe6064e0abef97f6b8e0
Author: Simon Pilgrim <llvm-dev at redking.me.uk>
Date: 2024-05-16 (Thu, 16 May 2024)
Changed paths:
M llvm/lib/Target/X86/X86ScheduleZnver3.td
M llvm/lib/Target/X86/X86ScheduleZnver4.td
M llvm/test/Transforms/LoopUnroll/X86/znver3.ll
Log Message:
-----------
[X86] Reduce znver3/4 LoopMicroOpBufferSize to practical loop unrolling values (#91340)
The znver3/4 scheduler models previously tied LoopMicroOpBufferSize to the maximum size of their op caches. When this led to quadratic complexity issues, it was reduced to 512 uops, a value chosen mainly for compilation time rather than for its effect on runtime performance.
From a runtime performance POV, a large LoopMicroOpBufferSize leads to more loop unrolling, meaning the CPU has to rely on the frontend decode rate (4 ins/cy max) for much longer to fill the op cache before looping begins and we can make use of the faster op cache rate (8/9 ops/cy).
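As a rough illustration of the rates quoted above, the sketch below compares how many cycles it takes to deliver one pass of an unrolled loop body through the decoders versus from the op cache. It is a toy model, not part of the patch: it assumes one instruction maps to one op, that each path sustains its peak rate, and it uses the earlier 512-uop limit as the body size. The function name `cycles_to_deliver` is hypothetical.

```python
import math

# Rates quoted in the log message.
DECODE_RATE = 4                                   # ins/cy, frontend decode (max)
OP_CACHE_RATE = {"znver3": 8, "znver4": 9}        # ops/cy from the op cache


def cycles_to_deliver(ops: int, rate_per_cycle: int) -> int:
    """Cycles needed to deliver `ops` at a sustained peak rate (toy model)."""
    return math.ceil(ops / rate_per_cycle)


# Assumption: loop body unrolled up to the earlier 512-uop limit,
# with one instruction per op.
body_ops = 512

first_pass = cycles_to_deliver(body_ops, DECODE_RATE)
print(f"first pass via decoders: {first_pass} cycles")
for cpu, rate in OP_CACHE_RATE.items():
    later = cycles_to_deliver(body_ops, rate)
    print(f"{cpu}: later passes from op cache: {later} cycles")
```

Under these assumptions the first pass costs roughly twice as many cycles as each later pass, which is the cost the log message attributes to heavy unrolling.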
This patch proposes that we instead cap LoopMicroOpBufferSize based on the maximum rate from the op cache (znver3 = 8 ops/cy, znver4 = 9 ops/cy) and the branch misprediction penalty from the op cache (~12cy), as an estimate of the useful number of ops we can unroll a loop by before mispredictions are likely to cause stalls. This isn't a perfect metric, but it tries to be closer to the spirit of how we use LoopMicroOpBufferSize in the compiler than the size of a similarly named buffer in the CPU.
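The heuristic described above amounts to a single multiplication, sketched here in Python. The rates and the ~12 cycle penalty come from the log message. The function name `loop_buffer_cap` is hypothetical, and this message does not quote the values actually written to the .td files, so the results are only what the stated figures imply.

```python
# Figures quoted in the log message.
OP_CACHE_RATE = {"znver3": 8, "znver4": 9}   # ops/cy from the op cache
MISPREDICT_PENALTY_CYCLES = 12               # "~12cy", an approximate figure


def loop_buffer_cap(cpu: str) -> int:
    """Ops deliverable from the op cache during one misprediction penalty."""
    return OP_CACHE_RATE[cpu] * MISPREDICT_PENALTY_CYCLES


for cpu in OP_CACHE_RATE:
    print(f"{cpu}: implied LoopMicroOpBufferSize cap = {loop_buffer_cap(cpu)}")
```

With these inputs the implied caps are 96 ops for znver3 and 108 ops for znver4, both well below the earlier 512.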