[PATCH] D43256: [MBP] Move a latch block with conditional exit and multi predecessors to top of loop
Evgeniy via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jul 18 03:54:49 PDT 2019
ebrevnov added a comment.
Herald added subscribers: wuzish, MaskRay.
This change causes a 35% regression on a very simple loop that is the hot part of one of our internal microbenchmarks.
The loop accounts for 99% of total execution time and runs a reasonably large number of iterations.
All measurements were performed on Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz.
Here is the loop in question before the change:
25.41 | 0x30020e20: c5fa10 vmovss 12(%rcx,%rsi,4), %xmm0
5.12 | 0x30020e26: c5f82e vucomiss %xmm0, %xmm0
0.61 | 0x30020e2a: 7a1d jp 29 ; 0x30020e49
0.61 | 0x30020e2c: c5fa10 vmovss 12(%rax,%rsi,4), %xmm2
28.07 | 0x30020e32: c5f82e vucomiss %xmm1, %xmm0
4.10 | 0x30020e36: 7506 jne 6 ; 0x30020e3e
| 0x30020e38: 0f8bd0 jnp 720 ; 0x3002110e
0.41 | 0x30020e3e: c5fac2 vcmpless %xmm2, %xmm0, %xmm3
0.41 | 0x30020e43: c4e369 vblendvps %xmm3, %xmm0, %xmm2, %xmm0
35.04 | 0x30020e49: c5fa11 vmovss %xmm0, 12(%rdx,%rsi,4)
0.20 | 0x30020e4f: 48ffc6 incq %rsi
| 0x30020e52: 4839de cmpq %rbx, %rsi
| 0x30020e55: 72c9 jb -55 ; 0x30020e20
After the change:
| 0x30020a40: c5fa11 vmovss %xmm0, 12(%rdx,%rsi,4)
| 0x30020a46: 48ffc6 incq %rsi
27.25 | 0x30020a49: 4839de cmpq %rbx, %rsi
| 0x30020a4c: 0f836e jae -146 ; 0x300209c0
| 0x30020a52: c5fa10 vmovss 12(%rcx,%rsi,4), %xmm0
| 0x30020a58: c5f82e vucomiss %xmm0, %xmm0
| 0x30020a5c: 7ae2 jp -30 ; 0x30020a40
27.46 | 0x30020a5e: c5fa10 vmovss 12(%rax,%rsi,4), %xmm2
| 0x30020a64: c5f82e vucomiss %xmm1, %xmm0
| 0x30020a68: 7506 jne 6 ; 0x30020a70
| 0x30020a6a: 0f8bfd jnp 509 ; 0x30020c6d
| 0x30020a70: c5fac2 vcmpless %xmm2, %xmm0, %xmm3
23.36 | 0x30020a75: c4e369 vblendvps %xmm3, %xmm0, %xmm2, %xmm0
21.93 | 0x30020a7b: ebc3 jmp -61 ; 0x30020a40
So far I don't fully understand why this causes a 35% slowdown.
One note: by minimizing the number of taken branches we actually increase the number of branch instructions in the loop (from four to five in the listings above), which increases code size.
Moreover, we increase the number of branch instructions executed at runtime on the old fall-through path and don't decrease it on any of the other paths.
While these two facts may hurt performance, they don't explain a 35% slowdown. Something more complicated is happening behind the scenes.
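To make the branch accounting concrete, here is a rough sketch that tallies the branches per loop iteration for each path through the two listings. It is only a hand-built model: the path labels and helper names are made up for illustration, and the taken/not-taken flags are read off the disassembly above, not measured. It also ignores the loop-exit paths (jb/jae leaving the loop, jnp taken).

```python
# Each path is the ordered list of branch instructions executed in one trip
# around the loop, as (mnemonic, taken) pairs read off the listings above.
# Path labels are hypothetical names chosen for this illustration.
BEFORE = {
    "jp taken": [("jp", True), ("jb", True)],
    "jne taken": [("jp", False), ("jne", True), ("jb", True)],
    "jne and jnp not taken": [
        ("jp", False), ("jne", False), ("jnp", False), ("jb", True),
    ],
}

AFTER = {
    "jp taken": [("jae", False), ("jp", True)],
    "jne taken": [
        ("jae", False), ("jp", False), ("jne", True), ("jmp", True),
    ],
    "jne and jnp not taken": [
        ("jae", False), ("jp", False), ("jne", False), ("jnp", False),
        ("jmp", True),
    ],
}


def summarize(paths):
    """Map each path to (branches executed, branches taken) per iteration."""
    return {
        name: (len(branches), sum(1 for _, taken in branches if taken))
        for name, branches in paths.items()
    }


def static_branches(paths):
    """Count distinct branch instructions that appear in the loop body."""
    return len({mnemonic for branches in paths.values() for mnemonic, _ in branches})


if __name__ == "__main__":
    before, after = summarize(BEFORE), summarize(AFTER)
    print("static branches:", static_branches(BEFORE), "->", static_branches(AFTER))
    for name in BEFORE:
        print(f"{name}: executed/taken {before[name]} -> {after[name]}")
```

Under this model, the only path that gets fewer taken branches is the one where jp is taken; the other two paths each execute one extra branch (the new jmp) with the same number of taken branches.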
This example shows that the optimization is not always beneficial and needs a more sophisticated profitability heuristic.
Any ideas?
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D43256/new/
https://reviews.llvm.org/D43256