[PATCH] D43256: [MBP] Move a latch block with conditional exit and multi predecessors to top of loop

Thu Jul 18 03:54:49 PDT 2019

ebrevnov added a comment.
Herald added subscribers: wuzish, MaskRay.

This change causes a 35% regression on a very simple loop that is the hot part of our internal microbenchmark.
This loop takes 99% of the total execution time and has a reasonably large number of iterations.
All measurements were performed on Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz.

Here is the loop in question before the change:

  25.41  | 0x30020e20:   c5fa10 vmovss    12(%rcx,%rsi,4), %xmm0
    5.12  | 0x30020e26:   c5f82e vucomiss    %xmm0, %xmm0
    0.61  | 0x30020e2a:   7a1d   jp    29                              ; 0x30020e49
    0.61  | 0x30020e2c:   c5fa10 vmovss    12(%rax,%rsi,4), %xmm2
   28.07 | 0x30020e32:   c5f82e vucomiss    %xmm1, %xmm0
    4.10  | 0x30020e36:   7506   jne    6                              ; 0x30020e3e
             | 0x30020e38:   0f8bd0 jnp    720                            ; 0x3002110e
    0.41  | 0x30020e3e:   c5fac2 vcmpless    %xmm2, %xmm0, %xmm3
    0.41  | 0x30020e43:   c4e369 vblendvps    %xmm3, %xmm0, %xmm2, %xmm0
   35.04 | 0x30020e49:   c5fa11 vmovss    %xmm0, 12(%rdx,%rsi,4)
    0.20  | 0x30020e4f:   48ffc6 incq    %rsi
             | 0x30020e52:   4839de cmpq    %rbx, %rsi
             | 0x30020e55:   72c9   jb    -55                             ; 0x30020e20

After the change:

            | 0x30020a40:   c5fa11 vmovss    %xmm0, 12(%rdx,%rsi,4)
            | 0x30020a46:   48ffc6 incq    %rsi
  27.25 | 0x30020a49:   4839de cmpq    %rbx, %rsi
            | 0x30020a4c:   0f836e jae    -146                           ; 0x300209c0
            | 0x30020a52:   c5fa10 vmovss    12(%rcx,%rsi,4), %xmm0
            | 0x30020a58:   c5f82e vucomiss    %xmm0, %xmm0
            | 0x30020a5c:   7ae2   jp    -30                             ; 0x30020a40
  27.46 | 0x30020a5e:   c5fa10 vmovss    12(%rax,%rsi,4), %xmm2
            | 0x30020a64:   c5f82e vucomiss    %xmm1, %xmm0
            | 0x30020a68:   7506   jne    6                              ; 0x30020a70
            | 0x30020a6a:   0f8bfd jnp    509                            ; 0x30020c6d
            | 0x30020a70:   c5fac2 vcmpless    %xmm2, %xmm0, %xmm3
  23.36 | 0x30020a75:   c4e369 vblendvps    %xmm3, %xmm0, %xmm2, %xmm0
  21.93 | 0x30020a7b:   ebc3   jmp    -61                            ; 0x30020a40

So far I don't fully understand why this causes a 35% slowdown.
One note is that by minimizing the number of taken branches we actually increase the number of branch instructions in the loop (five instead of four), which increases code size.
Moreover, we increase the number of branch instructions executed at runtime for the old fall-through path and don't decrease it for any of the other paths.
While these two facts may negatively affect performance, they don't explain a 35% slowdown. Something more complicated is happening behind the scenes.
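
To make the branch accounting concrete, here is a rough sketch that tallies executed and taken branches per iteration for the two layouts. The branch sequences are read off the listings above; the path names are just labels I made up, and the model assumes a non-final iteration (the loop exit and the jnp that leaves the loop are not taken). It says nothing about which path is hot in the benchmark.

```python
# Each path lists the (mnemonic, taken?) pairs for the branch instructions
# executed in one loop iteration, read off the disassembly above.
# Assumption: the iteration is not the last one and does not leave the loop.
BEFORE = {
    "jp_taken":      [("jp", True), ("jb", True)],
    "jne_taken":     [("jp", False), ("jne", True), ("jb", True)],
    "jne_not_taken": [("jp", False), ("jne", False), ("jnp", False), ("jb", True)],
}
AFTER = {
    "jp_taken":      [("jae", False), ("jp", True)],
    "jne_taken":     [("jae", False), ("jp", False), ("jne", True), ("jmp", True)],
    "jne_not_taken": [("jae", False), ("jp", False), ("jne", False),
                      ("jnp", False), ("jmp", True)],
}


def count(path):
    """Return (executed, taken) branch counts for one iteration along a path."""
    executed = len(path)
    taken = sum(1 for _, was_taken in path if was_taken)
    return executed, taken


def summarize(layout):
    """Map each path name to its (executed, taken) branch counts."""
    return {name: count(path) for name, path in layout.items()}


if __name__ == "__main__":
    before, after = summarize(BEFORE), summarize(AFTER)
    for name in BEFORE:
        print(f"{name:14s} before={before[name]} after={after[name]}")
```

Under these assumptions only the jp-taken path saves a taken branch, while both other paths execute one extra branch instruction per iteration.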

This example shows that this optimization is not always beneficial and requires a more sophisticated profitability heuristic.

Any ideas?

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D43256/new/

https://reviews.llvm.org/D43256