[PATCH] D39352: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain

Thu Oct 26 16:24:15 PDT 2017

Carrot created this revision.

The motivated test case is

struct ICmp {

  bool operator()(int *a, int *b) const { return *a < *b; }

};

typedef set<int *, ICmp> PSet;

void foo(PSet* pset, int* l) {

  pset->insert(l);

}

The actual hot code is in function _M_get_insert_unique_pos

   while (__x != 0)
  {
    __y = __x; 
    __comp = _M_impl._M_key_compare(__k, _S_key(__x));
    __x = __comp ? _S_left(__x) : _S_right(__x);
  }

LLVM generates following code:

.LBB1_2:                                # %while.body.i

1. =>This Inner Loop Header: Depth=1 movq    %rdx, %rbx movq    32(%rbx), %rcx movl    (%rcx), %ecx leaq    24(%rbx), %rdx leaq    16(%rbx), %rsi cmpl    %ecx, %eax cmovlq  %rsi, %rdx movq    (%rdx), %rdx testq   %rdx, %rdx jne     .LBB1_2

GCC generates:

  7.53 │4137d0:   mov    0x10(%rbx),%rax                                   
  0.01 │4137d4:   mov    $0x1,%edi                                               
       │4137d9:   test   %rax,%rax                                                                         
  0.95 │4137dc: ↓ je     4137f7 <BM_CallF(testing::benchmark::State&)+0xd7>     

  1.18 │4137de:   mov    %rax,%rbx                                                                

40.92 │4137e1:   mov    0x20(%rbx),%rax                                               
 34.82 │4137e5:   mov    (%rax),%ecx

       │4137e7:   cmp    %r14d,%ecx                                                                    
  1.83 │4137ea: ↑ jg     4137d0 <BM_CallF(testing::benchmark::State&)+0xb0>  

  8.05 │4137ec:   mov    0x18(%rbx),%rax                                   
  0.01 │4137f0:   xor    %edi,%edi                               
       │4137f2:   test   %rax,%rax                                                   
  0.50 │4137f5: ↑ jne    4137de <BM_CallF(testing::benchmark::State&)+0xbe>             

The gcc generated code is 15% faster.  The reason is in llvm generated code, there is a long dependence chain including cmov and later instructions, all of them must be executed one by one. In gcc generated code, the instructions after branch don't have data dependence on instructions before branch, so they can be executed in parallel. Even if the branch is highly unpredictable(like visiting RBTree in this case), if the dependence chain is long enough, the correctly predicted execution can still bring overall performance improvement.

We have observed several internal applications benefited from this optimization. All of them visit tree like data structures.

Before if-conversion, this patch finds out the length of the dependence chain (only the part that can be executed in parallel with code after branch) before cmp, if the length is longer than a threshold, then give up if-conversion.

With this patch, now I can get

// -O2
.LBB1_3:                                # %while.body.i

  1. =>This Inner Loop Header: Depth=1 movq    %rdx, %rbx movq    32(%rbx), %rcx movl    (%rcx), %ecx cmpl    %ecx, %eax jge     .LBB1_5
1. BB#4:                                 # %cond.true.i
  1. in Loop: Header=BB1_3 Depth=1 leaq    16(%rbx), %rdx jmp     .LBB1_6 .p2align        4, 0x90

.LBB1_5:                                # %cond.false.i

1. in Loop: Header=BB1_3 Depth=1 leaq    24(%rbx), %rdx

.LBB1_6:                                # %cond.end.i

1. in Loop: Header=BB1_3 Depth=1 movq    (%rdx), %rdx testq   %rdx, %rdx jne     .LBB1_3

// -O3
.LBB1_3:                                # %while.body.i

  1. =>This Inner Loop Header: Depth=1 movq    %rdx, %rbx movq    32(%rbx), %rcx movl    (%rcx), %ecx cmpl    %ecx, %eax jge     .LBB1_5
1. BB#4:                                 # %cond.true.i
  1. in Loop: Header=BB1_3 Depth=1 leaq    16(%rbx), %rdx movq    (%rdx), %rdx testq   %rdx, %rdx jne     .LBB1_3 jmp     .LBB1_7 .p2align        4, 0x90

.LBB1_5:                                # %cond.false.i

1. in Loop: Header=BB1_3 Depth=1 leaq    24(%rbx), %rdx movq    (%rdx), %rdx testq   %rdx, %rdx jne     .LBB1_3

Still worse than gcc generated code, but those are separate issues (code layout, extra lea), and it is already faster than cmov version.

https://reviews.llvm.org/D39352

Files:
  include/llvm/Analysis/TargetTransformInfo.h
  include/llvm/Analysis/TargetTransformInfoImpl.h
  include/llvm/CodeGen/BasicTTIImpl.h
  lib/Analysis/TargetTransformInfo.cpp
  lib/Transforms/Utils/SimplifyCFG.cpp
  test/Transforms/SimplifyCFG/X86/if-conversion.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D39352.120483.patch
Type: text/x-patch
Size: 11451 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20171026/ae8c781d/attachment.bin>