[LLVMbugs] [Bug 22226] New: Performance degradations of several tests from eembc.1.1 and eembc.2.0 suites on x86 Avoton-1.7 due to ‘InstCombine?=: transformation A-B < 0 into A <=?UTF-8?Q? B’

Wed Jan 14 05:39:22 PST 2015

http://llvm.org/bugs/show_bug.cgi?id=22226

            Bug ID: 22226
           Summary: Performance degradations of several tests from
                    eembc.1.1 and eembc.2.0 suites on x86 Avoton-1.7  due
                    to ‘InstCombine: transformation A-B < 0 into A < B’
           Product: tools
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: opt
          Assignee: david.majnemer at gmail.com
          Reporter: sergey.k.okunev at gmail.com
                CC: elena.demikhovsky at intel.com, llvmbugs at cs.uiuc.edu,
                    michael.m.kuperstein at intel.com
    Classification: Unclassified

Created attachment 13685
  --> http://llvm.org/bugs/attachment.cgi?id=13685&action=edit
Initial IR for test eembc/autcor00

While our performance testing  regressions on tests autocor00, fbital00 from
eembc.1.1 suite and tests mp2enf32, mp4encod from eembc.2.0 suite were
detected. Bisect analysis showed LLVM revision 225034 is responsible for these
degradations. This revision enabled ‘InstCombine: transformation A-B < 0 into A
< B’ as described in comments.

commit 0f77ccd6bbe35543153dcd97a10fe552a5077e8d
Author: David Majnemer <david.majnemer at gmail.com>
Date:   Wed Dec 31 04:21:41 2014 +0000

    InstCombine: try to transform A-B < 0 into A < B

    We are allowed to move the 'B' to the right hand side if we an prove
    there is no signed overflow and if the comparison itself is signed.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@225034
91177308-0d34-0410-b5e6-96231b3b80d8

Submitted transformation prevents to enabling some following optimizations that
leads to additional operations in x86 loop code in degraded cases. Consider a
couple of examples from eembc.1.1 and eembc.2.0 with fragments of IR dumps and
asm codes for revisions before (r225031) and after degradations (r225034).

1) eembc_1_1/autcor00
---------------------
There is nested loop region with accumulator in the test. Initial IR dump of
the loop is in attachment (autcor00_0.ll).
On the third pass of ‘Combine redundant instructions’ considered transformation
is applied to condition of inner loop rounding branch in outer loop in revision
225034.

r225031
-------
*** IR Dump After Combine redundant instructions ***
………………………
for.body:                                         ; preds = %for.body.lr.ph,
%for.end
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %sub, 0                       !! initial cmp
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end

for.body6.lr.ph:                                  ; preds = %for.body
  br label %for.body6

vs. 

r225034 
-------
*** IR Dump After Combine redundant instructions ***
………………………
for.body:                                         ; preds = %for.body.lr.ph,
%for.end
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %conv2, %lag.032              !! transformed cmp   
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end

for.body6.lr.ph:                                  ; preds = %for.body
  br label %for.body6
……………………………..

After that ‘Induction Variable Simplification’ optimization is applied to the
loop in r225031 case (back edge branch condition is transformed to ‘ne’) and is
not applied in r225034. And resultant code of the loop of version before
degradation (r225031) is obtained more optimal – less instructions and the
length of loop iteration is less by 1 clock. Corresponding IR dumps and asm
codes of inner loop are the following.  

r225031
-------
*** IR Dump After Induction Variable Simplification ***
for.body6:                                        ; preds = %for.body6.lr.ph,
%for.body6
  %Accumulator.030 = phi i32 [ 0, %for.body6.lr.ph ], [ %add11, %for.body6 ] 
  %i.029 = phi i32 [ 0, %for.body6.lr.ph ], [ %inc, %for.body6 ] 
  %arrayidx = getelementptr inbounds i16* %InputData, i32 %i.029
  %2 = load i16* %arrayidx, align 2, !tbaa !2
  %conv7 = sext i16 %2 to i32 
  %add = add nsw i32 %i.029, %lag.032
  %arrayidx8 = getelementptr inbounds i16* %InputData, i32 %add 
  %3 = load i16* %arrayidx8, align 2, !tbaa !2 
  %conv9 = sext i16 %3 to i32 
  %mul = mul nsw i32 %conv9, %conv7 
  %shr = ashr i32 %mul, %conv10 
  %add11 = add nsw i32 %shr, %Accumulator.030
  %inc = add nsw i32 %i.029, 1
  %exitcond = icmp ne i32 %i.029, %indvars.iv      !! cmp and br were
transformed 
  br i1 %exitcond, label %for.body6, label %for.cond3.for.end_crit_edge
……………………………………………

Resultant asm code of the inner loop:

0xf77f6c00 27 200 movswl (%edx),%edi               !!
0xf77f6c03 28 2690 movswl (%edx,%ebx,2),%esi       !!
0xf77f6c07 29 110 add    $0x2,%edx
0xf77f6c0a 30 2712 imul   %edi,%esi
0xf77f6c0d 31 1485 sar    %cl,%esi
0xf77f6c0f 32 4722 add    %esi,%eax
0xf77f6c11 33 659 add    $0xffffffff,%ebp          !! 'i--' + jne
0xf77f6c14 34 2648 jne    f77f6c00 <fxpAutoCorrelation+0x50>

vs.

r225034 
-------
*** IR Dump After Induction Variable Simplification *** 
for.body6:                                        ; preds = %for.body6.lr.ph,
%for.body6
  %Accumulator.030 = phi i32 [ 0, %for.body6.lr.ph ], [ %add11, %for.body6 ]
  %i.029 = phi i32 [ 0, %for.body6.lr.ph ], [ %inc, %for.body6 ]
  %arrayidx = getelementptr inbounds i16* %InputData, i32 %i.029
  %0 = load i16* %arrayidx, align 2, !tbaa !2      
  %conv7 = sext i16 %0 to i32
  %add = add nsw i32 %i.029, %lag.032
  %arrayidx8 = getelementptr inbounds i16* %InputData, i32 %add 
  %1 = load i16* %arrayidx8, align 2, !tbaa !2 
  %conv9 = sext i16 %1 to i32 
  %mul = mul nsw i32 %conv9, %conv7 
  %shr = ashr i32 %mul, %conv10 
  %add11 = add nsw i32 %shr, %Accumulator.030
  %inc = add nsw i32 %i.029, 1
  %cmp4 = icmp slt i32 %inc, %sub                  !! cmp is not transformed
  br i1 %cmp4, label %for.body6, label %for.cond3.for.end_crit_edge

Resultant asm code of the inner loop:

0xf773dc10 28 20 mov    0x8(%esp),%ebx             !! additional fill instr.
0xf773dc14 29 2831 movswl (%edx),%edi
0xf773dc17 30 4 add    $0x1,%eax                   !! add - operand of 'cmp'
0xf773dc1a 31 2772 movswl (%edx,%ebx,2),%ebx       !! +1 clock in the loop
0xf773dc1e 32 3 add    $0x2,%edx
0xf773dc21 33 2744 imul   %edi,%ebx
0xf773dc24 34 1196 sar    %cl,%ebx
0xf773dc26 35 4320 add    %ebx,%ebp
0xf773dc28 36 249 cmp    %esi,%eax                 !! cmp + jl
0xf773dc2a 37 2862 jl     f773dc10 <fxpAutoCorrelation+0x60>

2) eembc_2_0/mp2enf32, func. 'dist1'
------------------------------------
Another case is the hottest loop from function ‘dist1’, test mp2enf32 from
eembc.2.0 suite. There are 16 if-statements such kind as ‘if ((x = a[0]  -
b[0])<0)  x = -x; y+= x;’ in the loop. Initial IR dump of the loop is in
attachment (mp2enf32_0_lp.ll).
And considered transformation is applied to these conditions with ‘<0’ inside
loop in r225034 case as seen in IR dump fragment after ‘Combine redundant
instructions’ phase.

r225031
-------
for.body:                                         ; preds = %for.cond
  %2 = load i8* %p1.0, align 1, !tbaa !2
  %conv = zext i8 %2 to i32
  %3 = load i8* %p2.0, align 1, !tbaa !2
  %conv3 = zext i8 %3 to i32
  %sub = sub nsw i32 %conv, %conv3
  %cmp4 = icmp slt i32 %sub, 0                    !! cmp is not transformed   
  br i1 %cmp4, label %if.then6, label %if.end

if.then6:                                         ; preds = %for.body
  %sub7 = sub nsw i32 0, %sub
  br label %if.end

if.end:                                           ; preds = %if.then6,
%for.body
  %v.0 = phi i32 [ %sub7, %if.then6 ], [ %sub, %for.body ]
  %add = add nsw i32 %s.0, %v.0
………………………………………………

vs.

r225034
-------
for.body:                                         ; preds = %for.cond
  %2 = load i8* %p1.0, align 1, !tbaa !2
  %conv = zext i8 %2 to i32
  %3 = load i8* %p2.0, align 1, !tbaa !2
  %conv3 = zext i8 %3 to i32
  %sub = sub nsw i32 %conv, %conv3
  %cmp4 = icmp ult i8 %2, %3                       !! cmp is transformed
  br i1 %cmp4, label %if.then6, label %if.end

if.then6:                                         ; preds = %for.body
  %sub7 = sub nsw i32 0, %sub
  br label %if.end

if.end:                                           ; preds = %if.then6,
%for.body
  %v.0 = phi i32 [ %sub7, %if.then6 ], [ %sub, %for.body ]
  %add = add nsw i32 %s.0, %v.0
…………………………………………

Then after ‘Simplify the CFG’ phase if-statement is converted to IR ‘select’
operation. In ‘Expand ISel Pseudo-instructions’ phase this pseudo-instruction
is expanded to ‘cmov’ with optimization which transforms sub and cmp
instructions into one ‘neg’ instruction in r225031 case and w/o optimization in
r225034 case.  Corresponding IR dump fragments of inner loop are the following. 

r225031
-------
for.body:                                         ; preds = %for.cond
  %2 = load i8* %p1.0, align 1, !tbaa !2           
  %conv = zext i8 %2 to i32 
  %3 = load i8* %p2.0, align 1, !tbaa !2
  %conv3 = zext i8 %3 to i32  
  %sub = sub nsw i32 %conv, %conv3
  %cmp4 = icmp slt i32 %sub, 0                    !!
  %sub7 = sub nsw i32 0, %sub 
  %sub7.sub = select i1 %cmp4, i32 %sub7, i32 %sub
  %add = add nsw i32 %s.0, %sub7.sub
…………………………………………

BB#2: derived from LLVM BB %for.body
    Predecessors according to CFG: BB#1 BB#3 
        %vreg0<def> = PHI %vreg153, <BB#1>, %vreg5, <BB#3>; GR32_NOSP:%vreg0
GR32:%vreg153,%vreg5
        %vreg1<def> = PHI %vreg154, <BB#1>, %vreg3, <BB#3>;
GR32:%vreg1,%vreg154,%vreg3
        %vreg2<def> = PHI %vreg154, <BB#1>, %vreg4, <BB#3>;
GR32:%vreg2,%vreg154,%vreg4
        %vreg155<def> = MOVZX32rm8 %vreg55, 1, %vreg0, -15, %noreg;
mem:LD1[%scevgep564](tbaa=<0x7d1a588>) GR32:%vreg155,%vreg55 GR32_NOSP:
%vreg0  
        %vreg156<def> = MOVZX32rm8 %vreg56, 1, %vreg0, -15, %noreg;
mem:LD1[%scevgep533](tbaa=<0x7d1a588>) GR32:%vreg156,%vreg56 GR32_NOSP:
%vreg0  
        %vreg157<def,tied1> = SUB32rr %vreg155<tied0>, %vreg156<kill>,
%EFLAGS<imp-def,dead>; GR32:%vreg157,%vreg155,%vreg156
        %vreg158<def,tied1> = NEG32r %vreg157<tied0>, %EFLAGS<imp-def>;
GR32:%vreg158,%vreg157
        %vreg159<def,tied1> = CMOVGE32rr %vreg157<tied0>, %vreg158<kill>,
%EFLAGS<imp-use>; GR32:%vreg159,%vreg157,%vreg158
        %vreg160<def,tied1> = ADD32rr %vreg159<tied0>, %vreg1,
%EFLAGS<imp-def,dead>; GR32:%vreg160,%vreg159,%vreg1
…………………………………………

vs.

r225034
-------
for.body:                                         ; preds = %for.cond
  %2 = load i8* %p1.0, align 1, !tbaa !2
  %conv = zext i8 %2 to i32 
  %3 = load i8* %p2.0, align 1, !tbaa !2 
  %conv3 = zext i8 %3 to i32 
  %sub = sub nsw i32 %conv, %conv3 
  %cmp4 = icmp ult i8 %2, %3                      !!
  %sub7 = sub nsw i32 0, %sub
  %sub7.sub = select i1 %cmp4, i32 %sub7, i32 %sub 
  %add = add nsw i32 %s.0, %sub7.sub
…………………………………………

BB#2: derived from LLVM BB %for.body
    Predecessors according to CFG: BB#1 BB#3
        %vreg0<def> = PHI %vreg173, <BB#1>, %vreg5, <BB#3>; GR32_NOSP:%vreg0
GR32:%vreg173,%vreg5
        %vreg1<def> = PHI %vreg174, <BB#1>, %vreg3, <BB#3>;
GR32:%vreg1,%vreg174,%vreg3
        %vreg2<def> = PHI %vreg174, <BB#1>, %vreg4, <BB#3>;
GR32:%vreg2,%vreg174,%vreg4
        %vreg175<def> = MOVZX32rm8 %vreg55, 1, %vreg0, 0, %noreg;
mem:LD1[%scevgep562](tbaa=<0x6f54588>) GR32:%vreg175,%vreg55 GR32_NOSP:%v
reg0     
        %vreg176<def> = MOVZX32rm8 %vreg56, 1, %vreg0, 0, %noreg;
mem:LD1[%scevgep531](tbaa=<0x6f54588>) GR32:%vreg176,%vreg56 GR32_NOSP:%v
reg0     
        %vreg177<def,tied1> = SUB32rr %vreg175<tied0>, %vreg176,
%EFLAGS<imp-def>; GR32:%vreg177,%vreg175,%vreg176 
        %vreg178<def,tied1> = NEG32r %vreg177<tied0>, %EFLAGS<imp-def,dead>;
GR32:%vreg178,%vreg177 
        %vreg179<def,tied1> = SUB32rr %vreg175<tied0>, %vreg176,
%EFLAGS<imp-def>; GR32:%vreg179,%vreg175,%vreg176 
        %vreg182<def,tied1> = CMOVB32rr %vreg179<tied0>, %vreg178<kill>,
%EFLAGS<imp-use>; GR32:%vreg182,%vreg179,%vreg178
……………………………………………

Thus, considered transformation leads to additional ‘sub’ as operand of each
‘cmov’ and negative performance impact on the register allocation of the loop
in r225034 version. Instrumented asm code fragments with 2 ‘cmov’s of the loop
are the following. 

r225031
-------
0xf7483410 67 112 movzbl -0xf(%ebx,%ebp,1),%edx
0xf7483415 68 71 movzbl -0xf(%ecx,%ebp,1),%esi
0xf748341a 69 124 mov    %eax,0x20(%esp)
0xf748341e 70 67 movzbl -0xe(%ebx,%ebp,1),%eax
0xf7483423 71 143 sub    %esi,%edx
0xf7483425 72 54 mov    %edx,%esi
0xf7483427 73 99 neg    %esi                       !! 'neg' is cond. and
operand of 'cmov'
0xf7483429 74 53 cmovl  %edx,%esi
0xf748342c 75 113 movzbl -0xe(%ecx,%ebp,1),%edx
0xf7483431 76 36 add    %edi,%esi                  !! 'add' is close to 'cmov'
w/o spill
0xf7483433 77 103 sub    %edx,%eax
0xf7483435 78 45 mov    %eax,%edx
0xf7483437 79 111 neg    %edx                      !! 'neg' is cond. and
operand of 'cmov'
0xf7483439 80 133 cmovl  %eax,%edx
0xf748343c 81 281 movzbl -0xd(%ebx,%ebp,1),%eax
0xf7483441 82 52 add    %esi,%edx                  !! 'add' is close to 'cmov'
w/o spill
..........

vs.

r225034
-------
0xf743c470 77 66 movzbl (%ecx,%ebx,1),%ebp
0xf743c474 78 97 movzbl (%eax,%ebx,1),%edx
0xf743c478 79 98 mov    %esi,(%esp)
0xf743c47b 80 93 mov    %ebp,%esi
0xf743c47d 81 73 sub    %edx,%esi
0xf743c47f 82 81 neg    %esi
0xf743c481 83 64 sub    %edx,%ebp                  !! additional sub is operand
of 'cmov'
0xf743c483 84 89 movzbl -0x1(%ecx,%ebx,1),%edx
0xf743c488 85 75 cmovb  %esi,%ebp
0xf743c48b 86 82 mov    %ebp,0x28(%esp)            !! spill instr. of 'cmov'
res.
0xf743c48f 87 85 mov    %edx,0x50(%esp)
0xf743c493 88 99 movzbl -0x1(%eax,%ebx,1),%edx
0xf743c498 89 72 mov    0x50(%esp),%ebp
0xf743c49c 90 100 mov    %ebp,%esi
0xf743c49e 91 72 sub    %edx,%esi
0xf743c4a0 92 94 neg    %esi
0xf743c4a2 93 89 sub    %edx,%ebp                  !! additional 'sub' is
operand of 'cmov'
0xf743c4a4 94 89 movzbl -0x2(%ecx,%ebx,1),%edx
0xf743c4a9 95 82 cmovb  %esi,%ebp
0xf743c4ac 96 83 mov    %ebp,0x50(%esp)            !! spill instr. of 'cmov'
res.
..........
0xf743c64d 226 174 add    0x50(%esp),%ebp          !! 'add' is in the end of
the loop
0xf743c651 227 163 mov    %ebp,%edi
0xf743c653 228 322 add    0x28(%esp),%edi          !! 'add' is in the end of
the loop

Okunev Sergey,
Software Engineer
Intel Compiler Team

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20150114/91c1bd39/attachment.html>