[LLVMbugs] [Bug 23070] New: Performance degradations of tests from eembc.1.1 suite on x86 Avoton-1.7 due to ‘SCEVExpander’ changes

Mon Mar 30 05:47:55 PDT 2015

https://llvm.org/bugs/show_bug.cgi?id=23070

            Bug ID: 23070
           Summary: Performance degradations of tests from eembc.1.1 suite
                    on x86 Avoton-1.7  due to ‘SCEVExpander’ changes
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Scalar Optimizations
          Assignee: unassignedbugs at nondot.org
          Reporter: sergey.k.okunev at gmail.com
                CC: david.l.kreitzer at intel.com, denis.briltz at intel.com,
                    elena.demikhovsky at intel.com, llvmbugs at cs.uiuc.edu,
                    michael.m.kuperstein at intel.com,
                    sanjoy at playingwithpointers.com, sergos.gnu at gmail.com,
                    zia.ansari at intel.com
    Classification: Unclassified

During our performance testing, regressions were detected on the tests
autcor00, aifftr01 and aiifft01 from the eembc.1.1 suite. Bisection showed that
LLVM revision 231018 is responsible for these degradations. The commit message
is the following.

commit caee94bbb4fb44971f594fe09fd61692dc4aa719
Author: Sanjoy Das <sanjoy at playingwithpointers.com>
Date:   Mon Mar 2 21:41:07 2015 +0000

    Revert some changes that were made to fix PR20680.

    This re-lands change r230921.  r230921 was reverted because it broke a
    clang test; a checkin fixing the clang test will be commited shortly.

    Summary:
    As far as I can tell, the real bug causing the issue was fixed in
    r230533.  SCEVExpander should mark an increment operation as nuw or nsw
    only if it can *prove* that the operation does not overflow.  There
    shouldn't be any situation where we have to do something different
    because of no-wrap flags generated by SCEVExpander.

    Revert "IndVarSimplify: Allow LFTR to fire more often"

    This reverts commit 1ade0f0faa98877b688e0b9da58e876052c1e04e (SVN: 222213).

    Revert "IndVarSimplify: Don't let LFTR compare against a poison value"

    This reverts commit c0f2b8b528d8a37b0a1522aae90af649d6357eb5 (SVN: 217102).

    Reviewers: majnemer, atrick, spatel

    Differential Revision: http://reviews.llvm.org/D7979

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@231018 91177308-0d34-0410-b5e6-96231b3b80d8

The submitted changes prevent some subsequent loop optimizations from being
applied, which leads to additional operations in the x86 loop code in the
degraded cases. Consider an example from eembc.1.1, with fragments of IR dumps
and asm code for the revisions before (r231017) and after (r231018) the
degradation.
Options: -O2 -ffast-math -m32 -mfpmath=sse -march=slm -fPIE -pie

1) eembc_1_1/autcor00
---------------------
The test contains a nested loop region with an accumulator. In the r231018
case, the second pass of ‘Induction Variable Simplification’ performs no
recurrence optimizations. As a result, ‘Loop Strength Reduction’ cannot apply
its further transformation of the loop condition. The corresponding loop IR
dump fragments for the two versions are the following.

r231017:
-------
*** IR Dump After Induction Variable Simplification ***
for.body:                               ; preds = %for.body.lr.ph, %for.end
  %indvars.iv = phi i32 [ %1, %for.body.lr.ph ], [ %indvars.iv.next, %for.end ]
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %sub, 0
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end

for.end:                      ; preds = %for.cond3.for.end_crit_edge, %for.body
........
  %inc16 = add nuw nsw i32 %lag.032, 1
  %indvars.iv.next = add nsw i32 %indvars.iv, -1
  %exitcond34 = icmp ne i32 %lag.032, %3              !!            
  br i1 %exitcond34, label %for.body, label %for.cond.for.end17_crit_edge

for.body6:                              ; preds = %for.body6.lr.ph, %for.body6
..........
  %inc = add nuw nsw i32 %i.029, 1
  %exitcond = icmp ne i32 %i.029, %indvars.iv        !! inner loop cond. is transformed
  br i1 %exitcond, label %for.body6, label %for.cond3.for.end_crit_edge

*** IR Dump After Loop Strength Reduction ***
for.body6:                          ; preds = %for.body6.preheader, %for.body6
  %lsr.iv35 = phi i16* [ %InputData, %for.body6.preheader ], [ %scevgep, %for.body6 ]
  %lsr.iv = phi i32 [ %indvars.iv.in, %for.body6.preheader ], [ %lsr.iv.next, %for.body6 ]
.........
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep = getelementptr i16, i16* %lsr.iv35, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0              !! further transformation
  br i1 %exitcond, label %for.end.loopexit, label %for.body6

vs.

r231018:
-------
*** IR Dump After Induction Variable Simplification ***
for.body:                               ; preds = %for.body.lr.ph, %for.end
  %indvars.iv = phi i32 [ %0, %for.body.lr.ph ], [ %indvars.iv.next, %for.end ]
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %sub, 0
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end

for.end:                    ; preds = %for.cond3.for.end_crit_edge, %for.body
.......
  %inc16 = add nuw nsw i32 %lag.032, 1
  %indvars.iv.next = add nsw i32 %indvars.iv, -1
  %exitcond34 = icmp ne i32 %inc16, %1                !! inner loop cond. is not transformed
  br i1 %exitcond34, label %for.body, label %for.cond.for.end17_crit_edge

*** IR Dump After Loop Strength Reduction ***
for.body6:                          ; preds = %for.body6.preheader, %for.body6
  %lsr.iv = phi i16* [ %InputData, %for.body6.preheader ], [ %scevgep, %for.body6 ]
.........
  %inc = add nuw nsw i32 %i.029, 1
  %scevgep = getelementptr i16, i16* %lsr.iv, i32 1
  %exitcond = icmp eq i32 %indvars.iv, %inc            !! 
  br i1 %exitcond, label %for.end.loopexit, label %for.body6

The resulting loop code for the version before the degradation (r231017) is
better: it has fewer instructions and each loop iteration is 1 clock shorter.
The corresponding asm code of the inner loop is the following.

r231017:
-------
0xf7723c00 26 2787 movswl (%edx),%edi           !!
0xf7723c03 27 228 movswl (%edx,%ebx,2),%esi    !!
0xf7723c07 28 2614 add    $0x2,%edx
0xf7723c0a 29 117 imul   %edi,%esi
0xf7723c0d 30 3498 sar    %cl,%esi
0xf7723c0f 31 3245 add    %esi,%eax
0xf7723c11 32 2636 add    $0xffffffff,%ebp     !!
0xf7723c14 33 136 jne    f7723c00 <fxpAutoCorrelation+0x50>

vs.

r231018:
-------
0xf7736c10 27 1816 mov    0x8(%esp),%ebp       !! additional fill-instr.
0xf7736c14 28 934 movswl (%edx),%esi
0xf7736c17 29 1874 add    $0x1,%edi            !! add is operand of 'cmp'
0xf7736c1a 30 879 movswl (%edx,%ebp,2),%ebp    !! + 1 clock in the loop
0xf7736c1e 31 1868 add    $0x2,%edx
0xf7736c21 32 924 imul   %esi,%ebp
0xf7736c24 33 2417 sar    %cl,%ebp
0xf7736c26 34 3499 add    %ebp,%eax
0xf7736c28 35 2084 cmp    %edi,%ebx            !! add -> cmp instead of --i
0xf7736c2a 36 917 jne    f7736c10 <fxpAutoCorrelation+0x60>
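
For readers without access to the benchmark source, the two inner-loop shapes
compared above can be sketched in C. This is only an illustration, not the
EEMBC code: the function names, parameter names and signatures below are
assumptions made up for this sketch. The first variant counts up and compares
the incremented index against a bound (roughly the shape left at r231018); the
second walks a pointer and counts the remaining iterations down to zero
(roughly the shape LSR produces at r231017).

```c
#include <stdint.h>

/* Hypothetical stand-ins for the benchmark kernel; names and signatures are
 * assumed for illustration only.
 * Note: right shift of a negative value is implementation-defined in C;
 * mainstream compilers emit an arithmetic shift (sar on x86). */

/* Count-up form: the exit test needs the incremented index and a bound,
 * i.e. an add followed by a separate cmp. */
static int32_t acc_count_up(const int16_t *data, int32_t n, int32_t lag,
                            int scale)
{
    int32_t acc = 0;
    for (int32_t i = 0; i < n - lag; i++)
        acc += ((int32_t)data[i] * (int32_t)data[i + lag]) >> scale;
    return acc;
}

/* Count-down form: the loop counter is decremented towards zero, so the
 * exit branch can reuse the flags set by the decrement itself. */
static int32_t acc_count_down(const int16_t *data, int32_t n, int32_t lag,
                              int scale)
{
    int32_t acc = 0;
    const int16_t *p = data;
    for (int32_t remaining = n - lag; remaining > 0; remaining--) {
        acc += ((int32_t)p[0] * (int32_t)p[lag]) >> scale;
        p++;
    }
    return acc;
}
```

Both variants compute the same sum. The only difference is whether the loop
exit needs a separate add plus cmp or can branch directly on the result of the
decrement.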

Okunev Sergey,
Software Engineer
Intel Compiler Team
