<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Performance degradations of tests from eembc.1.1 suite on x86 Avoton-1.7 due to ‘SCEVExpander’ changes"
href="https://llvm.org/bugs/show_bug.cgi?id=23070">23070</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Performance degradations of tests from eembc.1.1 suite on x86 Avoton-1.7 due to ‘SCEVExpander’ changes
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Scalar Optimizations
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sergey.k.okunev@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>david.l.kreitzer@intel.com, denis.briltz@intel.com, elena.demikhovsky@intel.com, llvmbugs@cs.uiuc.edu, michael.m.kuperstein@intel.com, sanjoy@playingwithpointers.com, sergos.gnu@gmail.com, zia.ansari@intel.com
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>During our performance testing, regressions were detected on the autcor00,
aifftr01, and aiifft01 tests from the eembc.1.1 suite. Bisection showed that
LLVM revision r231018 is responsible for these degradations. Its commit message
is as follows:
commit caee94bbb4fb44971f594fe09fd61692dc4aa719
Author: Sanjoy Das <<a href="mailto:sanjoy@playingwithpointers.com">sanjoy@playingwithpointers.com</a>>
Date: Mon Mar 2 21:41:07 2015 +0000

    Revert some changes that were made to fix PR20680.

    This re-lands change r230921. r230921 was reverted because it broke a
    clang test; a checkin fixing the clang test will be commited shortly.

    Summary:
    As far as I can tell, the real bug causing the issue was fixed in
    r230533. SCEVExpander should mark an increment operation as nuw or nsw
    only if it can *prove* that the operation does not overflow. There
    shouldn't be any situation where we have to do something different
    because of no-wrap flags generated by SCEVExpander.

    Revert "IndVarSimplify: Allow LFTR to fire more often"

    This reverts commit 1ade0f0faa98877b688e0b9da58e876052c1e04e (SVN: 222213).

    Revert "IndVarSimplify: Don't let LFTR compare against a poison value"

    This reverts commit c0f2b8b528d8a37b0a1522aae90af649d6357eb5 (SVN: 217102).

    Reviewers: majnemer, atrick, spatel

    Differential Revision: <a href="http://reviews.llvm.org/D7979">http://reviews.llvm.org/D7979</a>

    git-svn-id: <a href="https://llvm.org/svn/llvm-project/llvm/trunk@231018">https://llvm.org/svn/llvm-project/llvm/trunk@231018</a> 91177308-0d34-0410-b5e6-96231b3b80d8
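
The summary's rule, that an increment may be marked nuw/nsw only when the
expander can prove it does not overflow, can be illustrated with a simple range
argument. The C sketch below is a hypothetical simplification: the function
name add_is_provably_nsw is invented here and is not SCEVExpander's logic or
API. For a loop that reaches the increment only while i < bound (signed) and
steps by a positive step, the largest incremented value is bound - 1 + step,
so under this argument nsw is provable when that value fits in an int32_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch, not LLVM code: may `i + step` be marked nsw for an
 * induction variable that only reaches the increment while `i < bound`
 * (signed compare)? Only positive steps are handled, so only overflow past
 * INT32_MAX has to be ruled out. */
bool add_is_provably_nsw(int32_t bound, int32_t step) {
    if (step <= 0)
        return false; /* outside this sketch: give up instead of guessing */

    /* In the loop body i <= bound - 1, so the largest incremented value is
     * bound - 1 + step. Compute it in 64 bits so the check cannot overflow. */
    int64_t largest_next = (int64_t)bound - 1 + (int64_t)step;
    return largest_next <= INT32_MAX;
}
```

Answering false when no proof is found is the conservative choice: a missing
flag only costs optimization opportunities, while a wrong nsw flag turns an
actual overflow into poison and can lead to miscompilation.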
The committed changes prevent some subsequent loop optimizations from firing,
which leads to additional instructions in the x86 loop code of the degraded
tests. Consider an example from eembc.1.1, with fragments of the IR dumps and
assembly for the revisions before (r231017) and after (r231018) the degradation.
Options: -O2 -ffast-math -m32 -mfpmath=sse -march=slm -fPIE -pie
1) eembc_1_1/autcor00
---------------------
The test contains a loop nest with an accumulator. In the r231018 build, the
second run of ‘Induction Variable Simplification’ no longer rewrites the loop
exit conditions (linear function test replacement, LFTR), so ‘Loop Strength
Reduction’ cannot transform the inner loop's exit condition any further. The
corresponding IR dump fragments for the two versions are as follows.
r231017:
-------
*** IR Dump After Induction Variable Simplification ***
for.body: ; preds = %for.body.lr.ph, %for.end
  %indvars.iv = phi i32 [ %1, %for.body.lr.ph ], [ %indvars.iv.next, %for.end ]
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %sub, 0
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end
for.end: ; preds = %for.cond3.for.end_crit_edge, %for.body
  ........
  %inc16 = add nuw nsw i32 %lag.032, 1
  %indvars.iv.next = add nsw i32 %indvars.iv, -1
  %exitcond34 = icmp ne i32 %lag.032, %3 !!
  br i1 %exitcond34, label %for.body, label %for.cond.for.end17_crit_edge
for.body6: ; preds = %for.body6.lr.ph, %for.body6
  ..........
  %inc = add nuw nsw i32 %i.029, 1
  %exitcond = icmp ne i32 %i.029, %indvars.iv !! inner loop cond. is transformed
  br i1 %exitcond, label %for.body6, label %for.cond3.for.end_crit_edge
*** IR Dump After Loop Strength Reduction ***
for.body6: ; preds = %for.body6.preheader, %for.body6
  %lsr.iv35 = phi i16* [ %InputData, %for.body6.preheader ], [ %scevgep, %for.body6 ]
  %lsr.iv = phi i32 [ %indvars.iv.in, %for.body6.preheader ], [ %lsr.iv.next, %for.body6 ]
  .........
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep = getelementptr i16, i16* %lsr.iv35, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0 !! further transformation
  br i1 %exitcond, label %for.end.loopexit, label %for.body6
vs.
r231018:
-------
*** IR Dump After Induction Variable Simplification ***
for.body: ; preds = %for.body.lr.ph, %for.end
  %indvars.iv = phi i32 [ %0, %for.body.lr.ph ], [ %indvars.iv.next, %for.end ]
  %lag.032 = phi i32 [ 0, %for.body.lr.ph ], [ %inc16, %for.end ]
  %sub = sub nsw i32 %conv2, %lag.032
  %cmp428 = icmp sgt i32 %sub, 0
  br i1 %cmp428, label %for.body6.lr.ph, label %for.end
for.end: ; preds = %for.cond3.for.end_crit_edge, %for.body
  .......
  %inc16 = add nuw nsw i32 %lag.032, 1
  %indvars.iv.next = add nsw i32 %indvars.iv, -1
  %exitcond34 = icmp ne i32 %inc16, %1 !! inner loop cond. is not transformed
  br i1 %exitcond34, label %for.body, label %for.cond.for.end17_crit_edge
*** IR Dump After Loop Strength Reduction ***
for.body6: ; preds = %for.body6.preheader, %for.body6
  %lsr.iv = phi i16* [ %InputData, %for.body6.preheader ], [ %scevgep, %for.body6 ]
  .........
  %inc = add nuw nsw i32 %i.029, 1
  %scevgep = getelementptr i16, i16* %lsr.iv, i32 1
  %exitcond = icmp eq i32 %indvars.iv, %inc !!
  br i1 %exitcond, label %for.end.loopexit, label %for.body6
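
To make the difference between the two dumps concrete, here is a hypothetical
C sketch of the kind of loop nest the IR suggests; it is not the EEMBC source,
and all function and parameter names are invented for illustration. The outer
loop runs over lag, and the inner loop accumulates
(data[i] * data[i + lag]) >> scale, matching the movswl/imul/sar/add sequence
in the assembly listings. The inner loop is written in the two shapes the dumps
end up with: inner_countdown counts a trip counter down to zero like the
r231017 LSR output (%lsr.iv.next compared with 0), while inner_countup compares
the incremented index against the bound like r231018
(%exitcond = icmp eq i32 %indvars.iv, %inc).

```c
#include <stdint.h>

/* Hypothetical sketch, not the EEMBC autcor00 source.
 * Note: `>>` on a negative int is implementation-defined in C; GCC and Clang
 * use an arithmetic shift, which matches the `sar` in the listings. */

/* r231017-like shape: the pointer walks the data and a separate trip counter
 * counts down, so the exit test is a compare against zero. Requires n > 0. */
int32_t inner_countdown(const int16_t *p, int32_t lag, int32_t n, int scale) {
    int32_t acc = 0;
    int32_t remaining = n;            /* role of %lsr.iv */
    do {
        acc += ((int32_t)p[0] * p[lag]) >> scale;
        p++;                          /* role of %scevgep: next i16 element */
    } while (--remaining != 0);       /* decrement, then test against zero */
    return acc;
}

/* r231018-like shape: the index counts up and the incremented value is
 * compared against the trip count. Requires n > 0. */
int32_t inner_countup(const int16_t *data, int32_t lag, int32_t n, int scale) {
    int32_t acc = 0;
    int32_t i = 0;
    do {
        acc += ((int32_t)data[i] * data[i + lag]) >> scale;
        i++;                          /* role of %inc */
    } while (i != n);                 /* compare incremented index with bound */
    return acc;
}

typedef int32_t (*inner_fn)(const int16_t *, int32_t, int32_t, int);

/* Outer loop over lags. The trip count n - lag mirrors
 * `%sub = sub nsw i32 %conv2, %lag.032`, and the `count > 0` check mirrors
 * `%cmp428 = icmp sgt i32 %sub, 0`. */
void autocorrelation(const int16_t *data, int32_t n, int32_t nlags, int scale,
                     int32_t *out, inner_fn inner) {
    for (int32_t lag = 0; lag < nlags; lag++) {
        int32_t count = n - lag;
        out[lag] = count > 0 ? inner(data, lag, count, scale) : 0;
    }
}
```

Both shapes compute the same sums; only the exit test differs. The count-down
shape keeps a single counter live instead of an index plus a bound, which on
32-bit x86 is consistent with the extra stack reload (mov 0x8(%esp),%ebp) and
the separate cmp in the r231018 listing.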
The inner loop generated by the revision before the degradation (r231017) is
better: it has fewer instructions (8 vs. 10), and one iteration is 1 clock
shorter. The corresponding assembly for the inner loop is as follows.
r231017:
-------
0xf7723c00 26 2787 movswl (%edx),%edi !!
0xf7723c03 27 228 movswl (%edx,%ebx,2),%esi !!
0xf7723c07 28 2614 add $0x2,%edx
0xf7723c0a 29 117 imul %edi,%esi
0xf7723c0d 30 3498 sar %cl,%esi
0xf7723c0f 31 3245 add %esi,%eax
0xf7723c11 32 2636 add $0xffffffff,%ebp !!
0xf7723c14 33 136 jne f7723c00 <fxpAutoCorrelation+0x50>
vs.
r231018:
-------
0xf7736c10 27 1816 mov 0x8(%esp),%ebp !! additional fill-instr.
0xf7736c14 28 934 movswl (%edx),%esi
0xf7736c17 29 1874 add $0x1,%edi !! add is operand of 'cmp'
0xf7736c1a 30 879 movswl (%edx,%ebp,2),%ebp !! + 1 clock in the loop
0xf7736c1e 31 1868 add $0x2,%edx
0xf7736c21 32 924 imul %esi,%ebp
0xf7736c24 33 2417 sar %cl,%ebp
0xf7736c26 34 3499 add %ebp,%eax
0xf7736c28 35 2084 cmp %edi,%ebx !! add -> cmp instead of --i
0xf7736c2a 36 917 jne f7736c10 <fxpAutoCorrelation+0x60>
Okunev Sergey,
Software Engineer
Intel Compiler Team</pre>
</div>
</p>
</body>
</html>