[LLVMbugs] [Bug 18274] New: Performance degradation of test with rgb processing on Atom1.8 due to enabling of MI scheduling for x86
bugzilla-daemon at llvm.org
Wed Dec 18 06:33:36 PST 2013
http://llvm.org/bugs/show_bug.cgi?id=18274
Bug ID: 18274
Summary: Performance degradation of test with rgb processing on
Atom1.8 due to enabling of MI scheduling for x86
Product: new-bugs
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Keywords: performance
Severity: normal
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: sergey.k.okunev at gmail.com
CC: atrick at apple.com, llvmbugs at cs.uiuc.edu
Classification: Unclassified
During our performance testing, a ~17% performance regression was detected in
an RGB -> YIQ conversion test on the Atom architecture. Bisection showed that
LLVM revision 192750 is responsible for the degradation. This revision enables
the MI scheduler for x86 and switches the SelectionDAG scheduling preference to
source order, as described in its commit message:
commit 6a7770b7ae43d784dec6f4d3c73ffed6166f3882
Author: Andrew Trick <atrick at apple.com>
Date: Tue Oct 15 23:33:07 2013 +0000
Enable MI Sched for x86.
This changes the SelectionDAG scheduling preference to source
order. Soon, the SelectionDAG scheduler can be bypassed saving
a nice chunk of compile time.
Performance differences that result from this change are often a
consequence of register coalescing. The register coalescer is far from
perfect. Bugs can be filed for deficiencies.
On x86 SandyBridge/Haswell, the source order schedule is often
preserved, particularly for small blocks.
Register pressure is generally improved over the SD scheduler's ILP
mode. However, we are still able to handle large blocks that require
latency hiding, unlike the SD scheduler's BURR mode. MI scheduler also
attempts to discover the critical path in single-block loops and
adjust heuristics accordingly.
The MI scheduler relies on the new machine model. This is currently
unimplemented for AVX, so we may not be generating the best code yet.
Unit tests are updated so they don't depend on SD scheduling heuristics.
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192750
91177308-0d34-0410-b5e6-96231b3b80d8
The new scheduling hurts register allocation for the test mentioned above: it
introduces additional spill/fill operations. To illustrate the issue, I prepared
a test case focused on the hot loop (see the attached tar file with the source
code, the assembly for both revisions, and the input data file). The additional
spills, marked with "!!", can be seen in the annotated LLVM assembly of the
test's inner loop below.
rev. 192749 (with one spill inside inner loop)
--------------------------------------------------------
.LBB0_4: # %for.body13
# Parent Loop BB0_3 Depth=1
# => This Inner Loop Header: Depth=2
movl 36(%esp), %esi # 4-byte Reload
movl 32(%esp), %ebx # 4-byte Reload
movzbl 1(%esi,%ecx), %edi
movzbl (%esi,%ecx), %edx
imull $38470, %edi, %ebp # imm = 0x9646
imull $19595, %edx, %eax # imm = 0x4C8B
addl %ebp, %eax
movzbl 2(%esi,%ecx), %ebp
movl %edi, 40(%esp) # 4-byte Spill !!
imull $7471, %ebp, %esi # imm = 0x1D2F
leal 32768(%esi,%eax), %eax
imull $-15119, %edi, %esi # imm = 0xFFFFFFFFFFFFC4F1
imull $32767, %edx, %edi # imm = 0x7FFF
shrl $16, %eax
addl %esi, %edi
movb %al, (%ebx,%ecx)
imull $-17648, %ebp, %eax # imm = 0xFFFFFFFFFFFFBB10
leal 32768(%eax,%edi), %eax
imull $13282, %edx, %esi # imm = 0x33E2
imull $-32767, 40(%esp), %edi # 4-byte Folded Reload
# imm = 0xFFFFFFFFFFFF8001
imull $19485, %ebp, %edx # imm = 0x4C1D
leal (%esi,%edi), %esi
leal 32768(%edx,%esi), %edx
shrl $16, %eax
shrl $16, %edx
leal 3(%ecx), %esi
movb %al, 1(%ebx,%ecx)
cmpl $361050, %esi # imm = 0x5825A
movb %dl, 2(%ebx,%ecx)
leal (%esi), %ecx
jne .LBB0_4
vs.
rev. 192750 (with five additional spills)
---------------------------------------------
.LBB0_4: # %for.body13
# Parent Loop BB0_3 Depth=1
# => This Inner Loop Header: Depth=2
movzbl (%ebp,%ecx), %eax
movzbl 2(%ebp,%ecx), %esi
movzbl 1(%ebp,%ecx), %ebx
imull $19595, %eax, %edx # imm = 0x4C8B
movl %edx, 40(%esp) # 4-byte Spill !!
imull $7471, %esi, %edx # imm = 0x1D2F
movl %edx, 36(%esp) # 4-byte Spill !!
imull $32767, %eax, %edx # imm = 0x7FFF
movl %eax, 24(%esp) # 4-byte Spill !!
imull $-17648, %esi, %eax # imm = 0xFFFFFFFFFFFFBB10
imull $38470, %ebx, %edi # imm = 0x9646
movl %eax, 28(%esp) # 4-byte Spill !!
imull $13282, 24(%esp), %eax # 4-byte Folded Reload
# imm = 0x33E2
movl %edx, 32(%esp) # 4-byte Spill !!
imull $-15119, %ebx, %edx # imm = 0xFFFFFFFFFFFFC4F1
movl %eax, 24(%esp) # 4-byte Spill !!
addl 40(%esp), %edi # 4-byte Folded Reload
movl 36(%esp), %eax # 4-byte Reload
imull $-32767, %ebx, %ebx # imm = 0xFFFFFFFFFFFF8001
leal 32768(%eax,%edi), %eax
addl 32(%esp), %edx # 4-byte Folded Reload
movl 28(%esp), %edi # 4-byte Reload
imull $19485, %esi, %esi # imm = 0x4C1D
leal 32768(%edi,%edx), %edx
movl 20(%esp), %edi # 4-byte Reload
addl 24(%esp), %ebx # 4-byte Folded Reload
shrl $16, %eax
movb %al, (%edi,%ecx)
leal 32768(%esi,%ebx), %eax
shrl $16, %edx
shrl $16, %eax
movb %dl, 1(%edi,%ecx)
movb %al, 2(%edi,%ecx)
leal 3(%ecx), %ecx
cmpl $361050, %ecx # imm = 0x5825A
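
For readers without the attachment, here is a minimal C sketch of what the inner loop in both listings appears to compute. It is a reconstruction from the assembly, not the attached t_rgb.c: the function name, signature and variable names are assumptions, and only the constants (the imull immediates, the 32768 rounding term and the 16-bit logical shift) are taken from the listings. Each pixel needs three loaded bytes and nine products, which is a lot of live values for the handful of general-purpose registers available with -m32.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reconstruction of the hot loop (name and signature are
   assumptions). Converts n_bytes bytes of packed 3-byte pixels using the
   16.16 fixed-point coefficients seen as imull immediates in the listings. */
void rgb_convert(const unsigned char *in, unsigned char *out, size_t n_bytes)
{
    size_t i;

    assert(in != NULL && out != NULL);

    for (i = 0; i + 2 < n_bytes; i += 3) {
        int r = in[i];      /* movzbl  (base,index) */
        int g = in[i + 1];  /* movzbl 1(base,index) */
        int b = in[i + 2];  /* movzbl 2(base,index) */

        /* Add 32768 (0.5 in 16.16) for rounding, then shift right by 16.
           The cast to unsigned mirrors the logical shift (shrl) in the asm. */
        unsigned c0 = (unsigned)(19595 * r + 38470 * g +  7471 * b + 32768) >> 16;
        unsigned c1 = (unsigned)(32767 * r - 15119 * g - 17648 * b + 32768) >> 16;
        unsigned c2 = (unsigned)(13282 * r - 32767 * g + 19485 * b + 32768) >> 16;

        /* Only the low byte is stored, as with the movb instructions. */
        out[i]     = (unsigned char)c0;
        out[i + 1] = (unsigned char)c1;
        out[i + 2] = (unsigned char)c2;
    }
}
```

In the listings the loop bound is 361050 bytes, which is 3 bytes for each of the 120350 pixels reported by the test.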
The compilation command line is as follows:
<path to build of rev. 192749 or rev. 192750>/install/bin/clang -ansi -O2
-ffast-math -msse2 -mfpmath=sse -m32 -static -march=atom -mtune=atom
./t_rgb.c -o./t_rgb.exe
Running the compiled binaries on an Atom processor gives the following results
(the input data file t_rgb.ppm must be in the run directory). The two builds
differ by ~17% in run time on the attached test.
for r.192749 -
time t_rgb.exe
>> Input data file = t_rgb.ppm
>> Number of pixels = 120350
real 0m31.344s
user 0m31.324s
sys 0m0.005s
for r.192750 -
time t_rgb.exe
>> Input data file = t_rgb.ppm
>> Number of pixels = 120350
real 0m36.809s
user 0m36.787s !! diff. is 17.4%
sys 0m0.002s
Best regards, Okunev Sergey
Intel Corporation