[LLVMbugs] [Bug 18274] New: Performance degradation of test with rgb processing on Atom1.8 due to enabling of MI scheduling for x86

Wed Dec 18 06:33:36 PST 2013

http://llvm.org/bugs/show_bug.cgi?id=18274

            Bug ID: 18274
           Summary: Performance degradation of test with rgb processing on
                    Atom1.8 due to enabling of MI scheduling for x86
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: sergey.k.okunev at gmail.com
                CC: atrick at apple.com, llvmbugs at cs.uiuc.edu
    Classification: Unclassified

During our performance testing, we detected a ~17% performance regression in
an RGB -> YIQ conversion test on the Atom architecture. Bisection showed that
LLVM revision 192750 is responsible for this degradation. This revision enabled
the MI scheduler for x86 and changed the SelectionDAG scheduling preference to
source order, as described in its commit message.

commit 6a7770b7ae43d784dec6f4d3c73ffed6166f3882
Author: Andrew Trick <atrick at apple.com>
Date:   Tue Oct 15 23:33:07 2013 +0000

    Enable MI Sched for x86.

    This changes the SelectionDAG scheduling preference to source
    order. Soon, the SelectionDAG scheduler can be bypassed saving
    a nice chunk of compile time.

    Performance differences that result from this change are often a
    consequence of register coalescing. The register coalescer is far from
    perfect. Bugs can be filed for deficiencies.

    On x86 SandyBridge/Haswell, the source order schedule is often
    preserved, particularly for small blocks.

    Register pressure is generally improved over the SD scheduler's ILP
    mode. However, we are still able to handle large blocks that require
    latency hiding, unlike the SD scheduler's BURR mode. MI scheduler also
    attempts to discover the critical path in single-block loops and
    adjust heuristics accordingly.

    The MI scheduler relies on the new machine model. This is currently
    unimplemented for AVX, so we may not be generating the best code yet.

    Unit tests are updated so they don't depend on SD scheduling heuristics.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@192750 91177308-0d34-0410-b5e6-96231b3b80d8

The new schedule hurts register allocation for this test: the inner loop now
contains more spill and reload code. To illustrate the issue, I prepared a test
case focused on the hot loop (see the attached tar file with the source code,
the assembly output for the two revisions, and the input data file). The
additional spills and reloads can be seen in the annotated LLVM assembly for the
inner loop of the test, as follows.

rev. 192749 (with one spill inside inner loop)
--------------------------------------------------------
.LBB0_4:                                # %for.body13
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movl    36(%esp), %esi          # 4-byte Reload
        movl    32(%esp), %ebx          # 4-byte Reload
        movzbl  1(%esi,%ecx), %edi
        movzbl  (%esi,%ecx), %edx
        imull   $38470, %edi, %ebp      # imm = 0x9646
        imull   $19595, %edx, %eax      # imm = 0x4C8B
        addl    %ebp, %eax
        movzbl  2(%esi,%ecx), %ebp
        movl    %edi, 40(%esp)          # 4-byte Spill              !! 
        imull   $7471, %ebp, %esi       # imm = 0x1D2F
        leal    32768(%esi,%eax), %eax
        imull   $-15119, %edi, %esi     # imm = 0xFFFFFFFFFFFFC4F1
        imull   $32767, %edx, %edi      # imm = 0x7FFF
        shrl    $16, %eax
        addl    %esi, %edi
        movb    %al, (%ebx,%ecx)
        imull   $-17648, %ebp, %eax     # imm = 0xFFFFFFFFFFFFBB10
        leal    32768(%eax,%edi), %eax
        imull   $13282, %edx, %esi      # imm = 0x33E2
        imull   $-32767, 40(%esp), %edi # 4-byte Folded Reload
                                        # imm = 0xFFFFFFFFFFFF8001
        imull   $19485, %ebp, %edx      # imm = 0x4C1D
        leal    (%esi,%edi), %esi
        leal    32768(%edx,%esi), %edx
        shrl    $16, %eax
        shrl    $16, %edx
        leal    3(%ecx), %esi
        movb    %al, 1(%ebx,%ecx)
        cmpl    $361050, %esi           # imm = 0x5825A
        movb    %dl, 2(%ebx,%ecx)
        leal    (%esi), %ecx
        jne     .LBB0_4

vs.
rev. 192750 (with 5 additional spills, 6 in total)
---------------------------------------------
.LBB0_4:                                # %for.body13
                                        #   Parent Loop BB0_3 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        movzbl  (%ebp,%ecx), %eax        
        movzbl  2(%ebp,%ecx), %esi       
        movzbl  1(%ebp,%ecx), %ebx       
        imull   $19595, %eax, %edx      # imm = 0x4C8B
        movl    %edx, 40(%esp)          # 4-byte Spill               !!
        imull   $7471, %esi, %edx       # imm = 0x1D2F
        movl    %edx, 36(%esp)          # 4-byte Spill               !! 
        imull   $32767, %eax, %edx      # imm = 0x7FFF
        movl    %eax, 24(%esp)          # 4-byte Spill               !! 
        imull   $-17648, %esi, %eax     # imm = 0xFFFFFFFFFFFFBB10
        imull   $38470, %ebx, %edi      # imm = 0x9646
        movl    %eax, 28(%esp)          # 4-byte Spill               !! 
        imull   $13282, 24(%esp), %eax  # 4-byte Folded Reload 
                                        # imm = 0x33E2
        movl    %edx, 32(%esp)          # 4-byte Spill               !!
        imull   $-15119, %ebx, %edx     # imm = 0xFFFFFFFFFFFFC4F1
        movl    %eax, 24(%esp)          # 4-byte Spill               !! 
        addl    40(%esp), %edi          # 4-byte Folded Reload
        movl    36(%esp), %eax          # 4-byte Reload
        imull   $-32767, %ebx, %ebx     # imm = 0xFFFFFFFFFFFF8001
        leal    32768(%eax,%edi), %eax
        addl    32(%esp), %edx          # 4-byte Folded Reload
        movl    28(%esp), %edi          # 4-byte Reload
        imull   $19485, %esi, %esi      # imm = 0x4C1D
        leal    32768(%edi,%edx), %edx
        movl    20(%esp), %edi          # 4-byte Reload
        addl    24(%esp), %ebx          # 4-byte Folded Reload
        shrl    $16, %eax
        movb    %al, (%edi,%ecx)
        leal    32768(%esi,%ebx), %eax
        shrl    $16, %edx
        shrl    $16, %eax
        movb    %dl, 1(%edi,%ecx)
        movb    %al, 2(%edi,%ecx)
        leal    3(%ecx), %ecx
        cmpl    $361050, %ecx           # imm = 0x5825A
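
For readers without the attachment, the arithmetic in both listings can be read
back as roughly the C loop below. This is only a sketch reconstructed from the
constants in the assembly, not the attached t_rgb.c: the function name
rgb_to_yiq, its parameters, and the variable names are assumptions, and it
assumes a 32-bit int.

```c
#include <assert.h>

/* Hypothetical reconstruction from the assembly constants; the real
   t_rgb.c is in the attachment and may differ. Assumes 32-bit int. */
static void rgb_to_yiq(const unsigned char *in, unsigned char *out, int nbytes)
{
    int i;

    assert(nbytes % 3 == 0);  /* three bytes per pixel */
    for (i = 0; i < nbytes; i += 3) {
        int r = in[i], g = in[i + 1], b = in[i + 2];
        /* 16.16 fixed point: add 0.5 (32768), then drop the fraction. */
        int y  = 19595 * r + 38470 * g +  7471 * b + 32768;
        int c1 = 32767 * r - 15119 * g - 17648 * b + 32768;
        int c2 = 13282 * r - 32767 * g + 19485 * b + 32768;

        /* Logical shift and byte store, as shrl/movb do in the listings. */
        out[i]     = (unsigned char)((unsigned)y  >> 16);
        out[i + 1] = (unsigned char)((unsigned)c1 >> 16);
        out[i + 2] = (unsigned char)((unsigned)c2 >> 16);
    }
}
```

The loop body has three loaded bytes feeding nine multiplies. On 32-bit x86,
with few general-purpose registers, issuing many multiplies before their sums
consume them (as in the rev. 192750 listing) leaves more values live at once.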

The compilation command line is as follows.
<path to build of rev. 192749 or rev. 192750>/install/bin/clang -ansi  -O2
-ffast-math -msse2 -mfpmath=sse -m32 -static  -march=atom -mtune=atom  
./t_rgb.c   -o./t_rgb.exe

Running the compiled binaries on an Atom processor gives the following results
(the input data file t_rgb.ppm must be in the run directory). The attached test
shows a ~17% performance difference between the two builds.

for r.192749 -

time t_rgb.exe
>>  Input data file = t_rgb.ppm
>>  Number of pixels = 120350

real    0m31.344s
user    0m31.324s
sys     0m0.005s

for r.192750 -

time t_rgb.exe 
>>  Input data file = t_rgb.ppm 
>>  Number of pixels = 120350 

real    0m36.809s 
user    0m36.787s                     !! diff. is 17.4% 
sys     0m0.002s

Best regards, Okunev Sergey
Intel Corporation
