<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Performance degradation of test with rgb processing on Atom1.8 due to enabling of MI scheduling for x86"

   href="http://llvm.org/bugs/show_bug.cgi?id=18274">18274</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Performance degradation of test with rgb processing on Atom1.8 due to enabling of MI scheduling for x86

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Keywords</th>

          <td>performance

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>sergey.k.okunev@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>atrick@apple.com, llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

<pre>During our performance testing we detected a ~17% performance regression in an

RGB-to-YIQ conversion test on the Atom architecture. Bisection showed that

LLVM revision 192750 is responsible for this degradation. This revision enabled

the MI scheduler for x86, as described in its commit message:

commit 6a7770b7ae43d784dec6f4d3c73ffed6166f3882

Author: Andrew Trick <<a href="mailto:atrick@apple.com">atrick@apple.com</a>>

Date:   Tue Oct 15 23:33:07 2013 +0000

    Enable MI Sched for x86.

    This changes the SelectionDAG scheduling preference to source

    order. Soon, the SelectionDAG scheduler can be bypassed saving

    a nice chunk of compile time.

    Performance differences that result from this change are often a

    consequence of register coalescing. The register coalescer is far from

    perfect. Bugs can be filed for deficiencies.

    On x86 SandyBridge/Haswell, the source order schedule is often

    preserved, particularly for small blocks.

    Register pressure is generally improved over the SD scheduler's ILP

    mode. However, we are still able to handle large blocks that require

    latency hiding, unlike the SD scheduler's BURR mode. MI scheduler also

    attempts to discover the critical path in single-block loops and

    adjust heuristics accordingly.

    The MI scheduler relies on the new machine model. This is currently

    unimplemented for AVX, so we may not be generating the best code yet.

    Unit tests are updated so they don't depend on SD scheduling heuristics.

    git-svn-id: <a href="https://llvm.org/svn/llvm-project/llvm/trunk@192750">https://llvm.org/svn/llvm-project/llvm/trunk@192750</a>

91177308-0d34-0410-b5e6-96231b3b80d8

This scheduling change hurts register allocation for the test mentioned above:

the inner loop now contains additional spill and fill operations. To illustrate

the issue, I prepared a test case focused on the hot loop region (see the

attached tar file with the source code, the assembly output for both revisions,

and the input data file). The additional spill and fill operations are marked

with "!!" in the annotated assembly of the test's inner loop shown below.

rev. 192749 (with one spill inside the inner loop)

--------------------------------------------------------

.LBB0_4:                                # %for.body13

                                        #   Parent Loop BB0_3 Depth=1

                                        # =>  This Inner Loop Header: Depth=2

        movl    36(%esp), %esi          # 4-byte Reload

        movl    32(%esp), %ebx          # 4-byte Reload

        movzbl  1(%esi,%ecx), %edi

        movzbl  (%esi,%ecx), %edx

        imull   $38470, %edi, %ebp      # imm = 0x9646

        imull   $19595, %edx, %eax      # imm = 0x4C8B

        addl    %ebp, %eax

        movzbl  2(%esi,%ecx), %ebp

        movl    %edi, 40(%esp)          # 4-byte Spill              !! 

        imull   $7471, %ebp, %esi       # imm = 0x1D2F

        leal    32768(%esi,%eax), %eax

        imull   $-15119, %edi, %esi     # imm = 0xFFFFFFFFFFFFC4F1

        imull   $32767, %edx, %edi      # imm = 0x7FFF

        shrl    $16, %eax

        addl    %esi, %edi

        movb    %al, (%ebx,%ecx)

        imull   $-17648, %ebp, %eax     # imm = 0xFFFFFFFFFFFFBB10

        leal    32768(%eax,%edi), %eax

        imull   $13282, %edx, %esi      # imm = 0x33E2

        imull   $-32767, 40(%esp), %edi # 4-byte Folded Reload

                                        # imm = 0xFFFFFFFFFFFF8001

        imull   $19485, %ebp, %edx      # imm = 0x4C1D

        leal    (%esi,%edi), %esi

        leal    32768(%edx,%esi), %edx

        shrl    $16, %eax

        shrl    $16, %edx

        leal    3(%ecx), %esi

        movb    %al, 1(%ebx,%ecx)

        cmpl    $361050, %esi           # imm = 0x5825A

        movb    %dl, 2(%ebx,%ecx)

        leal    (%esi), %ecx

        jne     .LBB0_4

vs.

rev. 192750 (with 5 additional spills, 6 in total)

---------------------------------------------

.LBB0_4:                                # %for.body13

                                        #   Parent Loop BB0_3 Depth=1

                                        # =>  This Inner Loop Header: Depth=2

        movzbl  (%ebp,%ecx), %eax        

        movzbl  2(%ebp,%ecx), %esi       

        movzbl  1(%ebp,%ecx), %ebx       

        imull   $19595, %eax, %edx      # imm = 0x4C8B

        movl    %edx, 40(%esp)          # 4-byte Spill               !!

        imull   $7471, %esi, %edx       # imm = 0x1D2F

        movl    %edx, 36(%esp)          # 4-byte Spill               !! 

        imull   $32767, %eax, %edx      # imm = 0x7FFF

        movl    %eax, 24(%esp)          # 4-byte Spill               !! 

        imull   $-17648, %esi, %eax     # imm = 0xFFFFFFFFFFFFBB10

        imull   $38470, %ebx, %edi      # imm = 0x9646

        movl    %eax, 28(%esp)          # 4-byte Spill               !! 

        imull   $13282, 24(%esp), %eax  # 4-byte Folded Reload 

                                        # imm = 0x33E2

        movl    %edx, 32(%esp)          # 4-byte Spill               !!

        imull   $-15119, %ebx, %edx     # imm = 0xFFFFFFFFFFFFC4F1

        movl    %eax, 24(%esp)          # 4-byte Spill               !! 

        addl    40(%esp), %edi          # 4-byte Folded Reload

        movl    36(%esp), %eax          # 4-byte Reload

        imull   $-32767, %ebx, %ebx     # imm = 0xFFFFFFFFFFFF8001

        leal    32768(%eax,%edi), %eax

        addl    32(%esp), %edx          # 4-byte Folded Reload

        movl    28(%esp), %edi          # 4-byte Reload

        imull   $19485, %esi, %esi      # imm = 0x4C1D

        leal    32768(%edi,%edx), %edx

        movl    20(%esp), %edi          # 4-byte Reload

        addl    24(%esp), %ebx          # 4-byte Folded Reload

        shrl    $16, %eax

        movb    %al, (%edi,%ecx)

        leal    32768(%esi,%ebx), %eax

        shrl    $16, %edx

        shrl    $16, %eax

        movb    %dl, 1(%edi,%ecx)

        movb    %al, 2(%edi,%ecx)

        leal    3(%ecx), %ecx

        cmpl    $361050, %ecx           # imm = 0x5825A
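
For readers without the attachment: judging only from the immediates and the
shifts in the two listings above, the inner loop appears to compute an
RGB-to-YIQ transform in fixed point with 16 fractional bits. The C99 sketch
below is a reconstruction under that assumption, not the attached t_rgb.c. The
function name, the parameter names, and the use of unsigned 32-bit arithmetic
(suggested by the logical shrl) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reconstruction of the hot loop; the coefficients are the
 * immediates visible in the assembly listings, everything else is assumed. */
void rgb_to_yiq_sketch(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    assert(src != NULL && dst != NULL);

    for (size_t p = 0; p < n_pixels; ++p) {
        /* Three byte loads per pixel (the movzbl instructions). */
        uint32_t r = src[3 * p];
        uint32_t g = src[3 * p + 1];
        uint32_t b = src[3 * p + 2];

        /* Nine products per pixel (the imull instructions); 32768 rounds
         * before the shift. Unsigned wrap-around is well defined in C. */
        uint32_t y = 19595u * r + 38470u * g +  7471u * b + 32768u;
        uint32_t i = 32767u * r - 15119u * g - 17648u * b + 32768u;
        uint32_t q = 13282u * r - 32767u * g + 19485u * b + 32768u;

        /* Keep the low byte of each shifted result (shrl $16, movb). */
        dst[3 * p]     = (uint8_t)(y >> 16);
        dst[3 * p + 1] = (uint8_t)(i >> 16);
        dst[3 * p + 2] = (uint8_t)(q >> 16);
    }
}
```

Each pixel needs three loaded bytes and nine products, so on 32-bit x86 the
order in which the products are issued plausibly decides how many of them
have to be spilled before they are summed.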

The compilation command line is as follows.

<path to build of rev. 192749 or rev. 192750>/install/bin/clang -ansi  -O2

-ffast-math -msse2 -mfpmath=sse -m32 -static  -march=atom -mtune=atom  

./t_rgb.c   -o./t_rgb.exe

Running the compiled binaries on an Atom processor gives the results below (the

input data file t_rgb.ppm must be in the run directory). The two builds differ

by ~17% in run time on the attached test.

for r.192749 -

time t_rgb.exe

<span class="quote">>>  Input data file = t_rgb.ppm

>>  Number of pixels = 120350</span >

real    0m31.344s

user    0m31.324s

sys     0m0.005s

for r.192750 -

time t_rgb.exe 

<span class="quote">>>  Input data file = t_rgb.ppm 

>>  Number of pixels = 120350 </span >

real    0m36.809s 

user    0m36.787s                     !! diff. is 17.4% 

sys     0m0.002s
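
The 17.4% figure can be checked from the user times above. A minimal sketch
in C; the helper name slowdown_percent is hypothetical.

```c
#include <assert.h>

/* Relative slowdown of after_s versus before_s, in percent. */
double slowdown_percent(double before_s, double after_s)
{
    assert(before_s > 0.0);
    return (after_s - before_s) / before_s * 100.0;
}
```

With before_s = 31.324 and after_s = 36.787 this gives about 17.4. The real
times (31.344 s and 36.809 s) give the same figure.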

Best regards, Okunev Sergey

Intel Corporation</pre>

        </div>

      </p>


    </body>

</html>