[LLVMbugs] [Bug 23294] New: Performance degradation of eembc.1.1/idctrn01 test on x86 Avoton-1.7 due to adding of LICM pass after loop unrolling

Mon Apr 20 05:31:46 PDT 2015

https://llvm.org/bugs/show_bug.cgi?id=23294

            Bug ID: 23294
           Summary: Performance degradation of eembc.1.1/idctrn01 test on
                    x86 Avoton-1.7  due to adding of LICM pass after loop
                    unrolling
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: sergey.k.okunev at gmail.com
                CC: david.l.kreitzer at intel.com, denis.briltz at intel.com,
                    elena.demikhovsky at intel.com, llvmbugs at cs.uiuc.edu,
                    michael.m.kuperstein at intel.com, sergos.gnu at gmail.com,
                    zia.ansari at intel.com
    Classification: Unclassified

Created attachment 14231
  --> https://llvm.org/bugs/attachment.cgi?id=14231&action=edit
Initial ll-file of considered 't_run_test' function

Bisection showed that LLVM revision r232011 is responsible for the
degradation. The commit message is as follows.

commit a56999c5decca0023e5ce481fc08571e227e3aa3
Author: Kevin Qin <Kevin.Qin at arm.com>
Date:   Thu Mar 12 05:36:01 2015 +0000

    Reapply 'Run LICM pass after loop unrolling pass.'

    It's firstly committed at r231630, and reverted at r231635.

    Function pass InstructionSimplifier is inserted as barrier to
    make sure loop unroll pass won't affect on LICM pass.

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@232011 91177308-0d34-0410-b5e6-96231b3b80d8

LLVM-clang options: -O2 -ffast-math -m32 -mfpmath=sse -march=slm -fPIE -pie

The eembc_1_1/idctrn01 test has several hot loops with 3 levels of nesting (8 x
8 x 8 iterations). Each of these loops is fully unrolled. With the loop
invariant code motion pass (LICM) added after the loop unroll pass (in r232011),
16 loads with constant addresses are hoisted above the hot loop and the
corresponding 16 stores are sunk below it.
When machine code is generated for the x86_32 AVT1.7 architecture, these 16
loaded and stored values are kept on the stack through spill and fill
instructions because there are not enough xmm registers. As a result, r232011
has additional loads and stores (fills and spills) inside, before and after the
loop. The number of loads and stores inside the loop is therefore very close
for both revisions, and the 'additional' loads and stores before and after the
loop cause the regression.

In terms of simplified IR, the changes enabled by this revision look as
follows.

    +16 <4 x i32> loads   -> virt_regs1     !! hoisted up loads with const. addr
    br label %l_loop
%l_loop:
    +16 <4 x i32> phi assignments
    calculations (virt_regs1 / virt_regs2)  !! loads and stores were replaced by virtual regs
    br %exitcond, label %l_exit, label %l_loop
%l_exit:
    +16 <4 x i32> phi assignments
    +16 virt_regs2 -> <4 x i32> stores      !! sunk stores with const. addr
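
For readers less familiar with LICM's promotion of memory to registers, the IR change above corresponds roughly to the following source-level rewrite. This is only a sketch under stated assumptions: the function names, the scalar int32_t accumulators (the real code uses <4 x i32> vectors) and the coefficient layout are hypothetical and are not taken from idctrn01.

```c
#include <stdint.h>
#include <string.h>

enum { N_ACC = 16, N_ITER = 8 };  /* 16 load-store chains, 8 loop iterations */

/* Shape before r232011: every accumulator is loaded from and stored to
   its fixed address on each iteration of the hot loop. */
static void accumulate_in_memory(int32_t acc[N_ACC], const int8_t coef[N_ITER],
                                 const int32_t v[N_ACC]) {
    for (int i = 0; i < N_ITER; ++i)
        for (int k = 0; k < N_ACC; ++k)
            acc[k] += coef[i] * v[k];      /* load, multiply-add, store */
}

/* Shape after r232011: the 16 loads are hoisted, the 16 stores are sunk,
   and the 16 values stay live across the whole loop. On a target with
   only 8 xmm registers these live values have to be spilled. */
static void accumulate_promoted(int32_t acc[N_ACC], const int8_t coef[N_ITER],
                                const int32_t v[N_ACC]) {
    int32_t tmp[N_ACC];
    memcpy(tmp, acc, sizeof tmp);          /* hoisted loads */
    for (int i = 0; i < N_ITER; ++i)
        for (int k = 0; k < N_ACC; ++k)
            tmp[k] += coef[i] * v[k];      /* works on promoted values only */
    memcpy(acc, tmp, sizeof tmp);          /* sunk stores */
}
```

Both functions compute the same result; the second only pays off when the 16 promoted values actually fit in registers, which is not the case here.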

The corresponding asm loop fragments, showing one load-store chain, are as
follows for the two revisions.

r232010:
-------
xor    %ecx,%ecx
; start of loop1
l_f772f5a0:
    add    $0x20,%ecx

    movsbl -0x38(%eax),%edx
    movdqu 0x2d0(%ebx),%xmm6         !! load with const addr. is inside loop
    movd   %edx,%xmm0
    pshufd $0x0,%xmm0,%xmm2
    pmulld %xmm2,%xmm3
    paddd  %xmm3,%xmm6
    movdqu %xmm6,0x2d0(%ebx)         !! store with const. addr. is inside loop

    ; ... and 15 more blocks like that ...

    add    $0x1,%eax
    cmp    $0x100,%ecx
jne    l_f772f5a0 <t_run_test+0x840>
; end of loop1

vs.

r232011:
-------
xor    %ecx,%ecx
movdqu 0x2d0(%ebx),%xmm0                !! hoisted up load with const. addr
movdqa %xmm0,0xd0(%esp)                 !! spill-instr.  before loop
; and 15 more movs like that
; start of loop1
l_f77c7680:
    add    $0x20,%ecx

    movsbl -0x38(%eax),%edx
    movdqa 0xd0(%esp),%xmm7             !! fill-instr. inside loop
    movd   %edx,%xmm2
    pshufd $0x0,%xmm2,%xmm2
    pmulld %xmm2,%xmm0
    paddd  %xmm0,%xmm7
    movdqa %xmm7,0xd0(%esp)             !! spill-instr.  inside loop

    ; ... and 15 more blocks like that ...

    add    $0x1,%eax
    cmp    $0x100,%ecx
jne    l_f77c7680 <t_run_test+0x920>
; end of loop1
movdqa 0xd0(%esp),%xmm0                 !! fill-instr. after loop 
movdqu %xmm0,0x2d0(%ebx)                !! sunk stores with const. addr
; and 15 more movs like that
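
The difference between the two listings can be put into numbers with a rough count of xmm memory accesses. The sketch below assumes that all 16 chains look like the one shown, that the loop runs 0x100 / 0x20 = 8 times (ecx is incremented by 0x20 until it reaches 0x100), and it ignores the movsbl loads, which are the same in both revisions. The function names are hypothetical, and this counts instructions only, not cycles.

```c
enum {
    CHAINS = 16,                 /* load-store chains in the loop body */
    ITERATIONS = 0x100 / 0x20    /* ecx: +0x20 per iteration until 0x100 */
};

/* r232010: one movdqu load and one movdqu store per chain per iteration. */
static int xmm_mem_ops_r232010(void) {
    return ITERATIONS * CHAINS * 2;
}

/* r232011: fill + spill per chain per iteration, plus the extra
   accesses before and after the loop. */
static int xmm_mem_ops_r232011(void) {
    int before = CHAINS * 2;               /* hoisted load + spill */
    int inside = ITERATIONS * CHAINS * 2;  /* fill + spill */
    int after  = CHAINS * 2;               /* fill + sunk store */
    return before + inside + after;
}
```

Under these assumptions the count is 256 accesses for r232010 and 320 for r232011: the same number inside the loop, plus 64 extra accesses outside it.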

Okunev Sergey,
Software Engineer
Intel Compiler Team
