[LLVMbugs] [Bug 22790] New: Performance degradation of eembc.1.1/rspeed01 test on x86 Avoton-1.7 due to ‘select’ transformation

Wed Mar 4 08:37:34 PST 2015

http://llvm.org/bugs/show_bug.cgi?id=22790

            Bug ID: 22790
           Summary: Performance degradation of eembc.1.1/rspeed01 test on
                    x86 Avoton-1.7  due to ‘select’ transformation
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Scalar Optimizations
          Assignee: unassignedbugs at nondot.org
          Reporter: sergey.k.okunev at gmail.com
                CC: david.l.kreitzer at intel.com, denis.briltz at intel.com,
                    llvmbugs at cs.uiuc.edu, matze at braunis.de,
                    michael.m.kuperstein at intel.com, sergos.gnu at gmail.com,
                    zia.ansari at intel.com
    Classification: Unclassified

Bisection showed that LLVM revision r228409 is responsible for this
degradation:
commit 2f2dec87fbd809c2f303cd38c92ccd2a84221b7a
Author: Matthias Braun <matze at braunis.de>
Date:   Fri Feb 6 17:49:36 2015 +0000

    InstCombine: Combine select sequences into a single select

    Normalize
    select(C0, select(C1, a, b), b) -> select((C0 & C1), a, b)
    select(C0, a, select(C1, a, b)) -> select((C0 | C1), a, b)

    This normal form may enable further combines on the And/Or and shortens
    paths for the values. Many targets prefer the other but can go back
    easily in CodeGen.

    Differential Revision: http://reviews.llvm.org/D7399

    git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@228409
91177308-0d34-0410-b5e6-96231b3b80d8
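
The two normalizations quoted in the commit message can be checked by brute
force over all condition values. The following is only a minimal sketch: the
helper names `select` and `check_normal_forms` are made up for illustration and
are not part of LLVM, and `and`/`or` on Python booleans stand in for the i1
`and`/`or` instructions.

```python
from itertools import product


def select(cond, a, b):
    """Model of the IR `select`: yields a when cond is true, otherwise b."""
    return a if cond else b


def check_normal_forms():
    """Compare both sides of each rewrite for every (C0, C1) combination."""
    a, b = "a", "b"  # two distinguishable placeholder values
    for c0, c1 in product([False, True], repeat=2):
        # select(C0, select(C1, a, b), b) -> select((C0 & C1), a, b)
        assert select(c0, select(c1, a, b), b) == select(c0 and c1, a, b)
        # select(C0, a, select(C1, a, b)) -> select((C0 | C1), a, b)
        assert select(c0, a, select(c1, a, b)) == select(c0 or c1, a, b)
    return True


if __name__ == "__main__":
    print(check_normal_forms())
```

The check only confirms that the rewrite preserves the computed value. It says
nothing about whether the rewritten form is cheaper for a given target.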

This transformation does not improve the code when C0 depends on
‘select(C1, a, b)’, because the nested select then cannot be removed.
The test in question contains such ‘select’ sequences. The corresponding IR dump
fragments after the “Combine redundant instructions” pass, showing the dependent
‘select’s before and after the degradation, are as follows.

r228405:
-------
*** IR Dump After Combine redundant instructions ***
…………………
%5 = load i32* @tonewheelCounter, align 4, !tbaa !2
%6 = load i32* @t_run_test.tonewheelCounterLast1, align 4, !tbaa !2
%sub = sub nsw i32 %5, %6
%cmp7 = icmp slt i32 %5, %6
%add9 = add nsw i32 %sub, 32768
%add9.sub = select i1 %cmp7, i32 %add9, i32 %sub
store i32 %add9.sub, i32* @t_run_test.toothDeltaTime1, align 4, !tbaa !2
%cmp11 = icmp slt i32 %add9.sub, 100                                            
%7 = load i32* @t_run_test.toothDeltaTimeLast1, align 4, !tbaa !2
%.add9.sub = select i1 %cmp11, i32 %7, i32 %add9.sub         !! selects 2 and 3 are optimized
%.add9.sub141 = select i1 %cmp11, i32 %7, i32 %add9.sub      !! to 1 cmov during “Instruction selection”
%mul14 = shl nsw i32 %7, 2
%cmp15 = icmp sgt i32 %.add9.sub141, %mul14                  !! cmp15 depends on the third select
%..add9.sub = select i1 %cmp15, i32 %7, i32 %.add9.sub       !! select 3 -> cmp15 -> select 4
store i32 %..add9.sub, i32* @t_run_test.toothDeltaTime1, align 4, !tbaa !2
%..add9.sub141 = select i1 %cmp15, i32 %7, i32 %.add9.sub141 !! selects 4 and 5 are optimized during “Instruction selection”

vs.

r228409:
--------
*** IR Dump After Combine redundant instructions ***
%5 = load i32* @tonewheelCounter, align 4, !tbaa !2
%6 = load i32* @t_run_test.tonewheelCounterLast1, align 4, !tbaa !2
%sub = sub nsw i32 %5, %6
%cmp7 = icmp slt i32 %5, %6
%add9 = add nsw i32 %sub, 32768
%add9.sub = select i1 %cmp7, i32 %add9, i32 %sub
store i32 %add9.sub, i32* @t_run_test.toothDeltaTime1, align 4, !tbaa !2
%cmp11 = icmp slt i32 %add9.sub, 100
%7 = load i32* @t_run_test.toothDeltaTimeLast1, align 4, !tbaa !2
%.add9.sub141 = select i1 %cmp11, i32 %7, i32 %add9.sub     !! ‘select’ is not removed because cmp15 is data-flow dependent on it
%mul14 = shl nsw i32 %7, 2
%cmp15 = icmp sgt i32 %.add9.sub141, %mul14
%8 = or i1 %cmp15, %cmp11                                   !! the ‘or’ instruction increases CRP in this case
%..add9.sub = select i1 %8, i32 %7, i32 %add9.sub           !! transformed ‘select’ is data-flow dependent on the previous ‘select’ through cmp15
store i32 %..add9.sub, i32* @t_run_test.toothDeltaTime1, align 4, !tbaa !2
%..add9.sub141 = select i1 %cmp15, i32 %7, i32 %.add9.sub141       !!
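
To make the difference between the two dumps easier to follow, here is a rough
Python model of both fragments. It is a sketch under assumptions: the function
and parameter names are invented for illustration, Python integers stand in for
i32 (the `nsw` flags are taken to mean no overflow occurs), and the sampled
input ranges are arbitrary.

```python
import random


def select(cond, a, b):
    """Model of the IR `select`: yields a when cond is true, otherwise b."""
    return a if cond else b


def before_r228405(counter, counter_last, delta_last):
    sub = counter - counter_last
    add9_sub = select(counter < counter_last, sub + 32768, sub)
    cmp11 = add9_sub < 100
    s2 = select(cmp11, delta_last, add9_sub)   # %.add9.sub
    s3 = select(cmp11, delta_last, add9_sub)   # %.add9.sub141
    cmp15 = s3 > (delta_last << 2)             # depends on select 3
    s4 = select(cmp15, delta_last, s2)         # %..add9.sub
    s5 = select(cmp15, delta_last, s3)         # %..add9.sub141
    return s4, s5


def after_r228409(counter, counter_last, delta_last):
    sub = counter - counter_last
    add9_sub = select(counter < counter_last, sub + 32768, sub)
    cmp11 = add9_sub < 100
    s3 = select(cmp11, delta_last, add9_sub)   # %.add9.sub141, still needed
    cmp15 = s3 > (delta_last << 2)
    or8 = cmp15 or cmp11                       # %8, the extra 'or'
    s4 = select(or8, delta_last, add9_sub)     # %..add9.sub
    s5 = select(cmp15, delta_last, s3)         # %..add9.sub141
    return s4, s5


def models_agree(samples=10000, seed=0):
    """Compare the two models on pseudo-random inputs."""
    rng = random.Random(seed)
    for _ in range(samples):
        args = (rng.randrange(32768), rng.randrange(32768),
                rng.randrange(-1000, 100000))
        if before_r228405(*args) != after_r228409(*args):
            return False
    return True
```

Both models return the same pair of values. In the second one the select
feeding cmp15 is still computed, and the combined condition has to wait for
cmp15 before the final select can execute.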

As a result, the additional (not removed) cmov and ‘or’ instructions in r228409
lead to the performance degradation, as seen in the instrumented generated code
below. Moreover, in r228405 the pattern optimization for selects during the
“Instruction selection” phase produces 3 ‘cmov’ instructions from 5 ‘select’s,
whereas 4 ‘cmov’s remain in the r228409 version of the code.
The select/cmov optimization is a subcase of the transformation under
consideration with C0 == C1,
i.e., select(C0, a, select(C0, a, b)) -> select(C0, a, b).
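
The effect of the extra ‘or’ on the dependence chain can be estimated with a
small sketch. It assumes that “CRP” in the annotations means critical path, and
it uses a deliberately crude cost model of one unit per IR instruction; the
names `BEFORE`, `AFTER`, `COMMON` and `depth` are illustrative only. The operand
lists are copied from the two IR dumps.

```python
# Operands of each value, shared by both revisions.
COMMON = {
    "%5": [], "%6": [], "%7": [],
    "%sub": ["%5", "%6"],
    "%cmp7": ["%5", "%6"],
    "%add9": ["%sub"],
    "%add9.sub": ["%cmp7", "%add9", "%sub"],
    "%cmp11": ["%add9.sub"],
    "%.add9.sub141": ["%cmp11", "%7", "%add9.sub"],
    "%mul14": ["%7"],
    "%cmp15": ["%.add9.sub141", "%mul14"],
    "%..add9.sub141": ["%cmp15", "%7", "%.add9.sub141"],
}

# r228405: the stored value is select(%cmp15, %7, %.add9.sub).
BEFORE = dict(COMMON)
BEFORE["%.add9.sub"] = ["%cmp11", "%7", "%add9.sub"]
BEFORE["%..add9.sub"] = ["%cmp15", "%7", "%.add9.sub"]

# r228409: the stored value is select(%8, %7, %add9.sub), %8 = or %cmp15, %cmp11.
AFTER = dict(COMMON)
AFTER["%8"] = ["%cmp15", "%cmp11"]
AFTER["%..add9.sub"] = ["%8", "%7", "%add9.sub"]


def depth(graph, node):
    """Length of the longest dependence chain ending at node (unit cost)."""
    return 1 + max((depth(graph, op) for op in graph[node]), default=0)


if __name__ == "__main__":
    print(depth(BEFORE, "%..add9.sub"))  # 8
    print(depth(AFTER, "%..add9.sub"))   # 9
```

Under this unit-cost assumption the chain leading to the value stored in
toothDeltaTime1 grows by one instruction, while the chain for %..add9.sub141 is
unchanged. Real latencies on Avoton will differ from this model.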

#228405:
------- 
0xf775ce9f 108 392 sub    %ecx,%eax
0xf775cea1 109 53 add    $0x8000,%eax
0xf775cea6 110 166 sub    %ecx,%edx
0xf775cea8 111 67 cmovl  %eax,%edx             !!
0xf775ceab 112 671 mov    0x24c(%ebx),%eax
0xf775ceb1 113 75 cmp    $0x64,%edx
0xf775ceb4 114 185 mov    %edx,0x240(%ebx)
0xf775ceba 115 237 cmovl  %eax,%edx            !!
0xf775cebd 116 268 lea    0x0(,%eax,4),%ecx
0xf775cec4 117 64 cmp    %ecx,%edx
0xf775cec6 118 438 mov    0x24(%esp),%ecx
0xf775ceca 119 63 cmovg  %eax,%edx             !! res = %edx
0xf775cecd 120 489 mov    0x264(%ebx),%eax
0xf775ced3 121 69 add    $0x1,%esi
0xf775ced6 122 195 mov    %edx,0x240(%ebx)     !! store res
0xf775cedc 123 54 mov    %edx,0x24c(%ebx)      !! store res  
0xf775cee2 124 212 mov    %esi,0x270(%ebx)
0xf775cee8 125 51 mov    (%ecx),%ecx
0xf775ceea 126 197 add    %edx,%eax
0xf775ceec 127 72 mov    %eax,0x264(%ebx)

vs.

#228409:
-------
0xf7729e89 105 335 sub    %ecx,%eax
0xf7729e8b 106 483 add    $0x8000,%eax
0xf7729e90 107 242 sub    %ecx,%edx
0xf7729e92 108 121 mov    0x24c(%ebx),%ecx
0xf7729e98 109 134 cmovl  %eax,%edx           !!
0xf7729e9b 110 238 cmp    $0x64,%edx 
0xf7729e9e 111 490 mov    %edx,%esi
0xf7729ea0 112 129 mov    %edx,0x240(%ebx)
0xf7729ea6 113 125 cmovl  %ecx,%esi           !!
0xf7729ea9 114 234 lea    0x0(,%ecx,4),%edi
0xf7729eb0 115 122 setl   %al                
0xf7729eb3 116 352 cmp    %edi,%esi
0xf7729eb5 117 113 setg   %ah
0xf7729eb8 118 160 cmovg  %ecx,%esi           !!  res = %esi
0xf7729ebb 119 253 or     %al,%ah             !!! or 
0xf7729ebd 120 247 mov    0x264(%ebx),%eax
0xf7729ec3 121 112 mov    %esi,0x24c(%ebx)    !!  store res 
0xf7729ec9 122 118 cmovne %ecx,%edx           !!! additional ‘cmov’, res = %edx
0xf7729ecc 123 226 mov    0x24(%esp),%ecx
0xf7729ed0 124 126 mov    %edx,0x240(%ebx)    !!  store res
0xf7729ed6 125 120 mov    0x270(%ebx),%edx
0xf7729edc 126 134 add    %esi,%eax
0xf7729ede 127 114 mov    (%ecx),%ecx
0xf7729ee0 128 127 mov    %eax,0x264(%ebx)
0xf7729ee6 129 139 add    $0x1,%edx
0xf7729ee9 130 128 mov    %edx,0x270(%ebx)
0xf7729eef 131 128 mov    %ecx,%edi

Okunev Sergey,
Software Engineer
Intel Compiler Team
