[llvm-bugs] [Bug 25219] New: [ppc] LLVM built 470.lbm is 9.5% slower than gcc on power8

Fri Oct 16 14:22:39 PDT 2015

https://llvm.org/bugs/show_bug.cgi?id=25219

            Bug ID: 25219
           Summary: [ppc] LLVM built 470.lbm is 9.5% slower than gcc on
                    power8
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: PowerPC
          Assignee: unassignedbugs at nondot.org
          Reporter: carrot at google.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

The following compiler options are used:
-fno-strict-aliasing -O2 -m64 -mvsx -mcpu=power8 -ffp-contract=fast

More than 98% of the execution time is spent in the function
LBM_performStreamCollide, which contains a single loop. The relevant code is:

void LBM_performStreamCollide( LBM_Grid srcGrid, LBM_Grid dstGrid ) {
  for (...)
  {
     ...

                ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
                     + SRC_NE( srcGrid ) - SRC_NW( srcGrid )
                     + SRC_SE( srcGrid ) - SRC_SW( srcGrid )
                     + SRC_ET( srcGrid ) + SRC_EB( srcGrid )
                     - SRC_WT( srcGrid ) - SRC_WB( srcGrid );
                uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
                     + SRC_NE( srcGrid ) + SRC_NW( srcGrid )
                     - SRC_SE( srcGrid ) - SRC_SW( srcGrid )
                     + SRC_NT( srcGrid ) + SRC_NB( srcGrid )
                     - SRC_ST( srcGrid ) - SRC_SB( srcGrid );
                uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
                     + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
                     + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
                     + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
                     + SRC_WT( srcGrid ) - SRC_WB( srcGrid );

                ux /= rho;
                uy /= rho;
                uz /= rho;

                if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                        ux = 0.005;
                        uy = 0.002;
                        uz = 0.000;
                }

        ...
  }
}

LLVM transforms the code into

            if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
                     + SRC_NE( srcGrid ) - SRC_NW( srcGrid )
                     + SRC_SE( srcGrid ) - SRC_SW( srcGrid )
                     + SRC_ET( srcGrid ) + SRC_EB( srcGrid )
                     - SRC_WT( srcGrid ) - SRC_WB( srcGrid );
                ux /= rho;
            }
            else
                ux = 0.005;

            if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
                     + SRC_NE( srcGrid ) + SRC_NW( srcGrid )
                     - SRC_SE( srcGrid ) - SRC_SW( srcGrid )
                     + SRC_NT( srcGrid ) + SRC_NB( srcGrid )
                     - SRC_ST( srcGrid ) - SRC_SB( srcGrid );
                uy /= rho;
            }
            else
                uy = 0.002;

            if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
                uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
                     + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
                     + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
                     + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
                     + SRC_WT( srcGrid ) - SRC_WB( srcGrid );
                uz /= rho;
            }
            else
                uz = 0.000;

Note that each of the floating-point expressions is a dependence chain
containing 10 floating-point instructions (9 adds/subtracts and 1 divide), for
example:

                uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
                     + SRC_NT( srcGrid ) - SRC_NB( srcGrid )
                     + SRC_ST( srcGrid ) - SRC_SB( srcGrid )
                     + SRC_ET( srcGrid ) - SRC_EB( srcGrid )
                     + SRC_WT( srcGrid ) - SRC_WB( srcGrid );
                uz /= rho;

On power8 each fp instruction has a latency of 6 or more cycles, so each of the
three dependence chains needs at least 60 cycles to execute.
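
The estimate above can be reproduced with a back-of-envelope model. This is
only a sketch: the function names are hypothetical, the 6-cycle figure is the
minimum latency quoted in this report, and the model ignores issue width, load
latency and any overlap the out-of-order core may find on its own.

```c
/* Lower bound for one serial dependence chain: every instruction has to
   wait for the result of the previous one. */
int chain_cycles(int instructions, int latency_per_instruction)
{
    return instructions * latency_per_instruction;
}

/* Lower bound when several such chains run strictly one after another,
   with no overlap between them (a pessimistic assumption). */
int back_to_back_cycles(int chains, int instructions,
                        int latency_per_instruction)
{
    return chains * chain_cycles(instructions, latency_per_instruction);
}
```

Under these assumptions chain_cycles(10, 6) gives 60 cycles for one chain, and
back_to_back_cycles(3, 10, 6) gives 180 cycles for the three chains if nothing
overlaps. Three perfectly overlapped chains would still need roughly the
60 cycles of a single chain.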

GCC doesn't do the control flow transform, so all three dependence chains stay
in one basic block where it can interleave them, and the resulting code is much
faster.
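
The two code shapes can be written out in a reduced, self-contained form. This
is a hypothetical sketch, not the 470.lbm source: the Velocity type, the
function names and the 4-term chains are made up for illustration, and an
optimizing compiler is free to turn either function into the other shape.

```c
typedef struct { double ux, uy, uz; } Velocity;

/* Shape of the original source: the three chains sit in one basic block,
   so their instructions can be interleaved; the constants override the
   results afterwards. */
Velocity velocity_source_shape(const double e[4], const double n[4],
                               const double t[4], double rho, int accel)
{
    double ux = e[0] - e[1] + e[2] - e[3];
    double uy = n[0] - n[1] + n[2] - n[3];
    double uz = t[0] - t[1] + t[2] - t[3];

    ux /= rho;
    uy /= rho;
    uz /= rho;

    if (accel) {
        ux = 0.005;
        uy = 0.002;
        uz = 0.000;
    }
    return (Velocity){ ux, uy, uz };
}

/* Shape after the control flow transform: each chain is sunk into its own
   conditional block, so the chains are separated by branches. */
Velocity velocity_sunk_shape(const double e[4], const double n[4],
                             const double t[4], double rho, int accel)
{
    Velocity v;

    if (!accel)
        v.ux = (e[0] - e[1] + e[2] - e[3]) / rho;
    else
        v.ux = 0.005;

    if (!accel)
        v.uy = (n[0] - n[1] + n[2] - n[3]) / rho;
    else
        v.uy = 0.002;

    if (!accel)
        v.uz = (t[0] - t[1] + t[2] - t[3]) / rho;
    else
        v.uz = 0.000;

    return v;
}
```

Both functions return the same values for any input; only the placement of the
floating-point work relative to the branches differs, which is what matters for
scheduling.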
