[llvm-bugs] [Bug 25219] New: [ppc] LLVM built 470.lbm is 9.5% slower than gcc on power8
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Oct 16 14:22:39 PDT 2015
https://llvm.org/bugs/show_bug.cgi?id=25219
Bug ID: 25219
Summary: [ppc] LLVM built 470.lbm is 9.5% slower than gcc on power8
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Backend: PowerPC
Assignee: unassignedbugs at nondot.org
Reporter: carrot at google.com
CC: llvm-bugs at lists.llvm.org
Classification: Unclassified
The following compiler options are used:
-fno-strict-aliasing -O2 -m64 -mvsx -mcpu=power8 -ffp-contract=fast
More than 98% of the execution time is spent in the function
LBM_performStreamCollide, which contains a single loop. The relevant code is:
void LBM_performStreamCollide( LBM_Grid srcGrid, LBM_Grid dstGrid ) {
for (...)
{
...
ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
+ SRC_NE( srcGrid ) - SRC_NW( srcGrid )
+ SRC_SE( srcGrid ) - SRC_SW( srcGrid )
+ SRC_ET( srcGrid ) + SRC_EB( srcGrid )
- SRC_WT( srcGrid ) - SRC_WB( srcGrid );
uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
+ SRC_NE( srcGrid ) + SRC_NW( srcGrid )
- SRC_SE( srcGrid ) - SRC_SW( srcGrid )
+ SRC_NT( srcGrid ) + SRC_NB( srcGrid )
- SRC_ST( srcGrid ) - SRC_SB( srcGrid );
uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
+ SRC_NT( srcGrid ) - SRC_NB( srcGrid )
+ SRC_ST( srcGrid ) - SRC_SB( srcGrid )
+ SRC_ET( srcGrid ) - SRC_EB( srcGrid )
+ SRC_WT( srcGrid ) - SRC_WB( srcGrid );
ux /= rho;
uy /= rho;
uz /= rho;
if( TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
ux = 0.005;
uy = 0.002;
uz = 0.000;
}
...
}
}
LLVM transforms the code into
if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
ux = + SRC_E ( srcGrid ) - SRC_W ( srcGrid )
+ SRC_NE( srcGrid ) - SRC_NW( srcGrid )
+ SRC_SE( srcGrid ) - SRC_SW( srcGrid )
+ SRC_ET( srcGrid ) + SRC_EB( srcGrid )
- SRC_WT( srcGrid ) - SRC_WB( srcGrid );
ux /= rho;
}
else
ux = 0.005;
if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
uy = + SRC_N ( srcGrid ) - SRC_S ( srcGrid )
+ SRC_NE( srcGrid ) + SRC_NW( srcGrid )
- SRC_SE( srcGrid ) - SRC_SW( srcGrid )
+ SRC_NT( srcGrid ) + SRC_NB( srcGrid )
- SRC_ST( srcGrid ) - SRC_SB( srcGrid );
uy /= rho;
}
else
uy = 0.002;
if( !TEST_FLAG_SWEEP( srcGrid, ACCEL )) {
uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
+ SRC_NT( srcGrid ) - SRC_NB( srcGrid )
+ SRC_ST( srcGrid ) - SRC_SB( srcGrid )
+ SRC_ET( srcGrid ) - SRC_EB( srcGrid )
+ SRC_WT( srcGrid ) - SRC_WB( srcGrid );
uz /= rho;
}
else
uz = 0.000;
Note that each of these floating point expressions is a dependence chain of
10 floating point instructions (9 adds/subtracts and 1 divide), for example:
uz = + SRC_T ( srcGrid ) - SRC_B ( srcGrid )
+ SRC_NT( srcGrid ) - SRC_NB( srcGrid )
+ SRC_ST( srcGrid ) - SRC_SB( srcGrid )
+ SRC_ET( srcGrid ) - SRC_EB( srcGrid )
+ SRC_WT( srcGrid ) - SRC_WB( srcGrid );
uz /= rho;
On power8 each fp instruction has a latency of 6 or more cycles, so each of
the three dependence chains needs at least 60 cycles to execute.
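The arithmetic behind that estimate can be written out as a small sketch. This is an idealized model, not a measurement: the function names are made up for illustration, 6 cycles is the lower bound quoted above, and loads, issue width and branch effects are ignored. The "serialized" figure is simply the worst case in which the three chains do not overlap at all.

```c
#include <assert.h>

/* Lower bound on the cycles needed by one chain of `ops` instructions in
   which each instruction waits for the result of the previous one. */
static int chain_cycles(int ops, int latency)
{
    assert(ops >= 0 && latency >= 0);
    return ops * latency;
}

/* Worst case (assumption): the chains run strictly one after another. */
static int serialized_cycles(int chains, int ops, int latency)
{
    return chains * chain_cycles(ops, latency);
}

/* Best case (assumption): the chains are independent and fully overlapped,
   so the total is bounded by the length of a single chain. */
static int interleaved_cycles(int chains, int ops, int latency)
{
    return chains > 0 ? chain_cycles(ops, latency) : 0;
}
```

With 10 instructions per chain and a 6-cycle latency, `chain_cycles(10, 6)` gives the 60-cycle bound; under these assumptions three non-overlapping chains cost 180 cycles, while three fully interleaved chains stay near 60.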
GCC doesn't do this control flow transform, so it can interleave the 3
dependence chains, and the resulting code is much faster.
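For reference, the two code shapes discussed in this report can be reduced to a small standalone sketch. Everything here is a hypothetical stand-in: `Vel`, `chain`, `source_shape` and `branchy_shape` are not lbm names, the SRC_* macros are replaced by arrays of 10 already-signed terms, and `accel` stands for the TEST_FLAG_SWEEP result. It only shows that the two shapes compute the same values; it makes no claim about what a particular compiler emits for either one.

```c
#include <assert.h>

typedef struct { double ux, uy, uz; } Vel;

/* One dependence chain: 9 dependent additions followed by 1 division. */
static double chain(const double t[10], double rho)
{
    assert(rho != 0.0);
    double s = t[0];
    for (int i = 1; i < 10; ++i)
        s += t[i];              /* each add depends on the previous one */
    return s / rho;
}

/* Shape of the original source: all three chains sit in one straight-line
   region, and the ACCEL case overrides the results afterwards. */
static Vel source_shape(const double x[10], const double y[10],
                        const double z[10], double rho, int accel)
{
    Vel v = { chain(x, rho), chain(y, rho), chain(z, rho) };
    if (accel) {
        v.ux = 0.005;
        v.uy = 0.002;
        v.uz = 0.000;
    }
    return v;
}

/* Shape described for LLVM's output: each chain is placed under its own
   test of the same flag, so the chains end up in separate blocks. */
static Vel branchy_shape(const double x[10], const double y[10],
                         const double z[10], double rho, int accel)
{
    Vel v;
    if (!accel) v.ux = chain(x, rho); else v.ux = 0.005;
    if (!accel) v.uy = chain(y, rho); else v.uy = 0.002;
    if (!accel) v.uz = chain(z, rho); else v.uz = 0.000;
    return v;
}
```

Both functions return the same `Vel` for any input; the difference that matters for this bug is only whether the three chains are available to the scheduler in the same region.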