[LLVMbugs] [Bug 14268] New: _mm_dp_ps generating 55% more inefficient instructions
bugzilla-daemon at llvm.org
Mon Nov 5 19:09:25 PST 2012
http://llvm.org/bugs/show_bug.cgi?id=14268
Bug #: 14268
Summary: _mm_dp_ps generating 55% more inefficient instructions
Product: libraries
Version: trunk
Platform: PC
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
AssignedTo: unassignedbugs at nondot.org
ReportedBy: ramihg at gmail.com
CC: llvmbugs at cs.uiuc.edu
Classification: Unclassified
Created attachment 9499
--> http://llvm.org/bugs/attachment.cgi?id=9499
Simple test case
OS used: Mac OS X Mountain Lion
Clang/LLVM Used: 3.2-r167157
Processor: Intel Core i7
I was recently writing some SSE dot product implementations when I noticed a
severe performance drop after using _mm_dp_ps. This didn't make sense, as the
x86 instruction latency manual lists the dpps instruction at only 12 cycles of
latency.
Further investigation showed that dpps with a memory source operand (e.g. dpps
$15, (%rex, %rdx), %xmm0) is about 55% slower than issuing a movaps followed by
a dpps with register source and destination operands!
I've created and attached a simple test case that performs dpps in three ways:
using intrinsics, using hand-written assembly with register operands, and using
hand-written assembly with a memory source operand. For the intrinsics version,
the compiler emitted instructions with memory operands.
Here is how I compiled the test:
clang++ -march=native -std=c++11 -stdlib=libc++ -O3 dotps.cpp
Here are the results I got (running each version 100000000 times):
DotPsIntrin: 249.332 ms
DotPsFast: 159.916 ms
DotPsSlow: 249.076 ms
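Timings like these could be collected with a small harness along the following
lines. This is only a sketch under assumptions: the name time_ms and its shape
are hypothetical, and the attached test case may measure things differently.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical harness: calls fn() `iterations` times and returns the
// elapsed wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn, std::uint64_t iterations) {
    using clock = std::chrono::steady_clock;
    volatile float sink = 0.0f;  // each result is stored to a volatile
    const auto start = clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sink = fn();
    }
    const auto stop = clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

The volatile store forces a write on every iteration, but an optimizer may
still hoist a loop-invariant computation out of the loop, so the generated
assembly should be checked before trusting the numbers. For the figures above,
249.332 / 159.916 is roughly 1.56, which is where the "about 55% slower"
estimate comes from.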
I also tried the intrinsics version with Visual Studio 2012 and the Intel
compiler. Both generated the efficient movaps/dpps sequence, even with
"Optimize for Space" specified.