[PATCH] [X86] transform insertps to blendps when possible for better performance
spatel at rotateright.com
Mon Mar 2 14:52:59 PST 2015
Comment at: lib/Target/X86/X86ISelLowering.cpp:22932-22935
@@ +22931,6 @@
+ // FIXME: If optimizing for size and there is a load folding opportunity,
+ // we should either not do this transform or we should undo it in
+ // PerformBLENDICombine. The above check for "MayFoldLoad" doesn't work
+ // because it doesn't look through a SCALAR_TO_VECTOR node.
> I think we need to fix this always, and I think we should just handle it here in the "may fold load" case.
> The primary reason to use insertps is to fold a scalar load into the shuffle. Switching to blendps is a huge mistake there because now we have to do a scalar -> vector transfer first. In many cases, we may end up emitting insertps+blendps or movss+blendps, which doesn't seem like the right lowering to me.
I think I understand the concern now. You're worried about the cases where we're loading into one of the higher lanes. That could easily require a bonus shuffle instruction if converted to a blendps. Let me resubmit the patch to handle only the single case of the low 32-bit lane, because that should just be a movss from memory.
I was focused on the low-element case, i.e., which of these exact sequences is better:
vmovss C0(%rip), %xmm1 <--- load into low lane; no shuffling before blendps
vblendps $1, %xmm1, %xmm0, %xmm0
vinsertps $0, C0(%rip), %xmm0, %xmm0
Sequences of movss+blendps have better throughput on SandyBridge and Haswell than insertps.
Here's what IACA shows for the load cases:
SandyBridge: 2x throughput
Haswell: 2x throughput (we're limited by the loads here; blendps has 3x throughput on its own)
I wrote a test program to confirm the load case performance on SandyBridge:
blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).
This is for a string of 100 independent shuffle ops like:
vmovss ones(%rip), %xmm0
vblendps $1, %xmm0, %xmm1, %xmm1
vmovss ones(%rip), %xmm0
vblendps $1, %xmm0, %xmm2, %xmm2
On AMD Jaguar, the independent strings of the load versions of blendps and insertps perform equivalently. For a string of dependent load+shuffle ops, I see the same 2x perf win for blendps due to the lower latency of the blendps instruction.
insertps is an extra-special wart in the SSE instruction set: it can't be extended to longer vectors (AVX, AVX512), so it will never get any extra transistors thrown its way to improve its performance relative to better-defined vector instructions. We should be careful generating insertps; some day it may end up microcoded.