[PATCH] [X86] transform insertps to blendps when possible for better performance
spatel at rotateright.com
Mon Mar 2 14:52:59 PST 2015
Comment at: lib/Target/X86/X86ISelLowering.cpp:22932-22935
@@ +22931,6 @@
+ // FIXME: If optimizing for size and there is a load folding opportunity,
+ // we should either not do this transform or we should undo it in
+ // PerformBLENDICombine. The above check for "MayFoldLoad" doesn't work
+ // because it doesn't look through a SCALAR_TO_VECTOR node.
> I think we need to fix this always, and I think we should just handle it here in the "may fold load" case.
> The primary reason to use insertps is to fold a scalar load into the shuffle. Switching to blendps is a huge mistake there because now we have to do a scalar -> vector transfer first. In many cases, we may end up emitting insertps+blendps or movss+blendps, which doesn't seem like the right lowering to me.
I think I understand the concern now. You're worried about the cases where we're loading into one of the higher lanes. That could easily require a bonus shuffle instruction if converted to a blendps. Let me resubmit the patch to handle only the single case of the low 32-bit lane, because that should just be a movss from memory.
I was focused on the low-element case, i.e., which of these exact sequences is better:
vmovss C0(%rip), %xmm1 <--- load into low lane; no shuffling before blendps
vblendps $1, %xmm1, %xmm0, %xmm0
vinsertps $0, C0(%rip), %xmm0, %xmm0
Sequences of movss+blendps have better throughput on SandyBridge and Haswell than insertps.
Here's what IACA shows for the load cases:
SandyBridge: 2x throughput
Haswell: 2x throughput (we're limited by the loads here; blendps has 3x throughput on its own)
I wrote a test program to confirm the load case performance on SandyBridge:
blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).
This is for a string of 100 independent shuffle ops like:
vmovss ones(%rip), %xmm0
vblendps $1, %xmm0, %xmm1, %xmm1
vmovss ones(%rip), %xmm0
vblendps $1, %xmm0, %xmm2, %xmm2
On AMD Jaguar, the independent strings of the load versions of blendps and insertps perform equivalently. For a string of dependent load+shuffle ops, I see the same 2x perf win for blendps due to the lower latency of the blendps instruction.
insertps is an extra-special wart in the SSE instruction set: it can't be extended to longer vectors (AVX, AVX512), so it will never get any extra transistors thrown its way to improve its performance relative to better-defined vector instructions. We should be careful generating insertps; some day it may end up microcoded.