[PATCH] Prefer blendps over insertps codegen for one special case [X86]

Quentin Colombet qcolombet at apple.com
Fri Mar 20 12:40:17 PDT 2015


LGTM with the MinSize fix.

Q.


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:10520
@@ +10519,3 @@
+      const Function *F = DAG.getMachineFunction().getFunction();
+      bool OptForSize = F->hasFnAttribute(Attribute::OptimizeForSize);
+      if (IdxVal == 0 && (!OptForSize || !MayFoldLoad(N1))) {
----------------
spatel wrote:
> qcolombet wrote:
> > Instead of checking for OptimizeForSize, I would check for MinSize or both.
> 
> Hi Quentin -
> 
> Thanks for looking at the patch. I had not seen MinSize used before. That corresponds to -Oz?
Yes, MinSize corresponds to -Oz.
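
For illustration, the condition with the MinSize check folded in could look roughly like the sketch below. It is not the committed patch. It takes the "check both" reading of the suggestion, and FnAttrs and preferBlendps are hypothetical stand-ins for the F->hasFnAttribute(...) queries and the MayFoldLoad(N1) result, so that the snippet builds without the LLVM headers.

```cpp
// Hypothetical stand-in for the function attributes queried in the patch.
// In LLVM these values would come from F->hasFnAttribute(...).
struct FnAttrs {
  bool OptimizeForSize; // models Attribute::OptimizeForSize (-Os)
  bool MinSize;         // models Attribute::MinSize (-Oz)
};

// Hypothetical helper: returns true when the blendps lowering would be chosen.
// Same shape as the condition in the diff, except that either size attribute
// counts as "optimizing for size" (an assumption about the intended fix).
constexpr bool preferBlendps(unsigned IdxVal, FnAttrs Attrs, bool MayFoldLoad) {
  return IdxVal == 0 &&
         (!(Attrs.OptimizeForSize || Attrs.MinSize) || !MayFoldLoad);
}
```

With this shape, a foldable load keeps the insertps form only when a size attribute is set; otherwise an insertion at index 0 takes the blendps path.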

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:10521
@@ +10520,3 @@
+      bool OptForSize = F->hasFnAttribute(Attribute::OptimizeForSize);
+      if (IdxVal == 0 && (!OptForSize || !MayFoldLoad(N1))) {
+        // If this is an insertion of 32-bits into the low 32-bits of
----------------
spatel wrote:
> qcolombet wrote:
> > As soon as there is a folding opportunity, wouldn’t it be better to use it?
> > Could you check that with IACA?
> I checked this with real code running on SandyBridge, Haswell, and Jaguar. Load folding does not improve performance here. The usage of insertps is the limiting factor because it can only execute on one port.
> 
> Here's the SB result from the earlier patch for a microbenchmark including loads:
> 
> blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
> insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).
Thanks for checking.
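
As a quick check of the figures quoted above, the per-iteration cost and the relative slowdown can be recomputed from the raw cycle counts. This is only arithmetic on the two reported totals; the names cyclesPerIter, BlendpsCPI, InsertpsCPI, and Slowdown are hypothetical and not part of the benchmark.

```cpp
// Hypothetical helper: average cycles per iteration from a total cycle count.
constexpr double cyclesPerIter(unsigned long long Cycles,
                               unsigned long long Iters) {
  return static_cast<double>(Cycles) / static_cast<double>(Iters);
}

// Totals as reported for the SandyBridge microbenchmark.
constexpr unsigned long long Iters = 150000000ULL;
constexpr double BlendpsCPI = cyclesPerIter(5381572012ULL, Iters);   // about 35.88
constexpr double InsertpsCPI = cyclesPerIter(10387753446ULL, Iters); // about 69.25

// insertps cost relative to blendps in this benchmark, about 1.93x.
constexpr double Slowdown = InsertpsCPI / BlendpsCPI;
```

On these numbers the insertps variant takes a little under twice as many cycles per iteration as the blendps variant.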

http://reviews.llvm.org/D8332
