[PATCH] Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)

Thu Oct 9 14:50:24 PDT 2014

>>! In D5658#4, @hfinkel wrote:
> In short, this is likely a big win *if* there is anything else going on (floating-point-wise), even on Intel cores. I could be wrong, but this very-much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder if it could somehow be applied to this as well.

Yes, I agree with the throughput argument. And if the user can flip just this one bit of codegen with an attribute flag, that would be ideal.

I didn't want to overstep with this patch though, so I thought it'd be best to just start with the core where I know it's always a win to get rid of the sqrtss/divss. 

@tycho may be able to better assess the perf differences on Intel. Some informal benchmarking using an n-body program certainly does show big wins using the rsqrt code on Haswell.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14341
@@ +14340,3 @@
+/// Override the default NR algorithm because the 2-constant implementation
+/// runs faster on Intel SandyBridge and AMD Jaguar (btver2). It has one
+/// less FP instruction in exchange for an extra constant load that should
----------------
hfinkel wrote:
> I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner along side the 1-constant version, and just let the target pick. The algorithm itself is really a mathematical expression, and not at all really target dependent, and we should try to keep such things available to other targets without copy-and-paste.
> 
> Ideally, we'd then also have a flag to force one or the other, so that way PPC can default  to the 1-constant version, X86 can default to the 2-constant version, but there's a command-line option I can use to force the choice for benchmarking.
Agree - I'll rework this.

http://reviews.llvm.org/D5658