[PATCH] Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)

Thu Oct 9 12:42:46 PDT 2014

> We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles which is better than the 23 cycle critical path for the rsqrt + mul + mul + add + mul estimate.

Critical-path latency is a useful metric, but throughput usually matters much more. According to Intel's optimization manual, rsqrtss, for example, is fully pipelined on most Intel cores (the dispatch delay is 3 cycles on Westmere and Nehalem, and 1 cycle elsewhere). By contrast, the dispatch delay for sqrtss is 7 cycles on Haswell, 7-14 cycles on Sandy Bridge, something under 16 cycles on Westmere and Nehalem, and 11 cycles on Silvermont. The throughput of divss is a little better than that of sqrtss, but not by much.

In short, this is likely a big win *if* there is any other floating-point work going on, even on Intel cores. I could be wrong, but this very much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder whether it could somehow be applied here as well.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14341
@@ +14340,3 @@
+/// Override the default NR algorithm because the 2-constant implementation
+/// runs faster on Intel SandyBridge and AMD Jaguar (btver2). It has one
+/// less FP instruction in exchange for an extra constant load that should
----------------
I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner alongside the 1-constant version, and just let the target pick. The algorithm itself is really a mathematical expression, not something target-dependent, and we should try to keep such things available to other targets without copy-and-paste.

Ideally, we'd then also have a flag to force one or the other. That way PPC can default to the 1-constant version and X86 can default to the 2-constant version, while a command-line option still lets me force the choice for benchmarking.
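
For reference, here is a minimal scalar sketch of one Newton-Raphson refinement step for a reciprocal square root estimate, written in both forms, to show that the two are the same mathematical expression. The function names are hypothetical and are not LLVM APIs, and the operation ordering is an assumption, chosen so that the 2-constant form has the mul + mul + add + mul dependency chain mentioned in the quoted text. The patch itself builds DAG nodes rather than scalar C++.

```cpp
#include <cmath>

// Hypothetical helper names for illustration only; neither is an LLVM API.
// Both refine an estimate est ~= 1/sqrt(x) with one Newton-Raphson step:
//   est' = est * (1.5 - 0.5 * x * est * est)

// 1-constant form: only 1.5 is loaded; 0.5*x is derived as x*1.5 - x.
// Six FP operations in total.
float refineRsqrtOneConst(float x, float est) {
  float halfX = x * 1.5f - x;  // mul, sub
  float t = est * est;         // mul
  t = halfX * t;               // mul
  t = 1.5f - t;                // sub
  return est * t;              // mul
}

// 2-constant form: loads -3.0 and -0.5, algebraically the same step:
//   est' = (-0.5 * est) * (x * est * est + -3.0)
// Five FP operations in total, one fewer than the form above.
float refineRsqrtTwoConst(float x, float est) {
  float xe = x * est;              // mul
  float xee = xe * est;            // mul
  float sum = xee + -3.0f;         // add
  float negHalfEst = est * -0.5f;  // mul, independent of the chain above
  return negHalfEst * sum;         // mul
}
```

Only the constants and the operation count differ between the two, which is why a single shared implementation with a target-selectable variant seems feasible.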

http://reviews.llvm.org/D5658