[PATCH] Refactor reciprocal and reciprocal square root estimate into target-independent functions (part 2).

Sat Sep 27 18:22:18 PDT 2014

REPOSITORY
  rL LLVM

================
Comment at: include/llvm/Target/TargetLowering.h:2632
@@ +2631,3 @@
+  /// returned by one of the above methods.
+  virtual unsigned getNRSteps(EVT VT) const {
+    return 0;
----------------
spatel wrote:
> hfinkel wrote:
> > spatel wrote:
> > > hfinkel wrote:
> > > > The number of iterations necessary for the reciprocal estimate and for the reciprocal sqrt estimate might be different. Please provide a way to differentiate (and I'd want to make really sure the target actually overrides this). Maybe:
> > > > 
> > > >   virtual unsigned getNRSteps(EVT VT, bool SqrtEst) const {
> > > >     llvm_unreachable("Target must provide the number of iterations");
> > > >   }
> > > > 
> > > Sure - I'll make separate functions to return iteration counts for sqrte and rcpe.
> > > 
> > > We may need one more refinement here regarding the rcpe(rsqrt(x)) transformation of a regular sqrt(x)...my guess is that's not a win on any recent X86 (and probably not PPC either?). But that change can come later if needed.
> > Regarding PPC, you might be right about some of them -- it is certainly a win on the embedded cores where the sqrt instruction is not fully pipelined. We'll need to do some measurements.
> It's coming back to me now (used to be at IBM and Apple)...
> I think the deciding factor is not whether the sqrt instruction is pipelined, but whether it exists at all. E.g., the 7400/7450 had fre/frsqrte, but lacked fsqrt. In that case, the decision is between doing a long sequence of dependent ops using the estimates vs. making a call to libm sqrt(). If fsqrt exists, it should probably be used unless there's some truly horrible HW implementation out there.
> Certainly, this should be measured on as many targets as possible to see if it's true.
Obviously whether it exists at all matters, but the pipelining definitely also matters -- at least on some cores. On the A2 (an embedded core), for example, a full sqrt blocks the issuing thread from issuing any additional floating-point instructions for 69 cycles. There the pipelining definitely matters, but on other cores I'm less certain (which is why I said that we'd need to measure it).
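
For reference, the refinement steps whose count is being discussed above can be sketched roughly as follows. This is only an illustration in plain C++, not the code in the patch: truncateMantissa, refineRecip and refineRSqrt are hypothetical names, and truncateMantissa merely stands in for a low-precision hardware estimate such as fre/frsqrte. The two update formulas are the standard Newton-Raphson iterations for 1/a and 1/sqrt(a).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware estimate instruction: keep only the
// top `Bits` mantissa bits of a float, discarding the rest.
static float truncateMantissa(float V, unsigned Bits) {
  assert(Bits <= 23 && "float has only 23 explicit mantissa bits");
  std::uint32_t U;
  std::memcpy(&U, &V, sizeof(U));
  U &= ~((1u << (23 - Bits)) - 1u);
  std::memcpy(&V, &U, sizeof(V));
  return V;
}

// Newton-Raphson refinement of an estimate of 1/A:
//   Est' = Est * (2 - A * Est)
static float refineRecip(float A, float Est, unsigned Steps) {
  for (unsigned I = 0; I < Steps; ++I)
    Est = Est * (2.0f - A * Est);
  return Est;
}

// Newton-Raphson refinement of an estimate of 1/sqrt(A):
//   Est' = Est * (1.5 - 0.5 * A * Est * Est)
static float refineRSqrt(float A, float Est, unsigned Steps) {
  for (unsigned I = 0; I < Steps; ++I)
    Est = Est * (1.5f - 0.5f * A * Est * Est);
  return Est;
}
```

Each step roughly doubles the number of correct bits, so the step count a target needs depends on how precise its initial estimate is. The two estimate instructions need not have the same precision, which is one reason the counts for rcpe and sqrte might differ. In this sketch, refineRecip(3.0f, truncateMantissa(1.0f / 3.0f, 8), 2) starts from about 8 good bits and should land close to full single precision after two steps.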

http://reviews.llvm.org/D5484