[PATCH] D85031: [builtins] Unify the softfloat division implementation

Mon Aug 31 06:54:49 PDT 2020

scanon added inline comments.

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99
+  //   0 < x0(b) < 1
+  //   abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2)
+
----------------
sepavloff wrote:
> atrosinenko wrote:
> > sepavloff wrote:
> > > This estimation is absent from the original comment. Do you have reference where it came from? Also the original comment states `This is accurate to about 3.5 binary digits`. Is still true? If yes, it could be worth copying here.
> > This approximation was deduced by writing down the derivative of `f ` "in infinite precision" and finding its root. Then the values of `f` applied to its root, 1.0 and 2.0 were calculated -- as far as I remember, **all of them** were `3/4 - 1/sqrt(2)` or negated - //this is what "minimax polynomial" probably means, that term was just copied from the original implementation :)//.
> IIUC, you don't want to put this statement here because you are us not sure it is true? Sounds reasonable.
To be precise, what _minimax polynomial_ means is that p(x) = 3/4 + 1/sqrt(2) - x/2 is the first-order polynomial that minimizes the error term max(|1/x - p(x)|) on the interval [1,2]. I.e. every other linear polynomial would achieve a larger maximum error.

The bound of a minimax approximation to a well-behaved function is always achieved at the endpoints, so we can just evaluate at 1 to get the max error: |1/1 - 3/4 - 1/sqrt(2) + 1/2| = 3/4 - 1/sqrt(2) = 0.04289... (which is actually about _4.5_ bits).

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:101-102
+
+  // Then, refine the reciprocal estimate using a Newton-Raphson iteration:
+  //     x_{n+1} = x_n * (2 - x_n * b)
+  // Let e_n := x_n - 1/b_hw
----------------
atrosinenko wrote:
> sepavloff wrote:
> > The original comment states:
> > ```
> >   // This doubles the number of correct binary digits in the approximation
> >   // with each iteration.
> > ```
> > It is true in this implementation? If yes, it could be worth copying here.
> For me this looks too vague. This is probably //approximately true// but I don't know how exactly this should be interpreted.
N-R is quadratically convergent under a bunch of assumptions on how good the initial guess is and bounds on the second derivative, which are all satisfied here, but probably not worth going into in the comments. IIRC the usual reference here is Montuschi and Mezzalama's "Survey of square rooting algorithms" (1990).

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109
+
+#if NUMBER_OF_HALF_ITERATIONS > 0
+  // Starting with (n-1) half-width iterations
----------------
sepavloff wrote:
> atrosinenko wrote:
> > sepavloff wrote:
> > > It is good optimization. Could you please put a comment shortly describing the idea of using half-sized temporaries?
> > The idea is just "I guess this takes less CPU time and I have managed to prove error bounds for it". :) Specifically, for float128, the rep_t * rep_t multiplication will be emulated with lots of CPU instructions while the lower half contain some noise at that point. This particular optimization did exist in the original implementation for float64 and float128. For float32 it had not much sense, I guess. Still, estimations were calculated for the case of float32 with half-size iterations as it may be useful for MSP430 and other 16-bit targets.
> The idea is clear but it require some study of the sources. I would propose to add a comment saying:
> ```
> At the first iterations number of significant digits is small, so we may use shorter type values. Operations on them are usually faster.
> ```
> or something like that.
This is absolutely standard in HW construction of pipelined iterative dividers and square root units, so I'm not sure how much explanation is really needed =)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85031/new/

https://reviews.llvm.org/D85031