[PATCH] D85031: [builtins] Unify the softfloat division implementation

Anatoly Trosinenko via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Aug 20 10:23:16 PDT 2020


atrosinenko added a comment.

//Some general context:// The final goal was to have an explanation of why this particular number of iterations (3, 4 or 5, depending on the type) is enough for any `a` and `b` passed as input arguments, taking into account the errors due to particular finite-precision computations. Initially, I was trying to just "mechanically" unify the three implementations and their comments like this <https://github.com/access-softek/llvm-project/pull/24>. After trying for a while, it turned out I could not simply pick up the original proof and add the subnormal case: some of the statements were too vague, some seemed insufficiently explained, etc. Then an attempt was made to re-prove it from scratch, with the intention that the resulting explanation be as self-contained as possible. The **implementation**, on the other hand, gathers various features of the three original functions, plus some hacks that make it easier to prove.



================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99
+  //   0 < x0(b) < 1
+  //   abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2)
+
----------------
sepavloff wrote:
> This estimation is absent from the original comment. Do you have reference where it came from? Also the original comment states `This is accurate to about 3.5 binary digits`. Is still true? If yes, it could be worth copying here.
This approximation was deduced by writing down the derivative of `f` "in infinite precision" and finding its root. Then the values of `f` at its root, at 1.0 and at 2.0 were calculated -- as far as I remember, **all of them** were `3/4 - 1/sqrt(2)` or its negation - //this is probably what "minimax polynomial" means; that term was just copied from the original implementation :)//.
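For reference, the bound can be checked numerically. A sketch of my own (not code from the patch), assuming the initial estimate is the linear approximation `x0(b) = 3/4 + 1/sqrt(2) - b/2` on `[1, 2)` implied by the constant `C` below; its error attains `+-(3/4 - 1/sqrt(2))` at `b = 1`, `sqrt(2)` and `2`:

```python
import math

C = 0.75 + 1.0 / math.sqrt(2)       # constant of the linear estimate
x0 = lambda b: C - b / 2.0          # approximation of 1/b on [1, 2)

# Scan [1, 2) and find the worst-case absolute error of the estimate.
worst = max(abs(x0(1 + i / 4096.0) - 1 / (1 + i / 4096.0))
            for i in range(4096))
bound = 0.75 - 1.0 / math.sqrt(2)   # claimed bound, ~0.0429
assert worst <= bound + 1e-12
```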


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:101-102
+
+  // Then, refine the reciprocal estimate using a Newton-Raphson iteration:
+  //     x_{n+1} = x_n * (2 - x_n * b)
+  // Let e_n := x_n - 1/b_hw
----------------
sepavloff wrote:
> The original comment states:
> ```
>   // This doubles the number of correct binary digits in the approximation
>   // with each iteration.
> ```
> It is true in this implementation? If yes, it could be worth copying here.
To me this looks too vague. It is probably //approximately true//, but I don't know how exactly it should be interpreted.
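One way to make the claim precise (my own sketch, not from the patch): in exact arithmetic the iteration gives `e_{n+1} = -b * e_n^2`, so the error is squared each step, which is roughly what "doubles the number of correct binary digits" means:

```python
# Newton-Raphson step for f(x) = 1/x - b; in exact arithmetic
# e_{n+1} = x_{n+1} - 1/b = -b * e_n^2, i.e. the error is squared.
b = 1.5
x = 0.7                            # crude initial estimate of 1/b
for _ in range(3):
    e = x - 1.0 / b
    x = x * (2.0 - x * b)          # one Newton-Raphson refinement
    # check the recurrence (up to floating-point noise)
    assert abs((x - 1.0 / b) + b * e * e) < 1e-12
assert abs(x - 1.0 / b) < 1e-9
```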


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109
+
+#if NUMBER_OF_HALF_ITERATIONS > 0
+  // Starting with (n-1) half-width iterations
----------------
sepavloff wrote:
> It is good optimization. Could you please put a comment shortly describing the idea of using half-sized temporaries?
The idea is just "I guess this takes less CPU time, and I have managed to prove error bounds for it". :) Specifically, for float128 the rep_t * rep_t multiplication will be emulated with lots of CPU instructions, while the lower half contains mostly noise at that point. This particular optimization did exist in the original implementations for float64 and float128. For float32 it made little sense, I guess. Still, the estimations were calculated for the case of float32 with half-size iterations, as that may be useful for MSP430 and other 16-bit targets.
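To illustrate the idea, here is a simplified model of my own (not the actual builtins code; the constant `0x7504` and the update formulas are from the quoted patch lines, the wrapper is hypothetical). It refines `1/b` in UQ0.16 using only 16-bit values and 16x16->32-bit products; `b` is restricted to `(1, 2)` to sidestep the `x == 1.0` overflow corner case that the real code has to handle separately:

```python
HW = 16
MASK = (1 << HW) - 1

def recip_estimate_hw(b, iters=2):
    """Refine 1/b in UQ0.16 with half-width iterations; b in (1, 2)."""
    b_UQ1_hw = int(b * (1 << (HW - 1)))   # b in UQ1.(HW-1)
    C_hw = 0x7504                         # (3/4 + 1/sqrt(2)) - 1, truncated
    # x0 = 3/4 + 1/sqrt(2) - b/2; the mod-2^HW wraparound of the
    # subtraction lands exactly on x0's UQ0.HW representation.
    x_UQ0_hw = (C_hw - b_UQ1_hw) & MASK
    for _ in range(iters):
        # corr = 2 - x*b in UQ1.(HW-1); 2.0 is 2^HW, i.e. 0 mod 2^HW
        corr_UQ1_hw = (0 - ((x_UQ0_hw * b_UQ1_hw) >> HW)) & MASK
        x_UQ0_hw = (x_UQ0_hw * corr_UQ1_hw >> (HW - 1)) & MASK
    return x_UQ0_hw / float(1 << HW)      # UQ0.HW -> float, for checking
```

After two half-width iterations the estimate is already good to well under 2^-14, leaving the final full-width iteration to reach full precision.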


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:114
+  // C is (3/4 + 1/sqrt(2)) - 1 truncated to W0 fractional bits as UQ0.HW
+#if defined(SINGLE_PRECISION)
+  const half_rep_t C_hw = HALF_REP_C(0x7504) << (HW - 16);
----------------
sepavloff wrote:
> In what cases 16-bit temporaries are used?  `NUMBER_OF_HALF_ITERATIONS` is set to zero in `divsf3.c`.
Agree, this needs to be re-evaluated, and at least a comment should be added. This may be dead code for now; it was //expected// to speed things up on 16-bit targets, which sometimes even lack hardware multiplication (such as MSP430).


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:130
+    // Cannot overflow by construction and is larger than 2 - b*x by at most 1*Ulp.
+    half_rep_t corr_UQ1_hw = 0 /* = 2 */ - ((rep_t)x_UQ0_hw * b_UQ1_hw >> HW);
+    // Would be at most [1.]00000 after overflow if rounding (0 - x_UQ0_hw * b_UQ1_hw) down.
----------------
sepavloff wrote:
> It would be better to put short comment to explain using 0 instead of 2.
Agree, it was expected to be something like `/* = 2.0 in UQ1.(HW-1) */`. Naming things is especially painful here...
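A hypothetical one-liner demo of the trick: 2.0 in UQ1.(HW-1) has the bit pattern 2^HW, which is exactly 0 modulo 2^HW, so `2 - t` and `0 - t` coincide in HW-bit unsigned arithmetic:

```python
HW = 16
MASK = (1 << HW) - 1
t = 34755                      # some x*b product in UQ1.(HW-1), ~1.06
corr = (0 - t) & MASK          # unsigned wraparound: equals 2^HW - t
assert corr == (1 << HW) - t   # i.e. "2.0 - x*b" in UQ1.(HW-1)
```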


================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:184
+  // x_UQ0 * b_UQ1 = (x_UQ0_hw * 2^HW) * (b_UQ1_hw * 2^HW + blo) - b_UQ1
+  rep_t corr_UQ1 = 0 - (   (rep_t)x_UQ0_hw * b_UQ1_hw
+                        + ((rep_t)x_UQ0_hw * blo >> HW)
----------------
sepavloff wrote:
> `x_UQ0_hw` and `b_UQ1_hw` are declared inside the conditional block `#if NUMBER_OF_HALF_ITERATIONS > 0`. Does `NUMBER_OF_FULL_ITERATIONS != 1` always imply `NUMBER_OF_HALF_ITERATIONS > 0 `?

Hmm... It should imply `== 0`, at first glance... Generally, the total number of iterations should be 3 for f32, 4 for f64 and 5 for f128; the error bounds are then calculated. There are essentially only two modes: (n-1) half-size iterations + 1 full-size iteration, OR n full-size iterations (on the one hand, one generally gains no performance from using 16x16-bit multiplications; on the other, that particular case turned out to require extra rounding).
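The iteration counts can be sanity-checked from the recurrence `|e_{n+1}| = b * |e_n|^2 < 2 * |e_n|^2` and the initial bound `3/4 - 1/sqrt(2)`. This sketch of my own ignores fixed-point rounding errors, which the real proof has to account for, so the actual margins are smaller:

```python
import math

e = 0.75 - 1.0 / math.sqrt(2)   # worst-case error of the initial estimate
bounds = []
for _ in range(5):
    e = 2.0 * e * e             # |e_{n+1}| = b*|e_n|^2 < 2*|e_n|^2 (b < 2)
    bounds.append(e)

# Enough precision (in exact arithmetic) for each format's significand:
assert bounds[2] < 2.0 ** -24   # 3 iterations for f32
assert bounds[3] < 2.0 ** -53   # 4 iterations for f64
assert bounds[4] < 2.0 ** -113  # 5 iterations for f128
```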


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85031/new/

https://reviews.llvm.org/D85031


