[PATCH] D85031: [builtins] Unify the softfloat division implementation

Thu Aug 20 22:09:43 PDT 2020

sepavloff added inline comments.

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:99
+  //   0 < x0(b) < 1
+  //   abs(x0(b) - 1/b) <= 3/4 - 1/sqrt(2)
+
----------------
atrosinenko wrote:
> sepavloff wrote:
> > This estimation is absent from the original comment. Do you have reference where it came from? Also the original comment states `This is accurate to about 3.5 binary digits`. Is still true? If yes, it could be worth copying here.
> This approximation was deduced by writing down the derivative of `f ` "in infinite precision" and finding its root. Then the values of `f` applied to its root, 1.0 and 2.0 were calculated -- as far as I remember, **all of them** were `3/4 - 1/sqrt(2)` or negated - //this is what "minimax polynomial" probably means, that term was just copied from the original implementation :)//.
IIUC, you don't want to put this statement here because you are us not sure it is true? Sounds reasonable.

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:109
+
+#if NUMBER_OF_HALF_ITERATIONS > 0
+  // Starting with (n-1) half-width iterations
----------------
atrosinenko wrote:
> sepavloff wrote:
> > It is good optimization. Could you please put a comment shortly describing the idea of using half-sized temporaries?
> The idea is just "I guess this takes less CPU time and I have managed to prove error bounds for it". :) Specifically, for float128, the rep_t * rep_t multiplication will be emulated with lots of CPU instructions while the lower half contain some noise at that point. This particular optimization did exist in the original implementation for float64 and float128. For float32 it had not much sense, I guess. Still, estimations were calculated for the case of float32 with half-size iterations as it may be useful for MSP430 and other 16-bit targets.
The idea is clear but it require some study of the sources. I would propose to add a comment saying:
```
At the first iterations number of significant digits is small, so we may use shorter type values. Operations on them are usually faster.
```
or something like that.

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:130
+    // Cannot overflow by construction and is larger than 2 - b*x by at most 1*Ulp.
+    half_rep_t corr_UQ1_hw = 0 /* = 2 */ - ((rep_t)x_UQ0_hw * b_UQ1_hw >> HW);
+    // Would be at most [1.]00000 after overflow if rounding (0 - x_UQ0_hw * b_UQ1_hw) down.
----------------
atrosinenko wrote:
> sepavloff wrote:
> > It would be better to put short comment to explain using 0 instead of 2.
> Agree, it was expected to be something like `/* = 2.0 in UQ1.(HW-1) */`. Naming things is especially painful here...
2.0 cannot be represented in UQ1.X. I would add comment line like:
```
Due to wrapping 2.0 in UQ1.X is equivalent to 0.
```
or something similar.

================
Comment at: compiler-rt/lib/builtins/fp_div_impl.inc:184
+  // x_UQ0 * b_UQ1 = (x_UQ0_hw * 2^HW) * (b_UQ1_hw * 2^HW + blo) - b_UQ1
+  rep_t corr_UQ1 = 0 - (   (rep_t)x_UQ0_hw * b_UQ1_hw
+                        + ((rep_t)x_UQ0_hw * blo >> HW)
----------------
atrosinenko wrote:
> sepavloff wrote:
> > `x_UQ0_hw` and `b_UQ1_hw` are declared inside the conditional block `#if NUMBER_OF_HALF_ITERATIONS > 0`. Does `NUMBER_OF_FULL_ITERATIONS != 1` always imply `NUMBER_OF_HALF_ITERATIONS > 0 `?
> > Does NUMBER_OF_FULL_ITERATIONS != 1 always imply NUMBER_OF_HALF_ITERATIONS > 0 ?
> 
> Hmm... It should imply `== 0`, at first glance... Generally, total number of iterations should be 3 for f32, 4 for f64 and 5 for f128. Then, error bounds are calculated. There are generally only two modes: n-1 half-size iteration + 1 full-size iteration OR n full-size iteration (as one generally has no performance gains from using 16x16-bit multiplications on one hand, and that particular case turned out to require extra rounding, on the other).
I have concern that `x_UQ0_hw` and `x_UQ1_hw` are declared in the block with condition `NUMBER_OF_HALF_ITERATIONS > 0` but uses in the other block with condition `!USE_NATIVE_FULL_ITERATIONS && NUMBER_OF_HALF_ITERATIONS > 0`, so probably there may be a combination of the macros when the variables are used but not declared. Maybe it is impossible due to some reason, in this case a proper check may be put into the block `#ifdef USE_NATIVE_FULL_ITERATIONS` which asserts that `NUMBER_OF_HALF_ITERATIONS > 0`. Otherwise `x_UQ0_hw` and `x_UQ1_hw` need to be moved out of the conditional block.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85031/new/

https://reviews.llvm.org/D85031