[PATCH] D21127: Remove redundant FMUL in Newton-Raphson SQRT code

Thu Jun 9 00:16:13 PDT 2016

n.bozhenov added a comment.

Thanks for the great questions, Sanjay!

I slightly modified the example from PR21385 and compared these two
sequences for computing square roots (the current one and the patched one):

  // Current sequence (5 multiplies):
  est1 = (-0.5f * est0) * (-3.0f + est0 * est0 * f) * f;
  // Patched sequence (4 multiplies), reusing ae = est0 * f:
  float ae = est0 * f;
  est2 = (-0.5f * ae) * (-3.0f + est0 * ae);
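Written out as math, and assuming est0 is an estimate of 1/sqrt(f) (on x86, for example, the RSQRTSS result), both sequences are one Newton-Raphson step for 1/sqrt(f) followed by a multiplication by f. The sketch below, with e_0 for est0 and a for ae, shows that the two agree in exact arithmetic and differ only in how the products are grouped, and therefore in where single-precision rounding happens:

```latex
% One Newton-Raphson step for 1/\sqrt{f}, starting from e_0 \approx 1/\sqrt{f}:
e_1 = e_0\left(\tfrac{3}{2} - \tfrac{1}{2} f e_0^2\right)
    = \left(-\tfrac{1}{2} e_0\right)\left(-3 + e_0 e_0 f\right)

% Current sequence: \sqrt{f} \approx e_1 f   (5 multiplies)
\mathrm{est1} = \left(-\tfrac{1}{2} e_0\right)\left(-3 + e_0 e_0 f\right) f

% Patched sequence with a = e_0 f   (4 multiplies, counting a)
\mathrm{est2} = \left(-\tfrac{1}{2} a\right)\left(-3 + e_0 a\right)
              = \left(-\tfrac{1}{2} e_0 f\right)\left(-3 + e_0^2 f\right)
```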

And I obtained the following results:

                                     est1         est2
  Total tests:                 2130706432   2130706432
  Inexact results:              926539007    834159368
  Estimate missed by  1 ULP:    862814916    796017331
  Estimate missed by  2 ULP:     62179595     37665787
  Estimate missed by  3 ULP:      1537746       476250
  Estimate missed by  4 ULP:         6750            0
  Estimate missed by >4 ULP:            0            0

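As you can see, with the patch the square roots are noticeably more
accurate on average: the number of inexact results drops from 926539007
to 834159368, and the worst-case error drops from 4 ULP to 3 ULP.
I don't have a good explanation for this, though.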
I ran these tests on several Intel microarchitectures (including
Atom) and got exactly the same results on all of them.

As for the improved efficiency of hardware SQRT on modern Intel CPUs,
I'm working on a patch that addresses this particular issue, and I will
share it very soon.

http://reviews.llvm.org/D21127