[PATCH] D21127: Remove redundant FMUL in Newton-Raphson SQRT code

Wed Jun 8 09:07:22 PDT 2016

spatel added a comment.

This seems like a good perf optimization to me in general, but a few high-level questions:

1. How does the refactoring of the multiplication affect the accuracy of the results? I attached some test programs that could be modified to answer this in https://llvm.org/bugs/show_bug.cgi?id=21385 .

2. The case of converting a sqrt into an estimate sequence is a problem in https://llvm.org/bugs/show_bug.cgi?id=24063 . Would this patch sidestep that bug by producing the more accurate/expected result for that case?

3. Given that we have 3 generations of Intel FMA FPUs now (and the non-x86 world has always had FMA), it would be interesting to know the accuracy (and codegen tests should probably be added) for the FMA variant. It would also be good to add tests with 2 refinement steps, so we know this patch is behaving as expected in that case, but...

4. (Apologies for straying from the immediate goal of the patch...) Given that Intel FPUs in Broadwell and Skylake have reduced IEEE-compliant div/sqrt to 4 and then 3 cycle throughputs according to Agner's tables, using an estimate is likely a perf misoptimization for current and future Intel big cores. I think that we can fix this using the existing hook 'isFsqrtCheap()', but please confirm that this patch preserves that opportunity.

http://reviews.llvm.org/D21127