[PATCH] D22975: Compute the Newton series natively

Tue Aug 9 15:40:30 PDT 2016

t.p.northover added inline comments.

================
Comment at: llvm/test/CodeGen/X86/sqrt-fastmath.ll:42-45
@@ -41,6 +41,6 @@
 ; ESTIMATE-NEXT:    vmulss %xmm1, %xmm2, %xmm1
 ; ESTIMATE-NEXT:    vxorps %xmm2, %xmm2, %xmm2
-; ESTIMATE-NEXT:    vcmpeqss %xmm2, %xmm0, %xmm0
-; ESTIMATE-NEXT:    vandnps %xmm1, %xmm0, %xmm0
+; ESTIMATE-NEXT:    vcmpeqss %xmm2, %xmm0, %xmm2
+; ESTIMATE-NEXT:    vblendvps %xmm2, %xmm0, %xmm1, %xmm0
 ; ESTIMATE-NEXT:    retq
   %call = tail call float @__sqrtf_finite(float %f) #1
----------------
evandro wrote:
> RKSimon wrote:
> > spatel wrote:
> > > No worries. Note that I've used a modified version of that script to generate checks for targets besides x86 - in case anyone would like to enhance the script and make test generation easier for AArch64. :)
> > As Sanjay said, the use of vblendvps over vandnps is a regression that could affect throughput quite badly.
> @t.p.northover, is Sanjay onto something, i.e. could AArch64 use a fold here instead?  Otherwise, I could move the check for 0.0 inside `getSqrtEstimate()`.
We seem to catch the simple cases, based on:

    define <4 x float> @foo(<4 x float> %lhs, <4 x float> %rhs, <4 x float> %val) {
      %tst = fcmp oeq <4 x float> %lhs, %rhs
      %res = select <4 x i1> %tst, <4 x float> %val, <4 x float> zeroinitializer
      ret <4 x float> %res
    }

(I haven't checked, but I suspect it's actually the generic DAG combiner that's doing the fold, not anything AArch64-specific.) I'm not sure why we don't catch the case in this patch, but fixing that would improve performance, probably beyond using `Op`.
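
For reference, the simple-case fold relies on the compare producing an all-ones or all-zeros lane mask, so a select against zero is just a bitwise AND, or an AND-NOT when the zero is on the true side (the shape of the old `vandnps` check line). A minimal scalar sketch of one lane in C; the helper names are hypothetical and are not LLVM APIs:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bits, as the vector unit sees them. */
static uint32_t bits_of(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float float_of(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Models one lane of an ordered-equal compare: all-ones if equal, else zero. */
static uint32_t cmp_eq_mask(float a, float b) { return a == b ? 0xFFFFFFFFu : 0u; }

/* select(mask, val, 0.0) expressed as a branch-free AND. */
static float select_or_zero(uint32_t mask, float val) {
    return float_of(mask & bits_of(val));
}

/* select(mask, 0.0, val) expressed as AND-NOT. */
static float zero_or_select(uint32_t mask, float val) {
    return float_of(~mask & bits_of(val));
}
```

In the new check lines the blend appears to select between `%xmm0` (the input) and `%xmm1` (the estimate), not between a literal zero and the estimate. That may be why the simple pattern doesn't match, but it is a guess from reading the assembly and I haven't confirmed it in the combiner.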


Repository:
  rL LLVM

https://reviews.llvm.org/D22975