[PATCH] D78606: [DAGCombine] Adding a new Newton-Raphson implementation to leverage the FMA

Tue Apr 21 19:29:22 PDT 2020

steven.zhang created this revision.
steven.zhang added reviewers: spatel, RKSimon, hfinkel, renenkel, evandro, jsji, nemanjai, PowerPC.
Herald added subscribers: kerbowa, wuzish, kbarton, hiraditya, nhaehnle, jvesely, arsenm, jholewinski.
Herald added a project: LLVM.
steven.zhang marked an inline comment as done.
steven.zhang added inline comments.

================
Comment at: llvm/include/llvm/CodeGen/TargetLowering.h:4118

+  /// This enum inndicates the different methods we use to do the Newton
+  /// iterations for sqrt/rsqrt
----------------
There is a typo here ("inndicates"); I will fix it later.

This patch adds a new Newton-Raphson implementation that leverages FMA, which saves one instruction for 2 iterations. It also improves the precision due to the use of FMA/FNMSUB.

This is the measurement from PowerPC (courtesy of @renenkel):

  The new algorithm is good to about 0.7 ulps for arguments >= 2^(-1022). The old algorithm is worse at about 1.7 ulps for arguments >= 2^(-1022).
  The new algorithm gives a 1.13x speedup on Power9.
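
For readers who want to reproduce an ulp figure like the ones above, here is a minimal sketch of an ulp-distance helper. The function name ulpDistance is hypothetical and not part of the patch. It compares against a double-precision reference, so it is only a rough check; a careful measurement would need a higher-precision reference.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not from the patch): distance between `approx` and
// `reference`, expressed in units of the spacing between `reference` and the
// next larger representable double.
double ulpDistance(double approx, double reference) {
  double mag = std::fabs(reference);
  double ulp =
      std::nextafter(mag, std::numeric_limits<double>::infinity()) - mag;
  return std::fabs(approx - reference) / ulp;
}
```

For example, ulpDistance(result, std::sqrt(n)) gives an approximate error in ulps for one input n.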

For reference, the new implementation is derived as follows:

  sqrt(n) -> n * rsqrt(n) 

  Newton iteration formula: X{i+1} = X{i} - F(X{i})/F'(X{i})
  F(x) = 1/x^2 - n    # find the x that makes this function zero, i.e. x = rsqrt(n)
  -->
  X{i+1} = X{i} * (1.5 - 0.5*n*X{i}^2)
  -->
  X{i+1} = X{i} + X{i} * (0.5 - 0.5*X{i}*n*X{i})

  sqrt(n) = n*X{i+1} = n*X{i} + n*X{i} * (0.5 - 0.5*X{i}*n*X{i})
  -->
  sqrt(n) = n*X{i} + 0.5*X{i}*(n - (n*X{i})^2)
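
The plain Newton step above can be sketched in C++ as follows. This is only an illustration of the formula X{i+1} = X{i} * (1.5 - 0.5*n*X{i}^2), not the code in the patch; the function names and the caller-supplied starting estimate are assumptions.

```cpp
#include <cmath>

// One Newton-Raphson step for F(x) = 1/x^2 - n, whose positive root is
// 1/sqrt(n): x_next = x * (1.5 - 0.5 * n * x^2).
double rsqrtNewtonStep(double n, double x) {
  return x * (1.5 - 0.5 * n * x * x);
}

// Hypothetical helper: refine a rough estimate of 1/sqrt(n) a fixed number of
// times, then use sqrt(n) = n * rsqrt(n).
double sqrtViaRsqrt(double n, double estimate, int iterations) {
  double x = estimate;
  for (int i = 0; i < iterations; ++i)
    x = rsqrtNewtonStep(n, x);
  return n * x;
}
```

The relative error roughly squares on every step, so a rough estimate such as 0.7 for n = 2 should reach double precision within a handful of iterations.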

So we just need to iterate n*X{i} and 0.5*X{i} according to the formula X{i+1} = X{i} + X{i} * (0.5 - 0.5*X{i}*n*X{i}). First, define:

  H{0} = 0.5 * y0          # 0.5*X{0}, where y0 is the initial estimate of rsqrt(n)
  S{0} = n * y0            # n*X{0}
  D{0} = 0.5 - H{0}*S{0}   # 0.5 - 0.5*X{0}*n*X{0}

Then, with D{i} = 0.5 - H{i}*S{i}, we have:

  H{i+1} = 0.5 * X{i+1} = 0.5*X{i} + 0.5*X{i} * (0.5 - 0.5*X{i}*n*X{i})
  -->
  H{i+1} = H{i} + H{i} * D{i}

  S{i+1} = n * X{i+1} = n*X{i} + n * X{i} *  (0.5 - 0.5*X{i}*n*X{i})
  -->
  S{i+1} = S{i} + S{i} * D{i}

So we can iterate H{i} and S{i} to improve the precision. After that, the final result is:

  sqrt(n) = n*X{i} + 0.5*X{i}*(n - (n*X{i})^2) 
  -->
  sqrt(n) = S{i} + H{i} * (n - S{i}^2)
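
The H/S recurrence and the final correction can be sketched for an arbitrary iteration count like this. It is a minimal illustration of the formulas above, written with ordinary multiplies and adds; the struct and function names are hypothetical and do not come from the patch.

```cpp
#include <cmath>

// Hypothetical container for the two running values:
// H tracks 0.5 * X{i} and S tracks n * X{i}.
struct HSState {
  double H;
  double S;
};

// Apply H{i+1} = H{i} + H{i}*D{i} and S{i+1} = S{i} + S{i}*D{i},
// recomputing D{i} = 0.5 - H{i}*S{i} on every round.
HSState iterateHS(double n, double y0, int iterations) {
  HSState st{0.5 * y0, n * y0};
  for (int i = 0; i < iterations; ++i) {
    double D = 0.5 - st.H * st.S;
    st.H = st.H + st.H * D;
    st.S = st.S + st.S * D;
  }
  return st;
}

// Final correction: sqrt(n) = S{i} + H{i} * (n - S{i}^2).
double finalCorrection(double n, HSState st) {
  return st.S + st.H * (n - st.S * st.S);
}
```

As the iterations proceed, S approaches sqrt(n) and H approaches 0.5/sqrt(n), so the product H*S approaches 0.5 and D shrinks toward zero.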

Thus, we need these 7 instructions and a single constant, 0.5:

  H = 0.5*y0 # FMUL
  S = n * y0 # FMUL
  D = 0.5 - S * H # FNMSUB
  H = H * D + H # FMA
  S = S * D + S # FMA
  E = n - S * S # FNMSUB
  res = E * H + S # FMA
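
The same 7-step sequence can be modelled in portable C++ with std::fma, which rounds once per fused operation. This is a sketch for experimentation, not the DAGCombiner code: the function name is hypothetical, y0 is assumed to be an estimate of 1/sqrt(n) supplied by the caller, and FNMSUB is written as fma(-a, b, c), which yields the same value c - a*b.

```cpp
#include <cmath>

// Hypothetical model of the 7-instruction sequence.
// n  : the input value
// y0 : caller-supplied estimate of 1/sqrt(n) (assumption of this sketch)
double sqrtFMASequence(double n, double y0) {
  double H = 0.5 * y0;              // FMUL
  double S = n * y0;                // FMUL
  double D = std::fma(-S, H, 0.5);  // FNMSUB: 0.5 - S * H
  H = std::fma(H, D, H);            // FMA:    H * D + H
  S = std::fma(S, D, S);            // FMA:    S * D + S
  double E = std::fma(-S, S, n);    // FNMSUB: n - S * S
  return std::fma(E, H, S);         // FMA:    E * H + S
}
```

To try it without a hardware estimate instruction, one option is to perturb the exact value, for example y0 = (1.0 / std::sqrt(n)) * (1.0 + std::ldexp(1.0, -14)), and compare the result against std::sqrt(n).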

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D78606

Files:
  llvm/include/llvm/CodeGen/TargetLowering.h
  llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
  llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
  llvm/lib/Target/AArch64/AArch64ISelLowering.h
  llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
  llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
  llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp
  llvm/lib/Target/NVPTX/NVPTXISelLowering.h
  llvm/lib/Target/PowerPC/PPC.td
  llvm/lib/Target/PowerPC/PPCISelLowering.cpp
  llvm/lib/Target/PowerPC/PPCISelLowering.h
  llvm/lib/Target/X86/X86ISelLowering.cpp
  llvm/lib/Target/X86/X86ISelLowering.h
  llvm/test/CodeGen/PowerPC/fma-mutate.ll
  llvm/test/CodeGen/PowerPC/fmf-propagation.ll
  llvm/test/CodeGen/PowerPC/qpx-recipest.ll
  llvm/test/CodeGen/PowerPC/recipest.ll
  llvm/test/CodeGen/PowerPC/vsx-fma-mutate-trivial-copy.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D78606.259147.patch
Type: text/x-patch
Size: 38521 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20200422/4c1b4c47/attachment-0001.bin>