[PATCH] Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)

Tue Oct 7 17:52:55 PDT 2014

Hi hfinkel, nadav,

This is a first step toward generating SSE rsqrt instructions for reciprocal square root calculations when fast-math is allowed.

For now, be conservative and only enable this for AMD btver2, where performance improves significantly. For example, it improves by 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c if we convert the data type to single-precision float.

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles, which is better than the 23-cycle critical path of the rsqrt + mul + mul + add + mul estimate.
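
For readers who want to see where the rsqrt + mul + mul + add + mul chain comes from, here is a rough C sketch of an rsqrtss estimate followed by one Newton-Raphson refinement step. This is only an illustration, not the code the patch emits: the function names rsqrt_estimate and rsqrt_refine are made up for this example, and the exact arrangement of the multiplies is an assumption chosen so that the dependent chain matches the sequence above.

```c
#include <assert.h>

#if defined(__SSE__)
#include <xmmintrin.h>
/* Hypothetical helper: hardware estimate via rsqrtss.
   Only about 12 bits of the result are accurate. */
static inline float rsqrt_estimate(float x) {
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#endif

/* Hypothetical helper: one Newton-Raphson step for 1/sqrt(x),
   computing est * 0.5 * (3 - x * est * est).
   It is written as (est * -0.5) * (x * est * est + -3) so that the
   dependent chain after the estimate is mul, mul, add, mul. */
static inline float rsqrt_refine(float x, float est) {
    assert(x > 0.0f);              /* sketch only handles positive inputs */
    float half_neg = est * -0.5f;  /* off the critical path */
    float t = x * est;             /* mul */
    t = t * est;                   /* mul */
    t = t + -3.0f;                 /* add */
    return half_neg * t;           /* mul */
}
```

On an SSE target, rsqrt_refine(x, rsqrt_estimate(x)) would stand in for 1.0f / sqrtf(x). The result is close to full single precision but not bit-identical, which is why this kind of transform needs fast-math.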

Follow-on patches may allow reciprocal (rcpss) optimizations, add more vector data types, and enable the optimization for more chips.

More background here: http://llvm.org/bugs/show_bug.cgi?id=20900

http://reviews.llvm.org/D5658

Files:
  lib/Target/X86/X86.td
  lib/Target/X86/X86ISelLowering.cpp
  lib/Target/X86/X86ISelLowering.h
  lib/Target/X86/X86Subtarget.cpp
  lib/Target/X86/X86Subtarget.h
  test/CodeGen/X86/sqrt-fastmath.ll