[llvm-dev] RFC: SIMD math-function library
Naoki Shibata via llvm-dev
llvm-dev at lists.llvm.org
Wed Jul 13 04:45:38 PDT 2016
Dear LLVM contributors,
I am Naoki Shibata, an associate professor at Nara Institute of Science
and Technology.
Hal Finkel and I would like to jointly propose adding my vectorized math
library to LLVM.
The library has been available as public domain software for years. I am
willing to dual-license the library if necessary.
Below is a proposal to add my vectorized math library, SLEEF, for
evaluating elementary functions (trigonometry, log, exp, etc.) to LLVM.
The library can be used directly, or can be targeted by an
autovectorization infrastructure. Patches to tie SLEEF into LLVM's
autovectorizer have been developed by Hal Finkel as part of the bgclang
project (which provides LLVM/Clang ported to the IBM BG/Q supercomputer
architecture). Hal has also developed a user-facing header for the
library, in the style of Clang's intrinsics headers, which we can use as
part of this project. SLEEF has been used as part of bgclang in this way
for several years.
The library currently supports several architectures:
* x86 - SSE2, FMA4, AVX, AVX2+FMA3
* ARM - NEON (single-precision only)
* A pure C (scalar) version
* Hal's version supports PowerPC/QPX.
It is fairly easy to port to other architectures. The library provides
similar functionality to Intel's Short Vector Math Library (available
with Intel's Compiler).
1) Get agreement on incorporating the library.
2) Rename the public interface to use only the
implementation-reserved namespace (i.e. names starting with
underscores), as is appropriate for a compiler runtime library.
3) Convert the functions to use LLVM's naming conventions (including, if
desired, converting the source files to C++ allowing the use of function
4) Create and document a public interface to the library.
5) Add support for targeting the library to LLVM's autovectorizer.
6) Work with the community to port the library to other architectures.
Recent CPUs and GPUs have vectorized FP multipliers and adders for
improving throughput of FP computation. In order to extract the maximum
computation power from processors with vectorized ALUs, the software has
to be vectorized to use SIMD data structures. It is also preferred that
conditional branches and scatter/gather memory access are eliminated as
much as possible. However, rewriting existing software in this fashion
is a hard, time-consuming task that involves converting data
structures. Efficient libraries and automatic vectorization are
therefore desirable.
In this proposal, we propose incorporating a vectorized math library,
currently named SLEEF, into LLVM's runtime libraries. By doing this,
elementary functions can be evaluated directly on SIMD data types. We
can also expect extra performance improvements by allowing LLVM to
automatically target the functions (and inline them with LTO).
Functionality of the library:
For each elementary function, the library contains subroutines for
evaluation in single precision and double precision. For a subset of the
elementary functions, the accuracy of the results can be chosen: there
are versions with up to 1 ulp error and versions with a few ulp error.
The less accurate versions are faster.
Please note that we have 0.5 ulp maximum error when we convert a real
number into a floating point number. In Hal's bgclang port, the less
accurate versions are used with -ffast-math, and the more-accurate ones
otherwise.
For non-finite inputs and outputs, the library should return the same
results as libm. The evaluation error is tested to be within the
designed limit by comparison against high-precision evaluation using
the libmpfr library. In particular, we rigorously checked the error of
the trigonometric functions when the arguments are close to an integral
multiple of PI/2.
The size of the functions is very small.
Implementation of the library:
Basically, each function consists of reduction and kernel. For the
kernel, a polynomial approximation is used. The coefficients are
carefully set to minimize the number of multiplications and additions
while reducing the error. The reduction is devised so that the same
kernel can be used over the whole range of input arguments. To
improve the accuracy of the functions with 1-ulp error, double-double
calculations are used. Fused multiply-add operations, which are now
widely available, can further improve the performance of these
functions. Some of the implementation techniques used in the library are
explained in .