[llvm-dev] RFC: SIMD math-function library

Wed Jul 13 04:45:38 PDT 2016

Dear LLVM contributors,

I am Naoki Shibata, an associate professor at Nara Institute of Science 
and Technology.

Hal Finkel and I would like to jointly propose adding my vectorized math 
library to LLVM.

The library has been available as public domain software for years. I am 
willing to dual-license the library if necessary.

********

Below is a proposal to add my vectorized math library, SLEEF [1], for
evaluating elementary functions (trigonometry, log, exp, etc.) to LLVM. 
The library can be used directly, or can be targeted by an 
autovectorization infrastructure. Patches to tie SLEEF into LLVM's 
autovectorizer have been developed by Hal Finkel as part of the bgclang 
project (which provides LLVM/Clang ported to the IBM BG/Q supercomputer 
architecture). Hal has also developed a user-facing header for the 
library, in the style of Clang's intrinsics headers, which we can use as 
part of this project. SLEEF has been used as part of bgclang in this way 
for several years.

The library currently supports several architectures:
  * x86 - SSE2, FMA4, AVX, AVX2+FMA3
  * ARM - NEON (single-precision only)
  * A pure C (scalar) version
  * Hal's version supports PowerPC/QPX.

It is fairly easy to port to other architectures. The library provides 
similar functionality to Intel's Short Vector Math Library (available 
with Intel's Compiler).

Roadmap:
--------
1) Get agreement on incorporating the library.
2) Rename the public interface to use only the
    implementation-reserved namespace (i.e. names starting with
    underscores), as is appropriate for a compiler runtime library.
3) Convert the functions to use LLVM's naming conventions (including, if
    desired, converting the source files to C++ allowing the use of function
    overloading).
4) Create and document a public interface to the library.
5) Add support for targeting the library to LLVM's autovectorizer.
6) Work with the community to port the library to other architectures.

Motivation:

Recent CPUs and GPUs have vectorized FP multipliers and adders for 
improving throughput of FP computation. In order to extract the maximum 
computation power from processors with vectorized ALUs, the software has 
to be vectorized to use SIMD data structures. It is also preferred that 
conditional branches and scatter/gather memory access are eliminated as 
much as possible. However, rewriting existing software in this fashion 
is a very hard and time-consuming task that involves converting data 
structures. Efficient vectorized libraries and automatic vectorization 
are therefore desirable.

In this proposal, we propose to incorporate a vectorized math library,
currently named SLEEF, into LLVM's runtime libraries. By doing this, 
elementary functions can be directly evaluated using SIMD data types. We 
can also expect extra performance improvements by allowing LLVM to 
automatically target the functions (and inline them with LTO).

Functionality of the library:

For each elementary function, the library contains subroutines for 
evaluation in single precision and double precision. For a subset of 
the elementary functions, the accuracy can be chosen: there are versions 
with up to 1 ulp error and versions with a few ulp error. The less 
accurate versions are faster. Note that merely rounding a real number 
to the nearest floating-point number already incurs an error of up to 
0.5 ulp. In Hal's bgclang port, the less accurate versions are used with 
-ffast-math, and the more accurate ones otherwise.

For non-finite inputs and outputs, the library should return the same 
results as libm. The library is tested to confirm that the evaluation 
error is within the designed limit, by comparing against high-precision 
evaluation using the MPFR library. In particular, we rigorously checked 
the error of the trigonometric functions when the arguments are close to 
an integral multiple of PI/2.

The size of the functions is very small.

Implementation of the library:

Each function consists of an argument reduction step and a kernel. For 
the kernel, a polynomial approximation is used. The coefficients are 
carefully chosen to minimize the number of multiplications and additions 
while keeping the error small. The reduction is devised so that the same 
kernel can be used over the whole range of input arguments. To improve 
the accuracy of the functions with 1-ulp error, double-double 
calculations are used. Fused multiply-add operations, which are common 
on recent processors, can further improve the performance of these 
functions. Some of the implementation techniques used in the library are
explained in [3].

[1] https://github.com/shibatch/sleef
[2] https://github.com/hfinkel/sleef-bgq/blob/master/simd/qpxmath.h
[3] http://ito-lab.naist.jp/~n-sibata/pdfs/isc10simd.pdf

********

Regards,

Naoki Shibata