[llvm-dev] RFC: SIMD math-function library

Thu Jul 14 20:53:54 PDT 2016

Hi again,

As this RFC implies, I've been using the SLEEF library proposed here with Clang/LLVM for many years, and fully support its adoption into the LLVM project.

I'm CC'ing Matt and Xinmin from Intel, who have started working on contributing support for their SVML library to LLVM (http://reviews.llvm.org/D19544) and who, I understand, plan to contribute (some subset of) the vector math functions themselves. I'm also excited about Intel's planned contributions.

Here's how I currently see the situation: Regardless of what Intel contributes, we need a solution in this space for many different architectures. From personal experience, SLEEF is relatively easy to port to different architectures (i.e., different vector ISAs), and has already been ported to several. The performance is good, as is the accuracy. I think it would make a great foundation for a vector-math-function runtime library for the LLVM project. I don't know what routines Intel is planning to contribute, or for what architectures they're tuned, but I expect we'll want to use those implementations on x86 platforms where appropriate.

Matt, Xinmin, what do you think?

Thanks again,
Hal

----- Original Message -----
> From: "Naoki Shibata" <shibatch.sf.net at gmail.com>
> To: llvm-dev at lists.llvm.org
> Cc: "Hal Finkel" <hfinkel at anl.gov>
> Sent: Wednesday, July 13, 2016 6:45:38 AM
> Subject: RFC: SIMD math-function library
> 
> 
> Dear LLVM contributors,
> 
> I am Naoki Shibata, an associate professor at Nara Institute of Science
> and Technology.
> 
> Hal Finkel and I would like to jointly propose adding my vectorized
> math library to LLVM.
> 
> The library has been available as public domain software for years,
> and I will dual-license it if necessary.
> 
> ********
> 
> Below is a proposal to add my vectorized math library, SLEEF [1], for
> evaluating elementary functions (trigonometry, log, exp, etc.) to LLVM.
> The library can be used directly, or can be targeted by an
> autovectorization infrastructure. Patches to tie SLEEF into LLVM's
> autovectorizer have been developed by Hal Finkel as part of the bgclang
> project (which provides LLVM/Clang ported to the IBM BG/Q supercomputer
> architecture). Hal has also developed a user-facing header for the
> library, in the style of Clang's intrinsics headers, which we can use as
> part of this project. SLEEF has been used as part of bgclang in this way
> for several years.
> 
> The library currently supports several architectures:
>   * x86 - SSE2, FMA4, AVX, AVX2+FMA3
>   * ARM - NEON (single-precision only)
>   * A pure C (scalar) version
>   * PowerPC - QPX (in Hal's version)
> 
> It is fairly easy to port to other architectures. The library provides
> similar functionality to Intel's Short Vector Math Library (available
> with Intel's Compiler).
> 
> Roadmap:
> --------
> 1) Get agreement on incorporating the library.
> 2) Rename the public interface to use only the
>     implementation-reserved namespace (i.e. names starting with
>     underscores), as is appropriate for a compiler runtime library.
> 3) Convert the functions to use LLVM's naming conventions (including,
>     if desired, converting the source files to C++, allowing the use of
>     function overloading).
> 4) Create and document a public interface to the library.
> 5) Add support for targeting the library to LLVM's autovectorizer.
> 6) Work with the community to port the library to other architectures.
> 
> Motivation:
> 
> Recent CPUs and GPUs have vectorized FP multipliers and adders to
> improve the throughput of FP computation. To extract the maximum
> computational power from processors with vectorized ALUs, software has
> to be vectorized to use SIMD data structures. It is also preferable to
> eliminate conditional branches and scatter/gather memory accesses as
> much as possible. However, rewriting existing software in this fashion
> is a very hard and time-consuming task that involves converting data
> structures. Thus, efficient libraries and automatic vectorization are
> needed.
> 
> In this proposal, we are going to incorporate a vectorized math
> library, currently named SLEEF, into the LLVM runtime library. By doing
> this, elementary functions can be evaluated directly on SIMD data
> types. We can also expect extra performance improvements by allowing
> LLVM to automatically target the functions (and inline them with LTO).
> 
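To illustrate what "targeting" a vector routine means, here is a minimal C sketch of the loop transformation an autovectorizer would perform. It is not SLEEF or LLVM code: sketch_f is a hypothetical stand-in for a scalar math routine (a plain cubic polynomial), sketch_f_v4 is its hypothetical four-lane counterpart, and the vector type relies on the GCC/Clang vector_size extension.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four floats handled as one SIMD value. */
typedef float sketch_v4f __attribute__((vector_size(16)));

/* Hypothetical stand-in for a scalar math routine (a cubic polynomial). */
float sketch_f(float x) {
    return x - x * x * x * (1.0f / 6.0f);
}

/* Hypothetical vector counterpart: the same arithmetic on four lanes. */
sketch_v4f sketch_f_v4(sketch_v4f x) {
    const sketch_v4f c = {1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f, 1.0f / 6.0f};
    return x - x * x * x * c;
}

/* The loop as the programmer writes it: one scalar call per element. */
void sketch_apply_scalar(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = sketch_f(in[i]);
}

/* The loop after vectorization (written by hand here): one vector call per
   four elements, followed by a scalar loop for the leftover elements. */
void sketch_apply_vector(const float *in, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sketch_v4f v;
        memcpy(&v, in + i, sizeof v);   /* unaligned load of four lanes */
        v = sketch_f_v4(v);
        memcpy(out + i, &v, sizeof v);  /* unaligned store of four lanes */
    }
    for (; i < n; i++)
        out[i] = sketch_f(in[i]);       /* remainder elements */
}
```

Both functions should produce the same results up to rounding; the point of compiler support is that the second form is generated automatically from the first.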
> Functionality of the library:
> 
> For each elementary function, the library contains subroutines for
> evaluation in single precision and double precision. For a subset of
> the elementary functions, the accuracy of the results can be chosen:
> there are versions with up to 1 ulp of error and versions with a few
> ulp of error. The less accurate versions are faster. Note that merely
> converting a real number into a floating-point number already incurs up
> to 0.5 ulp of error. In Hal's bgclang port, the less accurate versions
> are used with -ffast-math, and the more accurate ones otherwise.
> 
> For non-finite inputs and outputs, the library should return the same
> results as libm. The library is tested to confirm that the evaluation
> error is within the designed limit, by comparison against
> high-precision evaluation using the libmpfr library. In particular, we
> rigorously checked the error of the trigonometric functions when the
> arguments are close to an integral multiple of PI/2.
> 
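As a rough illustration of how an error "in ulp" can be measured, here is a small C sketch. It is not the library's test harness: the helper names are hypothetical, and it uses a double-precision value as the reference for a single-precision result rather than an MPFR evaluation. It assumes finite inputs and an IEEE 754 binary32 float.

```c
#include <stdint.h>
#include <string.h>

/* Spacing between |v| and the next larger float, i.e. one ulp at v.
   Assumes v is finite and not the largest representable float. */
static float sketch_ulp_of(float v) {
    float a = (v < 0.0f) ? -v : v;
    float next;
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);
    bits += 1;                          /* next representable float above |v| */
    memcpy(&next, &bits, sizeof next);
    return next - a;
}

/* Error of a float result, in units of the ulp at the reference value. */
double sketch_ulp_error(float computed, double reference) {
    double ulp  = (double)sketch_ulp_of((float)reference);
    double diff = (double)computed - reference;
    return ((diff < 0.0) ? -diff : diff) / ulp;
}
```

For example, sketch_ulp_error((float)0.1, 0.1) is at most 0.5, which is the rounding error that even a correctly rounded result carries.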
> The size of the functions is very small.
> 
> Implementation of the library:
> 
> Each function consists of a reduction step and a kernel. For the
> kernel, a polynomial approximation is used. The coefficients are
> carefully set to minimize the number of multiplications and additions
> while reducing the error. The reduction is devised so that the same
> kernel can be used over the entire range of input arguments. To improve
> the accuracy of the functions with 1-ulp error, double-double
> calculations are used. Fused multiply-add operations, which are now
> common, can further improve the performance of these functions. Some of
> the implementation techniques used in the library are explained in [3].
> 
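The reduction-plus-kernel structure can be sketched in scalar C as follows. This is an illustration only and not SLEEF's implementation: the function names are hypothetical, the kernel uses plain Taylor coefficients instead of tuned ones, the two-part split of pi is a simplified stand-in for a careful reduction, and no double-double arithmetic is used. It is meant for moderate |x| only and is accurate to roughly 1e-7.

```c
/* Kernel: odd polynomial approximating sin(r) for |r| <= pi/2.
   Taylor coefficients are used for readability; a tuned library would
   choose coefficients that minimize the maximum error instead. */
static double sketch_sin_kernel(double r) {
    double s = r * r;
    double p = -1.0 / 39916800.0;       /* -1/11! */
    p = p * s + 1.0 / 362880.0;         /*  1/9!  */
    p = p * s - 1.0 / 5040.0;           /* -1/7!  */
    p = p * s + 1.0 / 120.0;            /*  1/5!  */
    p = p * s - 1.0 / 6.0;              /* -1/3!  */
    return r + r * s * p;               /* r + r^3 * P(r^2) */
}

/* Reduction: write x = q*pi + r with |r| <= pi/2, so that
   sin(x) = (-1)^q * sin(r) and one kernel covers every input. */
double sketch_sin(double x) {
    const double pi    = 3.14159265358979323846;
    const double pi_hi = (double)(float)pi;  /* leading 24 bits of pi */
    const double pi_lo = pi - pi_hi;         /* remaining bits of the double */
    /* Round x/pi to the nearest integer (valid for moderate |x| only). */
    long q = (long)(x / pi + ((x < 0.0) ? -0.5 : 0.5));
    /* Subtract q*pi in two parts; q*pi_hi is exact for small q. */
    double r = (x - (double)q * pi_hi) - (double)q * pi_lo;
    double s = sketch_sin_kernel(r);
    return (q % 2 != 0) ? -s : s;            /* flip sign for odd q */
}
```

Because the kernel contains no branches, the same arithmetic maps directly onto SIMD lanes; only the final sign selection needs a per-lane select.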
> [1] https://github.com/shibatch/sleef
> [2] https://github.com/hfinkel/sleef-bgq/blob/master/simd/qpxmath.h
> [3] http://ito-lab.naist.jp/~n-sibata/pdfs/isc10simd.pdf
> 
> 
> ********
> 
> Regards,
> 
> Naoki Shibata
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory