[PATCH] D19544: Pass for translating math intrinsics to math library calls.

Tue Apr 26 10:56:38 PDT 2016

mmasten created this revision.
mmasten added reviewers: hfinkel, mzolotukhin, spatel.
mmasten added a subscriber: llvm-commits.
Herald added a subscriber: joker.eph.

This is the first patch related to translating math intrinsics to math library calls. This set of changes relates specifically to translating vector math intrinsics to svml function calls and laying a foundation so that intrinsics can be translated to any library of choice regardless of target. The changes here correspond to the following RFC.

RFC: A proposal for vectorizing calls to math functions using the Intel short vector math library (SVML)

=========
Overview
=========

Very simply, SVML (short vector math library) functions are vector variants of scalar math functions: they take vector arguments,
apply an operation to each element, and return the result in a vector register. The compiler can generate calls to these
variants, selected according to precision requirements specified by the user, which can yield substantial performance gains.
This is an initial proposal to add a new LLVM IR transformation pass that translates math calls, once widened to vector
intrinsics, to SVML calls.

====================
Problem Description
====================

Currently, without "#pragma clang loop vectorize(enable)", the loop vectorizer will not vectorize loops with math
calls because its cost model deems vectorization unprofitable. When the loop pragma is used, the loop vectorizer will widen
the math call using an intrinsic, but the resulting code is inefficient because the intrinsic is later replaced with scalarized
function calls. The example below shows a simple loop containing a sinf call. For demonstration purposes, the example was
compiled for an xmm target, thus VF = 4 given the float type.

Example sinf.c

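#include <math.h>

#define N 1000

/* The enclosing function is illustrative; the IR below takes %array as a float pointer. */
void fill(float *array) {
  int i;
#pragma clang loop vectorize(enable)
  for (i = 0; i < N; i++) {
    array[i] = sinf((float)i);
  }
}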

Without the loop pragma, the loop vectorizer's cost model rejects the loop:

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize -Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not beneficial [-Rpass-analysis=loop-vectorize]
  for (i = 0; i < N; i++) {
  ^
sinf.c:19:3: remark: the cost-model indicates that interleaving is not beneficial and is explicitly disabled or interleave
count is set to 1 [-Rpass-analysis=loop-vectorize]

When the loop pragma is used, the loop is vectorized (i.e., @llvm.sin.v4f32 is generated), but the call is later scalarized, with the
additional overhead of unpacking the scalar function arguments from a vector. This can be seen in the
assembly code shown just below the LLVM IR.

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6, <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
  %3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
  %4 = bitcast float* %3 to <4 x float>*, !dbg !10
  store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
  %index.next = add i64 %index, 4, !dbg !6
  %5 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        movaps  %xmm0, 16(%rsp)         # 16-byte Spill
        shufps  $231, %xmm0, %xmm0      # xmm0 = xmm0[3,1,2,3]
        callq   sinf
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        shufps  $229, %xmm0, %xmm0      # xmm0 = xmm0[1,1,2,3]
        callq   sinf
        unpcklps        (%rsp), %xmm0   # 16-byte Folded Reload
                                        # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        callq   sinf
        movaps  %xmm0, 32(%rsp)         # 16-byte Spill
        movapd  16(%rsp), %xmm0         # 16-byte Reload
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
        callq   sinf
        movaps  32(%rsp), %xmm1         # 16-byte Reload
        unpcklps        %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
        unpcklps        (%rsp), %xmm1   # 16-byte Folded Reload
                                        # xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
        movups  %xmm1, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1
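
For readers who prefer source to assembly, the scalarized loop body above does roughly what the following C++ sketch does. It is
illustrative only and not code from the patch: Vec4f and scalarize are made-up names, and the array simply stands in for one xmm
register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Four float lanes, standing in for one xmm register (VF = 4).
using Vec4f = std::array<float, 4>;

// Scalarization of a 4-wide math intrinsic: extract each lane, call the
// scalar routine on it, and insert the result back into a vector.
Vec4f scalarize(float (*scalarFn)(float), const Vec4f &in) {
  assert(scalarFn != nullptr && "a scalar callee is required");
  Vec4f out{};
  for (std::size_t lane = 0; lane < in.size(); ++lane)
    out[lane] = scalarFn(in[lane]); // one library call per lane
  return out;
}
```

Passing sinf as scalarFn gives the four "callq sinf" instructions seen above; the spills and shuffles around them correspond to
the lane extraction and re-insertion.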

===========================
Proposed New Functionality
===========================

In order to take advantage of the performance benefits of the svml library, the proposed solution is to introduce a new
LLVM IR pass that is capable of translating the vector math intrinsics to svml calls. As an example, the LLVM IR above for
"vector.body", introduced in the Problem Description section, would serve as input to the proposed pass and be transformed
into the following LLVM IR. Special attention should be paid to the "__svml_sinf4_ha" call in the LLVM IR and resulting
assembly code snippet.

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6, <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
  %2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
  %3 = bitcast float* %2 to <4 x float>*, !dbg !9
  store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
  %index.next = add i64 %index, 4, !dbg !6
  %4 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        callq   __svml_sinf4_ha
        movups  %xmm0, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1
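
The renaming from @llvm.sin.v4f32 to __svml_sinf4_ha shown above can be sketched as plain string handling. This is a minimal
sketch and not the code in the patch: VectorMathCall, parseIntrinsic, and svmlName are hypothetical names, the fixed "_ha" suffix
reflects only the default case shown here, and the double precision spelling (no "f" before the lane count) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pieces of an intrinsic name of the shape "llvm.<op>.v<VF>f32" or "...f64".
struct VectorMathCall {
  std::string op; // e.g. "sin"
  unsigned vf;    // number of lanes, e.g. 4
  bool isF32;     // true for float lanes, false for double lanes
};

// Returns true and fills `out` if `name` has the expected shape.
bool parseIntrinsic(const std::string &name, VectorMathCall &out) {
  const std::string prefix = "llvm.";
  if (name.compare(0, prefix.size(), prefix) != 0)
    return false;
  std::size_t opEnd = name.find(".v", prefix.size());
  if (opEnd == std::string::npos || opEnd == prefix.size())
    return false;
  std::size_t vfBegin = opEnd + 2;
  std::size_t vfEnd = name.find('f', vfBegin);
  if (vfEnd == std::string::npos || vfEnd == vfBegin || vfEnd - vfBegin > 4)
    return false;
  std::string digits = name.substr(vfBegin, vfEnd - vfBegin);
  if (digits.find_first_not_of("0123456789") != std::string::npos)
    return false;
  std::string elt = name.substr(vfEnd);
  if (elt != "f32" && elt != "f64")
    return false;
  out.op = name.substr(prefix.size(), opEnd - prefix.size());
  out.vf = static_cast<unsigned>(std::stoul(digits));
  out.isF32 = (elt == "f32");
  return out.vf != 0;
}

// Builds the callee name with the default high accuracy suffix.
std::string svmlName(const VectorMathCall &call) {
  assert(!call.op.empty() && call.vf != 0);
  return "__svml_" + call.op + (call.isF32 ? "f" : "") +
         std::to_string(call.vf) + "_ha";
}
```

The actual pass works on IR call instructions rather than strings, so this only illustrates the naming scheme.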

In order to perform the translation, several requirements must be met to guide code generation. Those include:

1) In addition to the -ffast-math flag, support is needed from clang to let the user specify the desired
   precision requirements. The additional flags needed include the following, where imf is shorthand for "Intel math function".

   -fimf-absolute-error=value[:funclist]
          define the maximum allowable absolute error for math library
          function results
            value    - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-accuracy-bits=bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-arch-consistency=value[:funclist]
          ensures that the math library functions produce consistent results
          across different implementations of the same architecture
            value    - true or false
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-max-error=ulps[:funclist]
          defines the maximum allowable relative error, measured in ulps, for
          math library function results
            ulps     - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-precision=value[:funclist]
          defines the accuracy (precision) for math library functions
            value    - defined as one of the following values
                       high   - equivalent to max-error = 0.6
                       medium - equivalent to max-error = 4 (DEFAULT)
                       low    - equivalent to accuracy-bits = 11 (single
                                precision); accuracy-bits = 26 (double
                                precision)
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-domain-exclusion=classlist[:funclist]
          indicates the input arguments domain on which math functions
          must provide correct results.
           classlist - defined as one of the following values
                         nans, infinities, denormals, zeros
                         all, none, common
           funclist - optional list of one or more math library
                      functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each call site. In the future, this
functionality will enable more fine-grained control over specifying precision for individual calls/regions,
instead of setting the precision requirements for all call instances of a function. Please note that the
example translation presented so far does not have the IMF attributes attached to the @llvm.sin.v4f32 call,
and as a result the default is set to an svml variant marked with "_ha", which is short for high accuracy.
Other supported variants will include low precision, enhanced performance, bitwise reproducible, and correctly
rounded. Please refer to the IEEE-754 standard for additional information regarding supported precisions. The
compiler will select the most appropriate variant based on the IMF attributes (see requirement 2 below).
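
All of the flags above share the value[:funclist] argument shape, so a front end could split them with one helper before
attaching attributes to call sites. The following is only a sketch under stated assumptions: ImfSetting, parseImfArgument, and
appliesTo are hypothetical names, and treating an omitted funclist as "applies to every math function" is an assumption rather
than something the flag descriptions state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One parsed flag argument, i.e. the text after '=' in -fimf-...=value[:funclist].
struct ImfSetting {
  std::string value;              // text before the optional ':'
  std::vector<std::string> funcs; // empty: assumed to mean "all functions"
};

// Splits "value[:f1,f2,...]" into its value and function list.
ImfSetting parseImfArgument(const std::string &arg) {
  assert(!arg.empty() && "flag argument must not be empty");
  ImfSetting s;
  std::size_t colon = arg.find(':');
  s.value = arg.substr(0, colon); // whole string when there is no ':'
  if (colon != std::string::npos) {
    std::istringstream list(arg.substr(colon + 1));
    std::string name;
    while (std::getline(list, name, ','))
      if (!name.empty())
        s.funcs.push_back(name);
  }
  return s;
}

// True if the setting should be attached to calls of `func`.
bool appliesTo(const ImfSetting &s, const std::string &func) {
  return s.funcs.empty() ||
         std::find(s.funcs.begin(), s.funcs.end(), func) != s.funcs.end();
}
```

For example, parsing "0.6:sin,cos" yields the value "0.6" restricted to sin and cos, while parsing "high" yields a setting with
an empty function list.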

2) An interface to query for the appropriate svml function variant based on the scalar function name and IMF
   attributes.

3) For calls to math functions that store to memory (e.g., sincos), additional analysis of the pointer
   arguments is beneficial in order to generate the best performing store instructions.
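
As a rough illustration of the query interface in requirement 2, the sketch below picks a variant from a scalar function name
and a requested maximum error. It is not the interface in the patch. Variant, ImfAttrs, isSupported, and queryVariant are
hypothetical names, and reading the "high" (0.6 ulps) and "medium" (4 ulps) equivalences from -fimf-precision as per-variant
error bounds is an assumption. The "low" setting is left out because it is specified in accuracy bits rather than ulps.

```cpp
#include <cassert>
#include <string>

enum class Variant { None, HighAccuracy, Medium };

// Hypothetical stand-in for the IMF attributes attached to one call site.
struct ImfAttrs {
  bool hasMaxError;    // false when no IMF attribute is present
  double maxErrorUlps; // requested bound in ulps, used only if hasMaxError
};

// Bounds quoted in the -fimf-precision description.
const double kHighUlps = 0.6;
const double kMediumUlps = 4.0;

// True for the six functions named in the GCC/ICC compatibility section,
// with or without the single precision 'f' suffix (sinf, cosf, ...).
bool isSupported(std::string name) {
  if (!name.empty() && name.back() == 'f')
    name.pop_back();
  return name == "sin" || name == "cos" || name == "log" || name == "pow" ||
         name == "exp" || name == "sincos";
}

// Returns the least accurate variant that still meets the requested bound,
// or Variant::None when the call should be left to scalarization.
Variant queryVariant(const std::string &scalarName, const ImfAttrs &attrs) {
  if (!isSupported(scalarName))
    return Variant::None;
  if (!attrs.hasMaxError)
    return Variant::HighAccuracy; // default when no attributes are present
  assert(attrs.maxErrorUlps > 0.0 && "max error must be positive");
  if (attrs.maxErrorUlps >= kMediumUlps)
    return Variant::Medium;
  if (attrs.maxErrorUlps >= kHighUlps)
    return Variant::HighAccuracy;
  return Variant::None;
}
```

Preferring the least accurate variant that satisfies the bound assumes that looser variants are faster, which is the usual
reason for offering them.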

======================
GCC/ICC compatibility
======================

The initial implementation will involve the translation of 6 svml functions: sin, cos, log, pow, exp, and
sincos (both single and double precision variants). Support for these functions matches the current capabilities of GCC and a
subset of ICC. As more functions become open-sourced, the plan is to support them as part of the final solution determined
from this proposal. The flags referenced in the Proposed New Functionality section are required to maintain ICC compatibility.

=======================
Current Implementation
=======================

To evaluate the feasibility of this proposal, a prototype transform pass has been developed, which performs
the following:

1) Searches for vector math intrinsics as candidates for translation to svml.
2) Reads function attributes to obtain precision requirements for the call. If none are present, the pass defaults to attributes
   that will force the selection of a high accuracy variant.
3) Since the vectorization factor of the intrinsic can be wider than what is legally supported by the target, type legalization is
   performed so that the correct svml variant is selected. For example, if a call to @llvm.sin.v8f32(<8 x float> %1) is made
   for an xmm target, the pass will generate two __svml_sinf4 calls and will split the vector arguments as required
   for each call. The pass is also capable of handling less than full vector cases (e.g., @llvm.sin.v2f32).
4) Special handling for sincos since the results are stored to a double wide vector and additional analysis is needed to optimize
   the stores to memory.
5) Vector intrinsics that are not translated to svml are scalarized.
6) The loop vectorizer has been taught to allow widening of sincos and additional utilities have been written to analyze arguments
   for sincos.
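
The type legalization in step 3 can be pictured with the following C++ sketch, which applies a fixed-width vector routine to an
input of any width. It is a simplified model and not the code in the patch: Lanes and legalizeCall are made-up names, and
handling a less than full vector by padding it with zeros and dropping the extra result lanes is an assumption about one
possible strategy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Lanes = std::vector<float>;

// Applies a routine that works on exactly legalVF lanes to an input of
// arbitrary width. Wide inputs are split into several calls; a short tail is
// padded up to legalVF and the padding lanes are dropped from the result.
template <typename VecFn>
Lanes legalizeCall(VecFn vecFn, const Lanes &in, std::size_t legalVF,
                   std::size_t *numCalls = nullptr) {
  assert(legalVF > 0 && "target must support at least one lane");
  Lanes out;
  out.reserve(in.size());
  std::size_t calls = 0;
  for (std::size_t base = 0; base < in.size(); base += legalVF) {
    Lanes chunk(legalVF, 0.0f); // padding lanes hold 0.0f
    std::size_t n = std::min(legalVF, in.size() - base);
    std::copy(in.begin() + base, in.begin() + base + n, chunk.begin());
    Lanes result = vecFn(chunk);
    assert(result.size() == legalVF);
    out.insert(out.end(), result.begin(), result.begin() + n);
    ++calls;
  }
  if (numCalls)
    *numCalls = calls;
  return out;
}
```

With legalVF = 4, an 8-lane input results in two calls, matching the @llvm.sin.v8f32 example in step 3, and a 2-lane input
results in a single call whose upper two lanes are ignored.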

=========
Feedback
=========

I would appreciate it if those interested in this topic would review this proposal and provide feedback on the proposed
approach. Help with development is also welcome and much appreciated.

http://reviews.llvm.org/D19544

Files:
  include/llvm/Analysis/TargetLibraryInfo.def
  include/llvm/IR/Intrinsics.td
  include/llvm/InitializePasses.h
  include/llvm/LinkAllPasses.h
  include/llvm/Transforms/IntrinToMathLib/ImlAccuracyInterface.h
  include/llvm/Transforms/IntrinToMathLib/ImlAttrPrivate.h
  include/llvm/Transforms/IntrinToMathLib/ImlExp2f.h
  include/llvm/Transforms/IntrinToMathLib/ImlExpfTable.h
  include/llvm/Transforms/IntrinToMathLib/IntrinToMathLib.h
  include/llvm/Transforms/IntrinToMathLib/Messaging.h
  include/llvm/Transforms/IntrinToMathLib/Search.h
  lib/Transforms/CMakeLists.txt
  lib/Transforms/IPO/LLVMBuild.txt
  lib/Transforms/IPO/PassManagerBuilder.cpp
  lib/Transforms/IntrinToMathLib/CMakeLists.txt
  lib/Transforms/IntrinToMathLib/ImlAccuracyInterface.cpp
  lib/Transforms/IntrinToMathLib/ImlTableInline.inc
  lib/Transforms/IntrinToMathLib/ImlTableSvmlIA32.inc
  lib/Transforms/IntrinToMathLib/IntrinToMathLib.cpp
  lib/Transforms/IntrinToMathLib/LLVMBuild.txt
  lib/Transforms/IntrinToMathLib/Search.cpp
  lib/Transforms/LLVMBuild.txt
  test/Transforms/IntrinToMathLib/cosf.ll
  test/Transforms/IntrinToMathLib/expf.ll
  test/Transforms/IntrinToMathLib/logf.ll
  test/Transforms/IntrinToMathLib/powf.ll
  test/Transforms/IntrinToMathLib/sinf.ll
  tools/bugpoint/CMakeLists.txt
  tools/bugpoint/bugpoint.cpp
  tools/opt/CMakeLists.txt
  tools/opt/opt.cpp
