[llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML
Sanjay Patel via llvm-dev
llvm-dev at lists.llvm.org
Mon Apr 4 10:57:19 PDT 2016
Hi Matt -
Are you using the same TLI hook as Darwin's Accelerate framework:
addVectorizableFunctionsFromVecLib()? If not, why not?
On Thu, Mar 31, 2016 at 6:20 PM, Masten, Matt via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> RFC: A proposal for vectorizing loops with calls to math functions using
> SVML (short vector math library).
>
> =========
> Overview
> =========
>
> Very simply, SVML (Intel short vector math library) functions are vector
> variants of scalar math functions that take vector arguments, apply an
> operation to each element, and store the result in a vector register.
> These vector variants can be generated by the compiler, based on precision
> requirements specified by the user, resulting in substantial performance
> gains. This is an initial proposal to add a new LLVM IR transformation
> pass that will translate scalar math calls to SVML calls with the help of
> the loop vectorizer.
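>
> As a concrete illustration, the scalar libm entry point and its assumed
> 4-wide high-accuracy SVML counterpart (written here as LLVM IR
> declarations; the vector prototype is inferred from the example later in
> this proposal) look like:
>
>   declare float @sinf(float)
>   declare <4 x float> @__svml_sinf4_ha(<4 x float>)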
>
> ====================
> Problem Description
> ====================
>
> Currently, without the "#pragma clang loop vectorize(enable)", the loop
> vectorizer will not vectorize loops with math calls due to cost model
> reasons. Additionally, when the loop pragma is used, the loop vectorizer
> will widen the math call using an intrinsic, but the resulting code is
> inefficient because the intrinsic is replaced with scalarized function
> calls. Please see the example below for a simple loop containing a sinf
> call. For demonstration purposes, the example was compiled for an xmm
> target, thus VF = 4 given the float type.
>
> Example: sinf.c
>
> #include <math.h>
>
> #define N 1000
>
> void foo(float *array) {
>   #pragma clang loop vectorize(enable)
>   for (int i = 0; i < N; i++) {
>     array[i] = sinf((float)i);
>   }
> }
>
> Without the loop pragma the loop vectorizer's cost model rejects the loop.
>
> clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize
> -Rpass-missed=loop-vectorize sinf.c
>
> sinf.c:19:3: remark: the cost-model indicates that vectorization is not
> beneficial [-Rpass-analysis=loop-vectorize]
> for (i = 0; i < N; i++) {
> ^
> sinf.c:19:3: remark: the cost-model indicates that interleaving is not
> beneficial and is explicitly disabled or interleave count is set to 1
> [-Rpass-analysis=loop-vectorize]
>
> When the loop pragma is used, the loop is vectorized and the call to
> @llvm.sin.v4f32 is generated, but the call is later scalarized with the
> additional overhead of unpacking the scalar function arguments from a
> vector. This can be seen from inspection of the resulting assembly code
> just below the LLVM IR.
>
> vector.body:                        ; preds = %vector.body, %vector.ph
>   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
>   %0 = trunc i64 %index to i32, !dbg !7
>   %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
>   %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
>       <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
>   %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
>   %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
>   %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
>   %3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
>   %4 = bitcast float* %3 to <4 x float>*, !dbg !10
>   store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
>   %index.next = add i64 %index, 4, !dbg !6
>   %5 = icmp eq i64 %index.next, 1000, !dbg !6
>   br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15
>
>
> .LBB0_1: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movd %ebx, %xmm0
> pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
> paddd .LCPI0_0(%rip), %xmm0
> cvtdq2ps %xmm0, %xmm0
> movaps %xmm0, 16(%rsp) # 16-byte Spill
> shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
> callq sinf
> movaps %xmm0, (%rsp) # 16-byte Spill
> movaps 16(%rsp), %xmm0 # 16-byte Reload
> shufps $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3]
> callq sinf
> unpcklps (%rsp), %xmm0 # 16-byte Folded Reload
>                        # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
> movaps %xmm0, (%rsp) # 16-byte Spill
> movaps 16(%rsp), %xmm0 # 16-byte Reload
> callq sinf
> movaps %xmm0, 32(%rsp) # 16-byte Spill
> movapd 16(%rsp), %xmm0 # 16-byte Reload
> shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
> callq sinf
> movaps 32(%rsp), %xmm1 # 16-byte Reload
> unpcklps %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
> unpcklps (%rsp), %xmm1 # 16-byte Folded Reload
>                        # xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
> movups %xmm1, (%r14,%rbx,4)
> addq $4, %rbx
> cmpq $1000, %rbx # imm = 0x3E8
> jne .LBB0_1
>
> ===========================
> Proposed New Functionality
> ===========================
>
> In order to take advantage of the performance benefits of the SVML
> library, the proposed solution is to introduce a new LLVM IR pass that is
> capable of translating the vector math intrinsics to SVML calls. As an
> example, the LLVM IR above for "vector.body", introduced in the Problem
> Description section, would serve as input to the proposed pass and be
> transformed into the following LLVM IR. Special attention should be paid
> to the "__svml_sinf4_ha" call in the LLVM IR and resulting assembly code
> snippet.
>
> vector.body:                        ; preds = %vector.body, %entry
>   %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
>   %0 = trunc i64 %index to i32, !dbg !7
>   %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
>   %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
>       <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
>   %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
>   %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
>   %vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
>   %2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
>   %3 = bitcast float* %2 to <4 x float>*, !dbg !9
>   store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
>   %index.next = add i64 %index, 4, !dbg !6
>   %4 = icmp eq i64 %index.next, 1000, !dbg !6
>   br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14
>
> The resulting assembly would appear as:
>
> .LBB0_1: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movd %ebx, %xmm0
> pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
> paddd .LCPI0_0(%rip), %xmm0
> cvtdq2ps %xmm0, %xmm0
> callq __svml_sinf4_ha
> movups %xmm0, (%r14,%rbx,4)
> addq $4, %rbx
> cmpq $1000, %rbx # imm = 0x3E8
> jne .LBB0_1
>
> In order to perform the translation, several requirements must be met to
> guide code generation. Those include:
>
> 1) In addition to the -ffast-math flag, support is needed from clang to
>    allow the user to specify the desired precision requirements. The
>    additional flags needed include the following, where "imf" is
>    shorthand for "Intel math function".
>
>    -fimf-absolute-error=value[:funclist]
>        defines the maximum allowable absolute error for math library
>        function results
>        value    - a positive, floating-point number conforming to the
>                   format [digits][.digits][{e|E}[sign]digits]
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-accuracy-bits=bits[:funclist]
>        defines the relative error, measured by the number of correct
>        bits, for math library function results
>        bits     - a positive, floating-point number
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-arch-consistency=value[:funclist]
>        ensures that the math library functions produce consistent
>        results across different implementations of the same architecture
>        value    - true or false
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-max-error=ulps[:funclist]
>        defines the maximum allowable relative error, measured in ulps,
>        for math library function results
>        ulps     - a positive, floating-point number conforming to the
>                   format [digits][.digits][{e|E}[sign]digits]
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-precision=value[:funclist]
>        defines the accuracy (precision) for math library functions
>        value    - defined as one of the following values:
>                   high   - equivalent to max-error = 0.6
>                   medium - equivalent to max-error = 4
>                   low    - equivalent to accuracy-bits = 11 (single
>                            precision); accuracy-bits = 26 (double
>                            precision)
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-domain-exclusion=classlist[:funclist]
>        indicates the domain of input arguments on which the math
>        functions must provide correct results
>        classlist - defined as one of the following values:
>                    nans, infinities, denormals, zeros,
>                    all, none, common
>        funclist  - optional list of one or more math library functions
>                    to which the attribute should be applied.
>
> Information from the flags can then be encoded as function attributes at
> each call site (see the sketch following these requirements). In the
> future, this functionality will enable more fine-grained control over
> specifying precision for individual calls/regions, instead of setting the
> precision requirements for all call instances of a function. Please note
> that the example translation presented so far does not have the IMF
> attributes attached to the @llvm.sin.v4f32 call, and as a result the
> default is set to an SVML variant marked with "_ha" (max-error = 0.6),
> which is short for high accuracy. Other supported variants will include
> low precision, enhanced performance, bitwise reproducible, and correctly
> rounded. Please refer to the IEEE-754 standard for additional information
> regarding supported precisions. The compiler will select the most
> appropriate variant based on the IMF attributes. See #2 below.
>
> 2) An interface to query for the appropriate SVML function variant based
>    on the scalar function name and IMF attributes.
>
> 3) For calls to math functions that store to memory (e.g., sincos),
>    additional analysis of the pointer arguments is beneficial in order to
>    generate the best performing load/store instructions.
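>
> As a rough sketch of how the encoding described in requirement #1 might
> look, the IR below attaches precision information to the widened
> intrinsic call as string function attributes. The attribute spellings
> ("fimf-precision", "fimf-max-error") are hypothetical placeholders for
> illustration only:
>
>   ; assumed result of compiling with "-fimf-max-error=4:sinf"
>   %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1) #0
>   ...
>   attributes #0 = { "fimf-precision"="medium" "fimf-max-error"="4" }
>
> The proposed pass would read these attributes (or fall back to the
> high-accuracy default) and select the matching SVML variant via the
> interface in requirement #2.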
>
> ======================
> GCC/ICC compatibility
> ======================
>
> The initial implementation will involve the translation of six SVML
> functions: sin, cos, log, pow, exp, and sincos (both single and double
> precision variants). Support for these functions matches the current
> capabilities of GCC and a subset of ICC. As more functions become
> open-sourced, the plan is to support them as part of the final solution
> determined from this proposal. The flags referenced in the Proposed New
> Functionality section are required to maintain ICC compatibility.
>
> =======================
> Current Implementation
> =======================
>
> To evaluate the feasibility of this proposal, a prototype transform pass
> has been developed, which performs the following:
>
> 1) Searches for vector math intrinsics as candidates for translation to
>    SVML.
>
> 2) Reads function attributes to obtain precision requirements for the
>    call. If none are present, the pass defaults to attributes that force
>    the selection of a high accuracy variant.
>
> 3) Since the vector factor of the intrinsic can be wider than what is
>    legally supported by the target, type legalization is performed so
>    that the correct SVML variant is selected. For example, if a call to
>    @llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass
>    will generate two __svml_sinf4 calls and will do the appropriate
>    splitting of %1 to create the new arguments for each call. In
>    addition, the multiple return vectors are recombined and users of the
>    original return vector are updated (see the sketch after this list).
>    The pass is also capable of handling less-than-full vector cases,
>    e.g., @llvm.sin.v2f32.
>
> 4) Special handling for sincos, since the results are stored to a
>    double-wide vector and additional analysis is needed to optimize the
>    stores to memory. Additional shuffling is required to obtain the sin
>    and cos results from the double-wide vector (a second sketch follows
>    the list).
>
> 5) Vector intrinsics that are not translated to svml are scalarized.
>
> 6) The loop vectorizer has been taught to allow widening of sincos and
> additional utilities have been written to analyze arguments for sincos.
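>
> To make item 3 concrete, the sketch below (illustrative value names only)
> shows the splitting and recombining described above for an
> @llvm.sin.v8f32 call on an xmm target:
>
>   ; before legalization
>   %wide = call <8 x float> @llvm.sin.v8f32(<8 x float> %v)
>
>   ; after legalization: two legal-width SVML calls, results recombined
>   %v.lo = shufflevector <8 x float> %v, <8 x float> undef,
>                         <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %v.hi = shufflevector <8 x float> %v, <8 x float> undef,
>                         <4 x i32> <i32 4, i32 5, i32 6, i32 7>
>   %sin.lo = call <4 x float> @__svml_sinf4(<4 x float> %v.lo)
>   %sin.hi = call <4 x float> @__svml_sinf4(<4 x float> %v.hi)
>   %wide = shufflevector <4 x float> %sin.lo, <4 x float> %sin.hi,
>                         <8 x i32> <i32 0, i32 1, i32 2, i32 3,
>                                    i32 4, i32 5, i32 6, i32 7>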
>
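> Similarly for item 4, the sketch below shows the kind of shuffling used to
> separate the sin and cos results. The widened call name, its return type,
> and the lane layout (sin in the low half, cos in the high half) are
> assumptions made purely for illustration and are not the actual SVML
> interface; %x, %psin, and %pcos are placeholder values:
>
>   ; hypothetical double-wide sincos result for VF = 4
>   %sc = call <8 x float> @__svml_sincosf4(<4 x float> %x)
>   %sinres = shufflevector <8 x float> %sc, <8 x float> undef,
>                           <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %cosres = shufflevector <8 x float> %sc, <8 x float> undef,
>                           <4 x i32> <i32 4, i32 5, i32 6, i32 7>
>   store <4 x float> %sinres, <4 x float>* %psin, align 4
>   store <4 x float> %cosres, <4 x float>* %pcos, align 4
>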
> =========
> Feedback
> =========
>
> For those who are interested in this topic, I would like to ask for your
> review of this proposal and would definitely appreciate any/all feedback
> on the proposed approach. Help is also very welcome and much appreciated
> in the development process.