<div dir="ltr"><div>Hi Matt -<br><br></div>Are you using the same TLI hook as Darwin's Accelerate framework: addVectorizableFunctionsFromVecLib()? If not, why not?<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 31, 2016 at 6:20 PM, Masten, Matt via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">RFC: A proposal for vectorizing loops with calls to math functions using SVML (short<br>
vector math library).

=========
Overview
=========

Very simply, SVML (Intel short vector math library) functions are vector variants of
scalar math functions that take vector arguments, apply an operation to each
element, and store the result in a vector register. These vector variants can be
generated by the compiler, based on precision requirements specified by the
user, resulting in substantial performance gains. This is an initial proposal to
add a new LLVM IR transformation pass that will translate scalar math calls to
SVML calls with the help of the loop vectorizer.
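
To make the scalar-to-vector mapping concrete, the declarations below are a
rough C++ sketch of what the vector variants look like to a caller; the exact
SVML prototypes and calling convention are assumptions made for illustration
only and are not part of this proposal.

  #include <immintrin.h>

  // Assumed prototypes, for illustration only: each variant applies the
  // scalar operation lane-wise to a packed 128-bit argument.
  extern "C" __m128  __svml_sinf4(__m128 x);   // four sinf results per call
  extern "C" __m128d __svml_sin2(__m128d x);   // two sin results per call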

====================
Problem Description
====================

Currently, without "#pragma clang loop vectorize(enable)", the loop vectorizer
will not vectorize loops with math calls because the cost model does not
consider vectorization beneficial. Additionally, when the loop pragma is used,
the loop vectorizer will widen the math call using an intrinsic, but the
resulting code is inefficient because the intrinsic is later replaced with
scalarized function calls. Please see the example below for a simple loop
containing a sinf call. For demonstration purposes, the example was compiled
for an xmm target, so VF = 4 given the float type.

Example: sinf.c

#include <math.h>

#define N 1000

float array[N];

void foo(void) {
  int i;

#pragma clang loop vectorize(enable)
  for (i = 0; i < N; i++) {
    array[i] = sinf((float)i);
  }
}

Without the loop pragma, the loop vectorizer's cost model rejects the loop.

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize \
-Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not
beneficial [-Rpass-analysis=loop-vectorize]
for (i = 0; i < N; i++) {
^
sinf.c:19:3: remark: the cost-model indicates that interleaving is not
beneficial and is explicitly disabled or interleave count is set to 1
[-Rpass-analysis=loop-vectorize]

When the loop pragma is used, the loop is vectorized and a call to
@llvm.sin.v4f32 is generated, but the call is later scalarized, with the
additional overhead of unpacking the scalar function arguments from a vector.
This can be seen by inspecting the resulting assembly code just below the
LLVM IR.

vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
!dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
<4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
!dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
%3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
%4 = bitcast float* %3 to <4 x float>*, !dbg !10
store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
%index.next = add i64 %index, 4, !dbg !6
%5 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15


.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd %ebx, %xmm0
pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
paddd .LCPI0_0(%rip), %xmm0
cvtdq2ps %xmm0, %xmm0
movaps %xmm0, 16(%rsp) # 16-byte Spill
shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
callq sinf
movaps %xmm0, (%rsp) # 16-byte Spill
movaps 16(%rsp), %xmm0 # 16-byte Reload
shufps $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3]
callq sinf
unpcklps (%rsp), %xmm0 # 16-byte Folded Reload
# xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
movaps %xmm0, (%rsp) # 16-byte Spill
movaps 16(%rsp), %xmm0 # 16-byte Reload
callq sinf
movaps %xmm0, 32(%rsp) # 16-byte Spill
movapd 16(%rsp), %xmm0 # 16-byte Reload
shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
callq sinf
movaps 32(%rsp), %xmm1 # 16-byte Reload
unpcklps %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
unpcklps (%rsp), %xmm1 # 16-byte Folded Reload
# xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
movups %xmm1, (%r14,%rbx,4)
addq $4, %rbx
cmpq $1000, %rbx # imm = 0x3E8
jne .LBB0_1

===========================
Proposed New Functionality
===========================

To take advantage of the performance benefits of the SVML library, the
proposed solution is to introduce a new LLVM IR pass that translates the
vector math intrinsics to SVML calls. As an example, the LLVM IR above for
"vector.body", introduced in the Problem Description section, would serve as
input to the proposed pass and be transformed into the following LLVM IR. Note
the "__svml_sinf4_ha" call in the LLVM IR and in the resulting assembly
snippet.

vector.body: ; preds = %vector.body, %entry
%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
!dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
<4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
!dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
%2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
%3 = bitcast float* %2 to <4 x float>*, !dbg !9
store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
%index.next = add i64 %index, 4, !dbg !6
%4 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd %ebx, %xmm0
pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
paddd .LCPI0_0(%rip), %xmm0
cvtdq2ps %xmm0, %xmm0
callq __svml_sinf4_ha
movups %xmm0, (%r14,%rbx,4)
addq $4, %rbx
cmpq $1000, %rbx # imm = 0x3E8
jne .LBB0_1

To perform the translation, several requirements must be met to guide code
generation. These include:

1) In addition to the -ffast-math flag, support is needed from Clang to allow
the user to specify the desired precision requirements. The additional flags
needed include the following, where "imf" is shorthand for "Intel math
function".

-fimf-absolute-error=value[:funclist]
    defines the maximum allowable absolute error for math library
    function results
    value    - a positive, floating-point number conforming to the
               format [digits][.digits][{e|E}[sign]digits]
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-accuracy-bits=bits[:funclist]
    defines the relative error, measured by the number of correct bits,
    for math library function results
    bits     - a positive, floating-point number
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-arch-consistency=value[:funclist]
    ensures that the math library functions produce consistent results
    across different implementations of the same architecture
    value    - true or false
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-max-error=ulps[:funclist]
    defines the maximum allowable relative error, measured in ulps, for
    math library function results
    ulps     - a positive, floating-point number conforming to the
               format [digits][.digits][{e|E}[sign]digits]
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-precision=value[:funclist]
    defines the accuracy (precision) for math library functions
    value    - one of the following values:
               high   - equivalent to max-error = 0.6
               medium - equivalent to max-error = 4
               low    - equivalent to accuracy-bits = 11 (single
                        precision); accuracy-bits = 26 (double
                        precision)
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-domain-exclusion=classlist[:funclist]
    indicates the input argument domain on which math functions
    must provide correct results.
    classlist - one or more of the following values:
                nans, infinities, denormals, zeros,
                all, none, common
    funclist  - optional list of one or more math library
                functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each
call site. In the future, this functionality will enable more fine-grained
control over specifying precision for individual calls/regions, instead of
setting the precision requirements for all call instances of a function. Please
note that the example translation presented so far does not have the IMF
attributes attached to the @llvm.sin.v4f32 call; as a result, the default is an
SVML variant marked with "_ha" (max-error = 0.6), which is short for high
accuracy. Other supported variants will include low precision, enhanced
performance, bitwise reproducible, and correctly rounded. Please refer to the
IEEE-754 standard for additional information regarding supported precisions.
The compiler will select the most appropriate variant based on the IMF
attributes. See #2.

2) An interface to query for the appropriate SVML function variant based on the
scalar function name and IMF attributes (a rough sketch of such an interface
follows this list).

3) For calls to math functions that store to memory (e.g., sincos), additional
analysis of the pointer arguments is beneficial in order to generate the
best-performing load/store instructions.
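
To make requirement #2 more concrete, the following is a minimal C++ sketch of
what such a query might look like. The attribute spelling ("imf-max-error"),
the struct and function names, and the single sinf mapping shown are
assumptions made for illustration only; they are not part of the proposal.

  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  // Hypothetical IMF precision requirements gathered from attributes.
  struct IMFAttrs {
    StringRef MaxErrorULPs = "0.6"; // default selects a high-accuracy (_ha) variant
  };

  // Read the (assumed) "imf-max-error" string attribute if present.
  static IMFAttrs readIMFAttrs(const Function &F) {
    IMFAttrs A;
    if (F.hasFnAttribute("imf-max-error"))
      A.MaxErrorULPs = F.getFnAttribute("imf-max-error").getValueAsString();
    return A;
  }

  // Map a scalar libm name, vector factor, and IMF requirements to an SVML
  // variant name; an empty result means no suitable variant exists.
  static StringRef getSVMLVariant(StringRef ScalarName, unsigned VF,
                                  const IMFAttrs &A) {
    if (ScalarName == "sinf" && VF == 4)
      return A.MaxErrorULPs == "0.6" ? "__svml_sinf4_ha" : "__svml_sinf4";
    return "";
  }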

======================
GCC/ICC Compatibility
======================

The initial implementation will translate six SVML functions: sin, cos, log,
pow, exp, and sincos (both single- and double-precision variants). Support for
these functions matches the current capabilities of GCC and a subset of ICC. As
more functions become open-sourced, the plan is to support them as part of the
final solution determined from this proposal. The flags referenced in the
Proposed New Functionality section are required to maintain ICC compatibility.

=======================
Current Implementation
=======================

To evaluate the feasibility of this proposal, a prototype transform pass has
been developed, which performs the following:

1) Searches for vector math intrinsics as candidates for translation to SVML.

2) Reads function attributes to obtain precision requirements for the call. If
none are present, the pass defaults to attributes that force the selection of a
high-accuracy variant.

3) Since the vector factor of the intrinsic can be wider than what is legally
supported by the target, type legalization is performed so that the correct
SVML variant is selected. For example, if a call to
@llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass will
generate two __svml_sinf4 calls and split %1 appropriately to create the new
arguments for each call. The multiple return vectors are then recombined and
users of the original return vector are updated. The pass is also capable of
handling less-than-full-vector cases, e.g., @llvm.sin.v2f32. (A sketch of this
splitting appears after this list.)

4) Special handling for sincos, since the results are stored to a double-wide
vector and additional analysis is needed to optimize the stores to memory.
Additional shuffling is required to obtain the sin and cos results from the
double-wide vector. (See the second sketch after this list.)

5) Vector intrinsics that are not translated to SVML are scalarized.

6) The loop vectorizer has been taught to allow widening of sincos, and
additional utilities have been written to analyze arguments for sincos.
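
As a rough illustration of the type legalization in item 3, the C++ sketch
below splits an @llvm.sin.v8f32 call into two __svml_sinf4 calls using the
IRBuilder API. The helper name and the way the SVML callee is obtained are
assumptions made for illustration; this is not the prototype's actual code.

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Split the <8 x float> argument of an @llvm.sin.v8f32 call into two
  // <4 x float> halves, call __svml_sinf4 on each, and concatenate the
  // results back into an <8 x float> for the original users.
  static Value *legalizeSinV8F32(CallInst *CI, FunctionCallee SVMLSinf4) {
    IRBuilder<> B(CI);
    Value *Arg = CI->getArgOperand(0);                    // <8 x float>
    Value *Lo = B.CreateShuffleVector(Arg, {0, 1, 2, 3}); // low half
    Value *Hi = B.CreateShuffleVector(Arg, {4, 5, 6, 7}); // high half
    Value *LoRes = B.CreateCall(SVMLSinf4, {Lo});
    Value *HiRes = B.CreateCall(SVMLSinf4, {Hi});
    return B.CreateShuffleVector(LoRes, HiRes, {0, 1, 2, 3, 4, 5, 6, 7});
  }

The caller would then replace all uses of the original intrinsic with the
returned value and erase the intrinsic call.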
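
For item 4, the extra shuffling could look like the sketch below (using the
same includes as the previous sketch). The layout of the double-wide vector
(sin results in the low lanes, cos results in the high lanes) and the helper
name are assumptions made purely for illustration.

  // Given the double-wide <8 x float> result of a 4-wide sincos variant
  // (assumed layout: lanes 0-3 hold sin, lanes 4-7 hold cos), extract the
  // two <4 x float> halves and store them to the destination pointers.
  static void splitSinCosResult(IRBuilder<> &B, Value *Wide,
                                Value *SinPtr, Value *CosPtr) {
    Value *Sin = B.CreateShuffleVector(Wide, {0, 1, 2, 3}); // sin lanes (assumed)
    Value *Cos = B.CreateShuffleVector(Wide, {4, 5, 6, 7}); // cos lanes (assumed)
    B.CreateStore(Sin, SinPtr);
    B.CreateStore(Cos, CosPtr);
  }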

=========
Feedback
=========

For those who are interested in this topic, I would like to ask for your review
of this proposal, and I would appreciate any and all feedback on the proposed
approach. Help with the development process is also very welcome and much
appreciated.