[llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML
Sanjay Patel via llvm-dev
llvm-dev at lists.llvm.org
Mon Apr 4 10:57:19 PDT 2016
Hi Matt -
Are you using the same TLI hook as Darwin's Accelerate framework:
addVectorizableFunctionsFromVecLib()? If not, why not?
On Thu, Mar 31, 2016 at 6:20 PM, Masten, Matt via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> RFC: A proposal for vectorizing loops with calls to math functions using
> SVML (short vector math library).
>
> =========
> Overview
> =========
>
> Very simply, SVML (Intel short vector math library) functions are vector
> variants of scalar math functions that take vector arguments, apply an
> operation to each element, and store the result in a vector register.
> These vector variants can be generated by the compiler, based on precision
> requirements specified by the user, resulting in substantial performance
> gains. This is an initial proposal to add a new LLVM IR transformation
> pass that will translate scalar math calls to SVML calls with the help of
> the loop vectorizer.
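>
> As a concrete illustration, the scalar libm entry point and its assumed
> 4-wide high-accuracy SVML counterpart (written here as LLVM IR
> declarations; the vector prototype is inferred from the example later in
> this proposal) look like:
>
>   declare float @sinf(float)
>   declare <4 x float> @__svml_sinf4_ha(<4 x float>)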
>
> ====================
> Problem Description
> ====================
>
> Currently, without the "#pragma clang loop vectorize(enable)", the loop
> vectorizer will not vectorize loops with math calls due to cost model
> reasons. Additionally, when the loop pragma is used, the loop vectorizer
> will widen the math call using an intrinsic, but the resulting code is
> inefficient because the intrinsic is replaced with scalarized function
> calls. Please see the example below for a simple loop containing a sinf
> call. For demonstration purposes, the example was compiled for an xmm
> target, thus VF = 4 given the float type.
>
> Example: sinf.c
>
> #include <math.h>
>
> #define N 1000
>
> void foo(float *array) {
>   #pragma clang loop vectorize(enable)
>   for (int i = 0; i < N; i++) {
>     array[i] = sinf((float)i);
>   }
> }
>
> Without the loop pragma the loop vectorizer's cost model rejects the loop.
>
> clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize
> -Rpass-missed=loop-vectorize sinf.c
>
> sinf.c:19:3: remark: the cost-model indicates that vectorization is not
> beneficial [-Rpass-analysis=loop-vectorize]
> for (i = 0; i < N; i++) {
> ^
> sinf.c:19:3: remark: the cost-model indicates that interleaving is not
> beneficial and is explicitly disabled or interleave count is set to 1
> [-Rpass-analysis=loop-vectorize]
>
> When the loop pragma is used, the loop is vectorized and the call to
> @llvm.sin.v4f32 is generated, but the call is later scalarized with the
> additional overhead of unpacking the scalar function arguments from a
> vector. This can be seen from inspection of the resulting assembly code
> just below the LLVM IR.
>
> vector.body:                        ; preds = %vector.body, %vector.ph
>   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
>   %0 = trunc i64 %index to i32, !dbg !7
>   %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
>   %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
>       <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
>   %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
>   %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
>   %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
>   %3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
>   %4 = bitcast float* %3 to <4 x float>*, !dbg !10
>   store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
>   %index.next = add i64 %index, 4, !dbg !6
>   %5 = icmp eq i64 %index.next, 1000, !dbg !6
>   br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15
>
>
> .LBB0_1: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movd %ebx, %xmm0
> pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
> paddd .LCPI0_0(%rip), %xmm0
> cvtdq2ps %xmm0, %xmm0
> movaps %xmm0, 16(%rsp) # 16-byte Spill
> shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
> callq sinf
> movaps %xmm0, (%rsp) # 16-byte Spill
> movaps 16(%rsp), %xmm0 # 16-byte Reload
> shufps $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3]
> callq sinf
> unpcklps (%rsp), %xmm0 # 16-byte Folded Reload
>                        # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
> movaps %xmm0, (%rsp) # 16-byte Spill
> movaps 16(%rsp), %xmm0 # 16-byte Reload
> callq sinf
> movaps %xmm0, 32(%rsp) # 16-byte Spill
> movapd 16(%rsp), %xmm0 # 16-byte Reload
> shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
> callq sinf
> movaps 32(%rsp), %xmm1 # 16-byte Reload
> unpcklps %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
> unpcklps (%rsp), %xmm1 # 16-byte Folded Reload
>                        # xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
> movups %xmm1, (%r14,%rbx,4)
> addq $4, %rbx
> cmpq $1000, %rbx # imm = 0x3E8
> jne .LBB0_1
>
> ===========================
> Proposed New Functionality
> ===========================
>
> In order to take advantage of the performance benefits of the SVML
> library, the proposed solution is to introduce a new LLVM IR pass that is
> capable of translating the vector math intrinsics to SVML calls. As an
> example, the LLVM IR above for "vector.body", introduced in the Problem
> Description section, would serve as input to the proposed pass and be
> transformed into the following LLVM IR. Special attention should be paid
> to the "__svml_sinf4_ha" call in the LLVM IR and resulting assembly code
> snippet.
>
> vector.body:                        ; preds = %vector.body, %entry
>   %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
>   %0 = trunc i64 %index to i32, !dbg !7
>   %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0, !dbg !7
>   %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
>       <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
>   %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>, !dbg !7
>   %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
>   %vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
>   %2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
>   %3 = bitcast float* %2 to <4 x float>*, !dbg !9
>   store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
>   %index.next = add i64 %index, 4, !dbg !6
>   %4 = icmp eq i64 %index.next, 1000, !dbg !6
>   br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14
>
> The resulting assembly would appear as:
>
> .LBB0_1: # %vector.body
> # =>This Inner Loop Header: Depth=1
> movd %ebx, %xmm0
> pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
> paddd .LCPI0_0(%rip), %xmm0
> cvtdq2ps %xmm0, %xmm0
> callq __svml_sinf4_ha
> movups %xmm0, (%r14,%rbx,4)
> addq $4, %rbx
> cmpq $1000, %rbx # imm = 0x3E8
> jne .LBB0_1
>
> In order to perform the translation, several requirements must be met to
> guide code generation. Those include:
>
> 1) In addition to the -ffast-math flag, support is needed from clang to
>    allow the user to specify the desired precision requirements. The
>    additional flags needed include the following, where "imf" is
>    shorthand for "Intel math function".
>
>    -fimf-absolute-error=value[:funclist]
>        defines the maximum allowable absolute error for math library
>        function results
>        value    - a positive, floating-point number conforming to the
>                   format [digits][.digits][{e|E}[sign]digits]
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-accuracy-bits=bits[:funclist]
>        defines the relative error, measured by the number of correct
>        bits, for math library function results
>        bits     - a positive, floating-point number
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-arch-consistency=value[:funclist]
>        ensures that the math library functions produce consistent
>        results across different implementations of the same architecture
>        value    - true or false
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-max-error=ulps[:funclist]
>        defines the maximum allowable relative error, measured in ulps,
>        for math library function results
>        ulps     - a positive, floating-point number conforming to the
>                   format [digits][.digits][{e|E}[sign]digits]
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-precision=value[:funclist]
>        defines the accuracy (precision) for math library functions
>        value    - defined as one of the following values:
>                   high   - equivalent to max-error = 0.6
>                   medium - equivalent to max-error = 4
>                   low    - equivalent to accuracy-bits = 11 (single
>                            precision); accuracy-bits = 26 (double
>                            precision)
>        funclist - optional comma-separated list of one or more math
>                   library functions to which the attribute should be
>                   applied
>
>    -fimf-domain-exclusion=classlist[:funclist]
>        indicates the domain of input arguments on which the math
>        functions must provide correct results
>        classlist - defined as one of the following values:
>                    nans, infinities, denormals, zeros,
>                    all, none, common
>        funclist  - optional list of one or more math library functions
>                    to which the attribute should be applied.
>
> Information from the flags can then be encoded as function attributes at
> each call site (see the sketch following these requirements). In the
> future, this functionality will enable more fine-grained control over
> specifying precision for individual calls/regions, instead of setting the
> precision requirements for all call instances of a function. Please note
> that the example translation presented so far does not have the IMF
> attributes attached to the @llvm.sin.v4f32 call, and as a result the
> default is set to an SVML variant marked with "_ha" (max-error = 0.6),
> which is short for high accuracy. Other supported variants will include
> low precision, enhanced performance, bitwise reproducible, and correctly
> rounded. Please refer to the IEEE-754 standard for additional information
> regarding supported precisions. The compiler will select the most
> appropriate variant based on the IMF attributes. See #2 below.
>
> 2) An interface to query for the appropriate SVML function variant based
>    on the scalar function name and IMF attributes.
>
> 3) For calls to math functions that store to memory (e.g., sincos),
>    additional analysis of the pointer arguments is beneficial in order to
>    generate the best performing load/store instructions.
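>
> As a rough sketch of how the encoding described in requirement #1 might
> look, the IR below attaches precision information to the widened
> intrinsic call as string function attributes. The attribute spellings
> ("fimf-precision", "fimf-max-error") are hypothetical placeholders for
> illustration only:
>
>   ; assumed result of compiling with "-fimf-max-error=4:sinf"
>   %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1) #0
>   ...
>   attributes #0 = { "fimf-precision"="medium" "fimf-max-error"="4" }
>
> The proposed pass would read these attributes (or fall back to the
> high-accuracy default) and select the matching SVML variant via the
> interface in requirement #2.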
>
> ======================
> GCC/ICC compatibility
> ======================
>
> The initial implementation will involve the translation of six SVML
> functions: sin, cos, log, pow, exp, and sincos (both single and double
> precision variants). Support for these functions matches the current
> capabilities of GCC and a subset of ICC. As more functions become
> open-sourced, the plan is to support them as part of the final solution
> determined from this proposal. The flags referenced in the Proposed New
> Functionality section are required to maintain ICC compatibility.
>
> =======================
> Current Implementation
> =======================
>
> To evaluate the feasibility of this proposal, a prototype transform pass
> has been developed, which performs the following:
>
> 1) Searches for vector math intrinsics as candidates for translation to
>    SVML.
>
> 2) Reads function attributes to obtain precision requirements for the
>    call. If none are present, the pass defaults to attributes that force
>    the selection of a high accuracy variant.
>
> 3) Since the vector factor of the intrinsic can be wider than what is
>    legally supported by the target, type legalization is performed so
>    that the correct SVML variant is selected. For example, if a call to
>    @llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass
>    will generate two __svml_sinf4 calls and will do the appropriate
>    splitting of %1 to create the new arguments for each call. In
>    addition, the multiple return vectors are recombined and users of the
>    original return vector are updated (see the sketch after this list).
>    The pass is also capable of handling less-than-full vector cases,
>    e.g., @llvm.sin.v2f32.
>
> 4) Special handling for sincos, since the results are stored to a
>    double-wide vector and additional analysis is needed to optimize the
>    stores to memory. Additional shuffling is required to obtain the sin
>    and cos results from the double-wide vector (a second sketch follows
>    the list).
>
> 5) Vector intrinsics that are not translated to svml are scalarized.
>
> 6) The loop vectorizer has been taught to allow widening of sincos and
> additional utilities have been written to analyze arguments for sincos.
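>
> To make item 3 concrete, the sketch below (illustrative value names only)
> shows the splitting and recombining described above for an
> @llvm.sin.v8f32 call on an xmm target:
>
>   ; before legalization
>   %wide = call <8 x float> @llvm.sin.v8f32(<8 x float> %v)
>
>   ; after legalization: two legal-width SVML calls, results recombined
>   %v.lo = shufflevector <8 x float> %v, <8 x float> undef,
>                         <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %v.hi = shufflevector <8 x float> %v, <8 x float> undef,
>                         <4 x i32> <i32 4, i32 5, i32 6, i32 7>
>   %sin.lo = call <4 x float> @__svml_sinf4(<4 x float> %v.lo)
>   %sin.hi = call <4 x float> @__svml_sinf4(<4 x float> %v.hi)
>   %wide = shufflevector <4 x float> %sin.lo, <4 x float> %sin.hi,
>                         <8 x i32> <i32 0, i32 1, i32 2, i32 3,
>                                    i32 4, i32 5, i32 6, i32 7>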
>
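> Similarly for item 4, the sketch below shows the kind of shuffling used to
> separate the sin and cos results. The widened call name, its return type,
> and the lane layout (sin in the low half, cos in the high half) are
> assumptions made purely for illustration and are not the actual SVML
> interface; %x, %psin, and %pcos are placeholder values:
>
>   ; hypothetical double-wide sincos result for VF = 4
>   %sc = call <8 x float> @__svml_sincosf4(<4 x float> %x)
>   %sinres = shufflevector <8 x float> %sc, <8 x float> undef,
>                           <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %cosres = shufflevector <8 x float> %sc, <8 x float> undef,
>                           <4 x i32> <i32 4, i32 5, i32 6, i32 7>
>   store <4 x float> %sinres, <4 x float>* %psin, align 4
>   store <4 x float> %cosres, <4 x float>* %pcos, align 4
>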
> =========
> Feedback
> =========
>
> For those who are interested in this topic, I would like to ask for your
> review of this proposal and would definitely appreciate any/all feedback
> on the proposed approach. Help is also very welcome and much appreciated
> in the development process.