<div dir="ltr"><div>Hi Matt -<br><br></div>Are you using the same TLI hook as Darwin's Accelerate framework: addVectorizableFunctionsFromVecLib()? If not, why not?<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 31, 2016 at 6:20 PM, Masten, Matt via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">RFC: A proposal for vectorizing loops with calls to math functions using SVML (short<br>
vector math library).

=========
Overview
=========

Very simply, SVML (Intel short vector math library) functions are vector variants of
scalar math functions that take vector arguments, apply an operation to each
element, and store the result in a vector register. These vector variants can be
generated by the compiler, based on precision requirements specified by the
user, resulting in substantial performance gains. This is an initial proposal to
add a new LLVM IR transformation pass that will translate scalar math calls to
SVML calls with the help of the loop vectorizer.
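
To make the scalar-to-vector mapping concrete, the declarations below are a
rough C++ sketch of what the vector variants look like to a caller; the exact
SVML prototypes and calling convention are assumptions made for illustration
only and are not part of this proposal.

  #include <immintrin.h>

  // Assumed prototypes, for illustration only: each variant applies the
  // scalar operation lane-wise to a packed 128-bit argument.
  extern "C" __m128  __svml_sinf4(__m128 x);   // four sinf results per call
  extern "C" __m128d __svml_sin2(__m128d x);   // two sin results per call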

====================
Problem Description
====================

Currently, without "#pragma clang loop vectorize(enable)", the loop vectorizer
will not vectorize loops with math calls because the cost model does not
consider vectorization beneficial. Additionally, when the loop pragma is used,
the loop vectorizer will widen the math call using an intrinsic, but the
resulting code is inefficient because the intrinsic is later replaced with
scalarized function calls. Please see the example below for a simple loop
containing a sinf call. For demonstration purposes, the example was compiled
for an xmm target, so VF = 4 given the float type.

Example: sinf.c

#include <math.h>

#define N 1000

float array[N];

void foo(void) {
  int i;

#pragma clang loop vectorize(enable)
  for (i = 0; i < N; i++) {
    array[i] = sinf((float)i);
  }
}

Without the loop pragma, the loop vectorizer's cost model rejects the loop.

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize \
-Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not
beneficial [-Rpass-analysis=loop-vectorize]
for (i = 0; i < N; i++) {
^
sinf.c:19:3: remark: the cost-model indicates that interleaving is not
beneficial and is explicitly disabled or interleave count is set to 1
[-Rpass-analysis=loop-vectorize]

When the loop pragma is used, the loop is vectorized and a call to
@llvm.sin.v4f32 is generated, but the call is later scalarized, with the
additional overhead of unpacking the scalar function arguments from a vector.
This can be seen by inspecting the resulting assembly code just below the
LLVM IR.

vector.body: ; preds = %vector.body, %vector.ph
%index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
!dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
<4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
!dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
%3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
%4 = bitcast float* %3 to <4 x float>*, !dbg !10
store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
%index.next = add i64 %index, 4, !dbg !6
%5 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15


.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd %ebx, %xmm0
pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
paddd .LCPI0_0(%rip), %xmm0
cvtdq2ps %xmm0, %xmm0
movaps %xmm0, 16(%rsp) # 16-byte Spill
shufps $231, %xmm0, %xmm0 # xmm0 = xmm0[3,1,2,3]
callq sinf
movaps %xmm0, (%rsp) # 16-byte Spill
movaps 16(%rsp), %xmm0 # 16-byte Reload
shufps $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3]
callq sinf
unpcklps (%rsp), %xmm0 # 16-byte Folded Reload
# xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
movaps %xmm0, (%rsp) # 16-byte Spill
movaps 16(%rsp), %xmm0 # 16-byte Reload
callq sinf
movaps %xmm0, 32(%rsp) # 16-byte Spill
movapd 16(%rsp), %xmm0 # 16-byte Reload
shufpd $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
callq sinf
movaps 32(%rsp), %xmm1 # 16-byte Reload
unpcklps %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
unpcklps (%rsp), %xmm1 # 16-byte Folded Reload
# xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
movups %xmm1, (%r14,%rbx,4)
addq $4, %rbx
cmpq $1000, %rbx # imm = 0x3E8
jne .LBB0_1

===========================
Proposed New Functionality
===========================

To take advantage of the performance benefits of the SVML library, the
proposed solution is to introduce a new LLVM IR pass that translates the
vector math intrinsics to SVML calls. As an example, the LLVM IR above for
"vector.body", introduced in the Problem Description section, would serve as
input to the proposed pass and be transformed into the following LLVM IR. Note
the "__svml_sinf4_ha" call in the LLVM IR and in the resulting assembly
snippet.

vector.body: ; preds = %vector.body, %entry
%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
%0 = trunc i64 %index to i32, !dbg !7
%broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
!dbg !7
%broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
<4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
%induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
!dbg !7
%1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
%vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
%2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
%3 = bitcast float* %2 to <4 x float>*, !dbg !9
store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
%index.next = add i64 %index, 4, !dbg !6
%4 = icmp eq i64 %index.next, 1000, !dbg !6
br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1: # %vector.body
# =>This Inner Loop Header: Depth=1
movd %ebx, %xmm0
pshufd $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
paddd .LCPI0_0(%rip), %xmm0
cvtdq2ps %xmm0, %xmm0
callq __svml_sinf4_ha
movups %xmm0, (%r14,%rbx,4)
addq $4, %rbx
cmpq $1000, %rbx # imm = 0x3E8
jne .LBB0_1

To perform the translation, several requirements must be met to guide code
generation. These include:

1) In addition to the -ffast-math flag, support is needed from Clang to allow
the user to specify the desired precision requirements. The additional flags
needed include the following, where "imf" is shorthand for "Intel math
function".

-fimf-absolute-error=value[:funclist]
    defines the maximum allowable absolute error for math library
    function results
    value    - a positive, floating-point number conforming to the
               format [digits][.digits][{e|E}[sign]digits]
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-accuracy-bits=bits[:funclist]
    defines the relative error, measured by the number of correct bits,
    for math library function results
    bits     - a positive, floating-point number
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-arch-consistency=value[:funclist]
    ensures that the math library functions produce consistent results
    across different implementations of the same architecture
    value    - true or false
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-max-error=ulps[:funclist]
    defines the maximum allowable relative error, measured in ulps, for
    math library function results
    ulps     - a positive, floating-point number conforming to the
               format [digits][.digits][{e|E}[sign]digits]
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-precision=value[:funclist]
    defines the accuracy (precision) for math library functions
    value    - one of the following values:
               high   - equivalent to max-error = 0.6
               medium - equivalent to max-error = 4
               low    - equivalent to accuracy-bits = 11 (single
                        precision); accuracy-bits = 26 (double
                        precision)
    funclist - optional comma-separated list of one or more math
               library functions to which the attribute should be
               applied

-fimf-domain-exclusion=classlist[:funclist]
    indicates the input argument domain on which math functions
    must provide correct results.
    classlist - one or more of the following values:
                nans, infinities, denormals, zeros,
                all, none, common
    funclist  - optional list of one or more math library
                functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each
call site. In the future, this functionality will enable more fine-grained
control over specifying precision for individual calls/regions, instead of
setting the precision requirements for all call instances of a function. Please
note that the example translation presented so far does not have the IMF
attributes attached to the @llvm.sin.v4f32 call; as a result, the default is an
SVML variant marked with "_ha" (max-error = 0.6), which is short for high
accuracy. Other supported variants will include low precision, enhanced
performance, bitwise reproducible, and correctly rounded. Please refer to the
IEEE-754 standard for additional information regarding supported precisions.
The compiler will select the most appropriate variant based on the IMF
attributes. See #2.

2) An interface to query for the appropriate SVML function variant based on the
scalar function name and IMF attributes (a rough sketch of such an interface
follows this list).

3) For calls to math functions that store to memory (e.g., sincos), additional
analysis of the pointer arguments is beneficial in order to generate the
best-performing load/store instructions.
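
To make requirement #2 more concrete, the following is a minimal C++ sketch of
what such a query might look like. The attribute spelling ("imf-max-error"),
the struct and function names, and the single sinf mapping shown are
assumptions made for illustration only; they are not part of the proposal.

  #include "llvm/ADT/StringRef.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  // Hypothetical IMF precision requirements gathered from attributes.
  struct IMFAttrs {
    StringRef MaxErrorULPs = "0.6"; // default selects a high-accuracy (_ha) variant
  };

  // Read the (assumed) "imf-max-error" string attribute if present.
  static IMFAttrs readIMFAttrs(const Function &F) {
    IMFAttrs A;
    if (F.hasFnAttribute("imf-max-error"))
      A.MaxErrorULPs = F.getFnAttribute("imf-max-error").getValueAsString();
    return A;
  }

  // Map a scalar libm name, vector factor, and IMF requirements to an SVML
  // variant name; an empty result means no suitable variant exists.
  static StringRef getSVMLVariant(StringRef ScalarName, unsigned VF,
                                  const IMFAttrs &A) {
    if (ScalarName == "sinf" && VF == 4)
      return A.MaxErrorULPs == "0.6" ? "__svml_sinf4_ha" : "__svml_sinf4";
    return "";
  }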

======================
GCC/ICC Compatibility
======================

The initial implementation will translate six SVML functions: sin, cos, log,
pow, exp, and sincos (both single- and double-precision variants). Support for
these functions matches the current capabilities of GCC and a subset of ICC. As
more functions become open-sourced, the plan is to support them as part of the
final solution determined from this proposal. The flags referenced in the
Proposed New Functionality section are required to maintain ICC compatibility.

=======================
Current Implementation
=======================

To evaluate the feasibility of this proposal, a prototype transform pass has
been developed, which performs the following:

1) Searches for vector math intrinsics as candidates for translation to SVML.

2) Reads function attributes to obtain precision requirements for the call. If
none are present, the pass defaults to attributes that force the selection of a
high-accuracy variant.

3) Since the vector factor of the intrinsic can be wider than what is legally
supported by the target, type legalization is performed so that the correct
SVML variant is selected. For example, if a call to
@llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass will
generate two __svml_sinf4 calls and split %1 appropriately to create the new
arguments for each call. The multiple return vectors are then recombined and
users of the original return vector are updated. The pass is also capable of
handling less-than-full-vector cases, e.g., @llvm.sin.v2f32. (A sketch of this
splitting appears after this list.)

4) Special handling for sincos, since the results are stored to a double-wide
vector and additional analysis is needed to optimize the stores to memory.
Additional shuffling is required to obtain the sin and cos results from the
double-wide vector. (See the second sketch after this list.)

5) Vector intrinsics that are not translated to SVML are scalarized.

6) The loop vectorizer has been taught to allow widening of sincos, and
additional utilities have been written to analyze arguments for sincos.
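
As a rough illustration of the type legalization in item 3, the C++ sketch
below splits an @llvm.sin.v8f32 call into two __svml_sinf4 calls using the
IRBuilder API. The helper name and the way the SVML callee is obtained are
assumptions made for illustration; this is not the prototype's actual code.

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // Split the <8 x float> argument of an @llvm.sin.v8f32 call into two
  // <4 x float> halves, call __svml_sinf4 on each, and concatenate the
  // results back into an <8 x float> for the original users.
  static Value *legalizeSinV8F32(CallInst *CI, FunctionCallee SVMLSinf4) {
    IRBuilder<> B(CI);
    Value *Arg = CI->getArgOperand(0);                    // <8 x float>
    Value *Lo = B.CreateShuffleVector(Arg, {0, 1, 2, 3}); // low half
    Value *Hi = B.CreateShuffleVector(Arg, {4, 5, 6, 7}); // high half
    Value *LoRes = B.CreateCall(SVMLSinf4, {Lo});
    Value *HiRes = B.CreateCall(SVMLSinf4, {Hi});
    return B.CreateShuffleVector(LoRes, HiRes, {0, 1, 2, 3, 4, 5, 6, 7});
  }

The caller would then replace all uses of the original intrinsic with the
returned value and erase the intrinsic call.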
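
For item 4, the extra shuffling could look like the sketch below (using the
same includes as the previous sketch). The layout of the double-wide vector
(sin results in the low lanes, cos results in the high lanes) and the helper
name are assumptions made purely for illustration.

  // Given the double-wide <8 x float> result of a 4-wide sincos variant
  // (assumed layout: lanes 0-3 hold sin, lanes 4-7 hold cos), extract the
  // two <4 x float> halves and store them to the destination pointers.
  static void splitSinCosResult(IRBuilder<> &B, Value *Wide,
                                Value *SinPtr, Value *CosPtr) {
    Value *Sin = B.CreateShuffleVector(Wide, {0, 1, 2, 3}); // sin lanes (assumed)
    Value *Cos = B.CreateShuffleVector(Wide, {4, 5, 6, 7}); // cos lanes (assumed)
    B.CreateStore(Sin, SinPtr);
    B.CreateStore(Cos, CosPtr);
  }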

=========
Feedback
=========

For those who are interested in this topic, I would like to ask for your review
of this proposal, and I would appreciate any and all feedback on the proposed
approach. Help with the development process is also very welcome and much
appreciated.