[cfe-dev] [llvm-dev] Proposal for function vectorization and loop vectorization with function calls
via cfe-dev
cfe-dev at lists.llvm.org
Fri Mar 18 03:03:26 PDT 2016
Pinging David and Richard!
Yours,
Andrey
> On 3 Mar 2016, at 11:27, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
>
> Hi David [Majnemer], Richard [Smith],
>
> Front-end wise, the biggest change in this proposal is introduction of
> new mangling for vector functions.
>
> May I ask you to look at the mangling part (sections 1 and 2 in the
> "Proposed Implementation" chapter) and review it?
>
> (Obviously, others who are concerned with how mangling is done in
> Clang are welcome to chime in as well!)
>
> Yours,
> Andrey
>
>
> On Wed, Mar 2, 2016 at 10:49 PM, Tian, Xinmin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Proposal for function vectorization and loop vectorization with function calls
>> ==============================================================================
>> Intel Corporation (3/2/2016)
>>
>> This is a proposal for initial work toward a Clang and LLVM implementation of
>> vectorizing functions annotated with OpenMP 4.5's "#pragma omp declare simd"
>> (called SIMD-enabled functions) and its associated clauses, based on the
>> VectorABI [2]. On the caller side, we propose to improve LLVM's LoopVectorizer
>> so that code calling a SIMD-enabled function can be vectorized. On the callee
>> side, we propose to add Clang FE support for the "#pragma omp declare simd"
>> syntax and a new pass that transforms the SIMD-enabled function body into a
>> SIMD loop. This newly created loop can then be fed to LLVM's LoopVectorizer
>> (or its future enhancement) for vectorization. This work leverages LLVM's
>> existing LoopVectorizer.
>>
>>
>> Problem Statement
>> =================
>> Currently, if a loop calls a user-defined function or a 3rd-party library
>> function, the loop cannot be vectorized unless the function is inlined. In
>> the example below, the LoopVectorizer fails to vectorize the k loop because
>> of its call to "dowork", which is an external function. Note that inlining
>> the "dowork" function may enable vectorization in some cases, but that is not
>> a generally applicable solution, and there may be reasons why the compiler
>> may not (or cannot) inline the "dowork" call. Therefore, there is value in
>> being able to vectorize a loop containing a call to "dowork".
>>
>> #include <stdio.h>
>> extern float dowork(float *a, int k);
>>
>> float a[4096];
>> int main()
>> {
>>   int k;
>>   #pragma clang loop vectorize(enable)
>>   for (k = 0; k < 4096; k++) {
>>     a[k] = k * 0.5;
>>     a[k] = dowork(a, k);
>>   }
>>   printf("passed %f\n", a[1024]);
>> }
>>
>> sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
>> -Rpass-analysis=loop-vectorize loopvec.c
>> loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
>> vectorized [-Rpass-analysis]
>> a[k] = dowork(a, k);
>> ^
>> loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
>> for more info (Force=true) [-Rpass-missed=loop-vectorize]
>> for (k = 0; k < 4096; k++) {
>> ^
>> loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
>> loop vectorization [-Wpass-failed]
>> 1 warning generated.
>>
>>
>> New functionality of Vectorization
>> ==================================
>> To address the issues stated above, we propose the following new
>> functionality and enhancements: a) vectorize a function annotated by the
>> programmer using OpenMP* SIMD extensions; b) enhance LLVM's LoopVectorizer
>> to vectorize a loop containing a call to a SIMD-enabled function.
>>
>> For example, when writing:
>>
>> #include <stdio.h>
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> extern float dowork(float *a, int k);
>>
>> float a[4096];
>> int main()
>> {
>>   int k;
>>   #pragma clang loop vectorize(enable)
>>   for (k = 0; k < 4096; k++) {
>>     a[k] = k * 0.5;
>>     a[k] = dowork(a, k);
>>   }
>>   printf("passed %f\n", a[1024]);
>> }
>>
>> the programmer asserts that
>> a) there will be a vector version of "dowork" available for the compiler to
>> use (link with, with appropriate signature, explained below) when
>> vectorizing the k loop; and that
>> b) no loop-carried backward dependencies are introduced by the "dowork"
>> call that prevent the vectorization of the k loop.
>>
>> The expected vector loop (shown as pseudo code, ignoring leftover iterations)
>> resulting from LLVM's LoopVectorizer is
>>
>> ... ...
>> vectorized_for (k = 0; k < 4096; k += VL) {
>>   a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
>>   a[k:VL] = _ZGVbN4ul_dowork(a, k);
>> }
>> ... ...
>>
>> In this example "_ZGVbN4ul_dowork" is a specially mangled name, where:
>> "_ZGV" is a prefix based on the C/C++ name mangling rule suggested by the
>> GCC community,
>> 'b' indicates "xmm" (we assume vectorization for 128-bit xmm vector
>> registers here),
>> 'N' indicates that the function is vectorized without a mask ('M' would
>> indicate that the function is vectorized with a mask),
>> '4' is VL (we assume a vector length of 4 here),
>> 'u' indicates that the first parameter has the "uniform" property,
>> 'l' indicates that the second parameter has the "linear" property.
>>
>> More details (including the name mangling scheme) can be found in
>> reference [2].
>>
>> References
>> ==========
>>
>> 1. OpenMP SIMD language extensions:
>> http://www.openmp.org/mp-documents/openmp-4.5.pdf
>>
>> 2. VectorABI Documentation:
>> https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
>> https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt
>>
>> [[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface
>> mailing list. The discussion was recorded at
>> https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]
>>
>> 3. The first paper on SIMD extensions and implementations:
>> "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
>> Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar,
>> Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages
>> 2349--2358.
>> [[Note: the first implementation and the paper predate the finalization of
>> VectorABI with the GCC community and Red Hat. The latest VectorABI
>> version for OpenMP 4.5 is ready to be published.]]
>>
>>
>> Proposed Implementation
>> =======================
>> 1. The Clang FE parses "#pragma omp declare simd [clauses]" and generates
>> mangled names including these prefixes as vector signatures. These
>> mangled-name prefixes are recorded as function attributes in an LLVM
>> function attribute group. Note that several mangled names may be associated
>> with the same function, corresponding to several desired vectorized
>> versions. The Clang FE generates function attributes for all vector
>> variants expected to be generated by the back end. E.g.,
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> float dowork(float *a, int k)
>> {
>>   a[k] = sinf(a[k]) + 9.8f;
>>   return a[k];
>> }
>>
>> define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
>> ... ...
>> attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>>
>> 2. A new vector function generation pass is introduced to generate vector
>> variants of the original scalar function based on VectorABI (see [2, 3]).
>> For example, one vector variant is generated for "_ZGVbN4ul_" attribute
>> as follows (pseudo code):
>>
>> define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>> {
>>   #pragma clang loop vectorize(enable)
>>   for (int %t = %k; %t < %k + 4; %t++) {
>>     %a[t] = sinf(%a[t]) + 9.8f;
>>   }
>>   vec_load xmm0, %a[k:VL]
>>   return xmm0;
>> }
>>
>> The body of the function is wrapped inside a loop having VL iterations,
>> which correspond to the vector lanes.
>>
>> The LLVM LoopVectorizer will vectorize the generated %t loop and is expected
>> to produce the following vectorized code, eliminating the loop (pseudo code):
>>
>> define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>> {
>>   vec_load xmm1, %a[k:VL]
>>   xmm2 = call __svml_sinf(xmm1)
>>   xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
>>   store %a[k:VL], xmm0
>>   return xmm0;
>> }
>>
>> [[Note: Vectorizer support for the Short Vector Math Library (SVML)
>> functions will be a separate proposal.]]
>>
>> 3. The LLVM LoopVectorizer is enhanced to
>> a) identify loops with calls that have been annotated with
>> "#pragma omp declare simd" by checking function attribute groups;
>> b) analyze each call instruction and its parameters in the loop, to
>> determine if each parameter has the following properties:
>> * uniform
>> * linear + stride
>> * vector
>> * aligned
>> * called inside a conditional branch or not
>> ... ...
>> Based on these properties, the signature of the vectorized call is
>> generated; and
>> c) perform signature matching to obtain a suitable vector variant
>> among the signatures available for the called function. If no such
>> signature is found, the call cannot be vectorized.
>>
>> Note that a similar enhancement can and should be made also to LLVM's
>> SLP vectorizer.
>>
>> For example:
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> extern float dowork(float *a, int k);
>>
>> ... ...
>> #pragma clang loop vectorize(enable)
>> for (k = 0; k < 4096; k++) {
>> a[k] = k * 0.5;
>> a[k] = dowork(a, k);
>> }
>> ... ...
>>
>> Step a: "dowork" function is marked as SIMD-enabled function
>> attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>>
>> Step b: 1) 'a' is uniform, as it is the base address of array 'a';
>> 2) 'k' is linear, as 'k' is the induction variable with stride=1;
>> 3) SIMD "dowork" is called unconditionally in the candidate k loop;
>> 4) the loop is compiled for SSE4.1 with vector length VL=4.
>> Based on these properties, the signature is "_ZGVbN4ul_".
>>
>> [[Note: a conditional call in the loop requires masking support; see
>> references [1][2][3] for implementation details.]]
>>
>> Step c: Check whether the signature "_ZGVbN4ul_" exists in function
>> attribute group #0; if so, a suitable vectorized version exists and
>> will be linked with.
>>
>> The below loop is expected to be produced by the LoopVectorizer:
>> ... ...
>> vectorized_for (k = 0; k < 4096; k += 4) {
>>   a[k:4] = {k, k+1, k+2, k+3} * 0.5;
>>   a[k:4] = _ZGVbN4ul_dowork(a, k);
>> }
>> ... ...
>>
>>
>> GCC and ICC Compatibility
>> =========================
>> With this proposal, the callee function and the loop containing a call to it
>> can each be compiled and vectorized by a different compiler, including
>> Clang+LLVM with its LoopVectorizer as outlined above, GCC, and ICC. The
>> vectorized loop will then be linked with the vectorized callee function.
>> Of course, each of these compilers can also be used to compile both the
>> loop and the callee function.
>>
>>
>> Current Implementation Status and Plan
>> ======================================
>> 1. The Clang FE work per #1 is done by the Intel Clang FE team. Note: the
>> Clang FE syntax-processing patch is implemented and under community review
>> (http://reviews.llvm.org/D10599). In general, feedback from the Clang
>> community has been very positive.
>>
>> 2. A new pass for function vectorization is implemented to support #2 and
>> is being prepared for LLVM community review.
>>
>> 3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
>> with user-defined function calls according to #3.
>>
>> Call for Action
>> ===============
>> 1. Please review this proposal and provide constructive feedback on its
>> direction and key ideas.
>>
>> 2. Feel free to ask any technical questions related to this proposal and
>> to read the associated references.
>>
>> 3. Help is also highly welcome and appreciated in the development and
>> upstreaming process.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev