[llvm-dev] Proposal for function vectorization and loop vectorization with function calls

Thu Mar 3 02:27:47 PST 2016

Hi David [Majnemer], Richard [Smith],

Front-end wise, the biggest change in this proposal is the introduction of
a new mangling for vector functions.

May I ask you to look at the mangling part (sections 1 and 2 in the
"Proposed Implementation" chapter) and review it?

(Obviously, others who are concerned with how mangling is done in
Clang are welcome to chime in as well!)

Yours,
Andrey

On Wed, Mar 2, 2016 at 10:49 PM, Tian, Xinmin via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Proposal for function vectorization and loop vectorization with function calls
> ==============================================================================
> Intel Corporation (3/2/2016)
>
> This is a proposal for initial work towards a Clang and LLVM implementation of
> vectorizing a function annotated with OpenMP 4.5's "#pragma omp declare simd"
> (called a SIMD-enabled function) and its associated clauses, based on the
> VectorABI [2]. On the caller side, we propose to improve LLVM's LoopVectorizer
> so that code calling a SIMD-enabled function can be vectorized. On the callee
> side, we propose to add Clang FE support for the "#pragma omp declare simd"
> syntax and a new pass that transforms the SIMD-enabled function body into a
> SIMD loop. This newly created loop can then be fed to LLVM's LoopVectorizer
> (or its future enhancement) for vectorization, so the work leverages LLVM's
> existing LoopVectorizer.
>
>
> Problem Statement
> =================
> Currently, if a loop calls a user-defined function or a 3rd party library
> function, the loop can't be vectorized unless the function is inlined. In the
> example below the LoopVectorizer fails to vectorize the k loop due to its
> function call to "dowork" because "dowork" is an external function. Note that
> inlining the "dowork" function may result in vectorization for some of the
> cases, but that is not a generally applicable solution. Also, there may be
> reasons why the compiler may not (or can't) inline the "dowork" function call.
> Therefore, there is value in being able to vectorize the loop with a call to
> "dowork" function in it.
>
> #include<stdio.h>
> extern float dowork(float *a, int k);
>
> float a[4096];
> int main()
> { int k;
> #pragma clang loop vectorize(enable)
>   for (k = 0; k < 4096; k++) {
>     a[k] = k * 0.5;
>     a[k] = dowork(a, k);
>   }
>   printf("passed %f\n", a[1024]);
> }
>
> sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
>                      -Rpass-analysis=loop-vectorize loopvec.c
> loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
>       vectorized [-Rpass-analysis]
>     a[k] = dowork(a, k);
>            ^
> loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
>       for more info (Force=true) [-Rpass-missed=loop-vectorize]
>   for (k = 0; k < 4096; k++) {
>   ^
> loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
>                 loop vectorization [-Wpass-failed]
> 1 warning generated.
>
>
> New functionality of Vectorization
> ==================================
> New functionality and enhancements are proposed to address the issues
> stated above. These include: a) vectorizing a function annotated by the
> programmer using OpenMP* SIMD extensions; b) enhancing LLVM's LoopVectorizer
> to vectorize a loop containing a call to a SIMD-enabled function.
>
> For example, when writing:
>
> #include<stdio.h>
>
> #pragma omp declare simd uniform(a) linear(k)
> extern float dowork(float *a, int k);
>
> float a[4096];
> int main()
> { int k;
> #pragma clang loop vectorize(enable)
>   for (k = 0; k < 4096; k++) {
>     a[k] = k * 0.5;
>     a[k] = dowork(a, k);
>   }
>   printf("passed %f\n", a[1024]);
> }
>
> the programmer asserts that
>   a) there will be a vector version of "dowork" available for the compiler to
>      use (link with, with appropriate signature, explained below) when
>      vectorizing the k loop; and that
>   b) no loop-carried backward dependencies are introduced by the "dowork"
>      call that prevent the vectorization of the k loop.
>
> The expected vector loop (shown as pseudo code, ignoring leftover iterations)
> resulting from LLVM's LoopVectorizer is
>
>   ... ...
>   vectorized_for (k = 0; k < 4096; k += VL) {
>     a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
>     a[k:VL] = _ZGVbN4ul_dowork(a, k);
>   }
>   ... ...
>
> In this example "_ZGVbN4ul_dowork" is a specially mangled name where:
>  _ZGV is a prefix based on the C/C++ name mangling rule suggested by the GCC
>      community,
>  'b' indicates "xmm" (assume we vectorize here to 128-bit xmm vector registers),
>  'N' indicates that the function is vectorized without a mask ('M' would
>      indicate that the function is vectorized with a mask),
>  '4' is VL (assume we vectorize here for length 4),
>  'u' indicates that the first parameter has the "uniform" property,
>  'l' indicates that the second parameter has the "linear" property.
>
> More details (including the name mangling scheme) can be found in
> reference [2].
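
As a rough illustration of how the pieces of such a name fit together, here
is a minimal C sketch. It assumes the layout
_ZGV<isa><mask><vlen><parameter kinds>_<scalar name> that the attribute
strings in this proposal follow. The helper name mangle_vector_name and its
parameters are hypothetical and are not part of Clang, LLVM, or the VectorABI
text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a Clang/LLVM API): builds a vector-variant name
 * with the layout _ZGV<isa><mask><vlen><param kinds>_<scalar name>.
 * Returns the length snprintf reports for the full name. */
int mangle_vector_name(char *out, size_t out_size, char isa, int masked,
                       unsigned vlen, const char *param_kinds,
                       const char *scalar_name)
{
    assert(out != NULL && out_size > 0);
    /* 'M' marks a masked variant, 'N' an unmasked one. */
    return snprintf(out, out_size, "_ZGV%c%c%u%s_%s", isa,
                    masked ? 'M' : 'N', vlen, param_kinds, scalar_name);
}
```

Calling mangle_vector_name(buf, sizeof buf, 'b', 0, 4, "ul", "dowork") fills
buf with "_ZGVbN4ul_dowork", the unmasked xmm variant discussed above.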
>
> References
> ==========
>
> 1. OpenMP SIMD language extensions:
> http://www.openmp.org/mp-documents/openmp-4.5.pdf
>
> 2. VectorABI Documentation:
> https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
> https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt
>
> [[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface
>         mailing list. The discussion was recorded at
>         https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]
>
> 3. The first paper on SIMD extensions and implementations:
> "Compiling C/C++ SIMD Extensions for Function and Loop Vectorizaion on
> Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar,
> Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages 2349--2358
> [[Note: the first implementation and the paper were done before VectorABI was
>         finalized with the GCC community and Red Hat. The latest VectorABI
>         version for OpenMP 4.5 is ready to be published]]
>
>
> Proposed Implementation
> =======================
> 1. Clang FE parses "#pragma omp declare simd [clauses]" and generates
>    mangled-name prefixes that serve as vector signatures. These prefixes are
>    recorded as function attributes in an LLVM function attribute group. Note
>    that several mangled names may be associated with the same function,
>    corresponding to several desired vectorized versions. Clang FE generates
>    function attributes for all vector variants that the back-end is expected
>    to generate. E.g.,
>
>    #pragma omp declare simd uniform(a) linear(k)
>    float dowork(float *a, int k)
>    {
>       a[k] = sinf(a[k]) + 9.8f;
>       return a[k];
>    }
>
>    define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
>    ... ...
>    attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>
> 2. A new vector function generation pass is introduced to generate vector
>    variants of the original scalar function based on VectorABI (see [2, 3]).
>    For example, one vector variant is generated for "_ZGVbN4ul_" attribute
>    as follows (pseudo code):
>
>    define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>    {
>      #pragma clang loop vectorize(enable)
>      for (int %t = %k; %t < %k + 4; %t++) {
>        %a[%t] = sinf(%a[%t]) + 9.8f;
>      }
>      vec_load xmm0, %a[%k:VL]
>      return xmm0;
>    }
>
>    The body of the function is wrapped inside a loop having VL iterations,
>    which correspond to the vector lanes.
>
>    The LLVM LoopVectorizer will vectorize the generated %t loop and is expected
>    to produce the following vectorized code, eliminating the loop (pseudo code):
>
>    define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>    {
>      vec_load xmm1, %a[%k:VL]
>      xmm2 = call __svml_sinf(xmm1)
>      xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
>      store %a[%k:VL], xmm0
>      return xmm0;
>    }
>
>    [[Note: Vectorizer support for the Short Vector Math Library (SVML)
>            functions will be a separate proposal. ]]
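
The "one loop iteration per vector lane" idea from item 2 can be pictured in
plain C as follows. This is only a sketch under stated assumptions: the names
dowork_lanes and VL are hypothetical, the result is written to an output
array because portable C has no vector return type, and the scalar body uses
a multiply as a stand-in for sinf so that the example builds without a math
library.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4 /* assumed vector length, matching the "4" in _ZGVbN4ul_ */

/* Scalar function; the multiply is a stand-in for the sinf call in the
 * proposal's example. */
float dowork(float *a, int k)
{
    a[k] = a[k] * 0.5f + 9.8f;
    return a[k];
}

/* Hypothetical C-level picture of the generated variant: 'a' is uniform
 * (the same pointer for every lane) and 'k' is linear with stride 1, so
 * lane i works on index k + i. */
void dowork_lanes(float *a, int k, float out[VL])
{
    assert(a != NULL && out != NULL);
    for (int lane = 0; lane < VL; lane++)
        out[lane] = dowork(a, k + lane);
}
```

A loop of this shape has a fixed trip count of VL and no cross-iteration
dependence on out, which is what makes it a natural candidate for the
LoopVectorizer.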
>
> 3. The LLVM LoopVectorizer is enhanced to
>    a) identify loops with calls that have been annotated with
>       "#pragma omp declare simd" by checking function attribute groups;
>    b) analyze each call instruction and its parameters in the loop, to
>       determine which of the following properties apply:
>         * uniform
>         * linear + stride
>         * vector
>         * aligned
>         * called inside a conditional branch or not
>           ... ...
>       Based on these properties, the signature of the vectorized call is
>       generated; and
>    c) perform signature matching to obtain the suitable vector variant
>       among the signatures available for the called function. If no such
>       signature is found, the call cannot be vectorized.
>
>    Note that a similar enhancement can and should be made also to LLVM's
>    SLP vectorizer.
>
>    For example:
>
>    #pragma omp declare simd uniform(a) linear(k)
>    extern float dowork(float *a, int k);
>
>    ... ...
>    #pragma clang loop vectorize(enable)
>    for (k = 0; k < 4096; k++) {
>      a[k] = k * 0.5;
>      a[k] = dowork(a, k);
>    }
>    ... ...
>
>    Step a: "dowork" function is marked as SIMD-enabled function
>            attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>
>    Step b: 1) 'a' is uniform, as it is the base address of array 'a'
>            2) 'k' is linear, as 'k' is the induction variable with stride=1
>            3) SIMD "dowork" is called unconditionally in the candidate k loop.
>            4) it is compiled for SSE4.1 with the vector length VL=4.
>            Based on these properties, the signature is "_ZGVbN4ul_".
>
>    [[Note: A conditional call in the loop needs masking support; see
>            references [1][2][3] for the implementation details. ]]
>
>    Step c: Check if the signature "_ZGVbN4ul_" exists in function attribute #0;
>            if so, a suitable vectorized version has been found and will be
>            linked against.
>
>    The loop below is expected to be produced by the LoopVectorizer:
>    ... ...
>    vectorized_for (k = 0; k < 4096; k += 4) {
>      a[k:4] = {k, k+1, k+2, k+3} * 0.5;
>      a[k:4] = _ZGVbN4ul_dowork(a, k);
>    }
>    ... ...
>
> [[Note: Vectorizer support for the Short Vector Math Library (SVML) functions
>         will be a separate proposal. ]]
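
The signature check described in step c of item 3 amounts to a string lookup
among the variant prefixes recorded for the callee. A minimal C sketch of
that lookup follows. The function name has_vector_variant and the
representation of the attributes as an array of strings are assumptions made
for illustration; they are not how LLVM stores or queries function
attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an LLVM API): returns 1 if the signature the
 * vectorizer derived for a call site is among the variant prefixes
 * recorded for the callee, and 0 otherwise. */
int has_vector_variant(const char *const variants[], size_t count,
                       const char *wanted)
{
    assert(variants != NULL || count == 0);
    assert(wanted != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(variants[i], wanted) == 0)
            return 1; /* a matching vector variant is declared */
    }
    return 0; /* no match: the call cannot be vectorized */
}
```

With the two prefixes from the example, "_ZGVbM4ul_" and "_ZGVbN4ul_", a
lookup for "_ZGVbN4ul_" succeeds, while a lookup for a signature that was
never declared fails and the call stays scalar.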
>
>
> GCC and ICC Compatibility
> =========================
> With this proposal the callee function and the loop containing a call to it
> can each be compiled and vectorized by a different compiler, including
> Clang+LLVM with its LoopVectorizer as outlined above, GCC and ICC. The
> vectorized loop will then be linked with the vectorized callee function.
> Of course, each of these compilers can also be used to compile both the loop
> and the callee function.
>
>
> Current Implementation Status and Plan
> ======================================
> 1. The Clang FE work is done by the Intel Clang FE team according to #1.
>    Note: the Clang FE syntax-processing patch is implemented and under
>    community review (http://reviews.llvm.org/D10599). In general, the review
>    feedback from the Clang community is very positive.
>
> 2. A new pass for function vectorization has been implemented to support #2
>    and is being prepared for LLVM community review.
>
> 3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
>    with user-defined function calls according to #3.
>
> Call for Action
> ===============
> 1. Please review this proposal and provide constructive feedback on its
>    direction and key ideas.
>
> 2. Feel free to ask any technical questions related to this proposal and
>    to read the associated references.
>
> 3. Help is also highly welcome and appreciated in the development and
>    upstreaming process.
>