[cfe-dev] Proposal for function vectorization and loop vectorization with function calls

Wed Mar 2 11:49:05 PST 2016

Proposal for function vectorization and loop vectorization with function calls
==============================================================================
Intel Corporation (3/2/2016)

This is a proposal for initial work towards a Clang and LLVM implementation of
vectorizing functions annotated with OpenMP 4.5's "#pragma omp declare simd"
(called SIMD-enabled functions) and its associated clauses, based on the
VectorABI [2]. On the caller side, we propose to improve LLVM's LoopVectorizer
so that code calling a SIMD-enabled function can be vectorized. On the callee
side, we propose to add Clang FE support for the "#pragma omp declare simd"
syntax and a new pass that transforms the SIMD-enabled function body into a
SIMD loop. This newly created loop can then be fed to LLVM's LoopVectorizer
(or its future enhancement) for vectorization. This work leverages LLVM's
existing LoopVectorizer.

Problem Statement
=================
Currently, if a loop calls a user-defined function or a third-party library
function, the loop can't be vectorized unless the function is inlined. In the
example below, the LoopVectorizer fails to vectorize the k loop because of its
call to "dowork", which is an external function. Inlining "dowork" may enable
vectorization in some cases, but it is not a generally applicable solution,
and there may be reasons why the compiler does not (or cannot) inline the
call. Therefore, there is value in being able to vectorize a loop that
contains a call to "dowork".

#include<stdio.h>
extern float dowork(float *a, int k);

float a[4096];
int main()
{ int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
                     -Rpass-analysis=loop-vectorize loopvec.c
loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
      vectorized [-Rpass-analysis]
    a[k] = dowork(a, k);
           ^
loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
      for more info (Force=true) [-Rpass-missed=loop-vectorize]
  for (k = 0; k < 4096; k++) {
  ^
loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
                loop vectorization [-Wpass-failed]
1 warning generated.

New functionality of Vectorization
==================================
To address the issues stated above, we propose the following new
functionality and enhancements: a) vectorize a function annotated by the
programmer using OpenMP* SIMD extensions; b) enhance LLVM's LoopVectorizer
to vectorize a loop containing a call to a SIMD-enabled function.

For example, when writing:

#include<stdio.h>

#pragma omp declare simd uniform(a) linear(k)
extern float dowork(float *a, int k);

float a[4096];
int main()
{ int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

the programmer asserts that
  a) there will be a vector version of "dowork" available for the compiler to
     use (link with, with appropriate signature, explained below) when
     vectorizing the k loop; and that
  b) no loop-carried backward dependencies are introduced by the "dowork"
     call that prevent the vectorization of the k loop.

The expected vector loop (shown as pseudo code, ignoring leftover iterations)
resulting from LLVM's LoopVectorizer is

  ... ...
  vectorized_for (k = 0; k < 4096; k += VL) {
    a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
    a[k:VL] = _ZGVbN4ul_dowork(a, k);
  }
  ... ...

In this example "_ZGVbN4ul_dowork" is a specially mangled name where:
 _ZGV is a prefix based on the C/C++ name mangling rule suggested by the GCC
      community,
 'b' indicates "xmm" (assume we vectorize here to 128-bit xmm vector registers),
 'N' indicates that the function is vectorized without a mask ('M' indicates
     that the function is vectorized with a mask),
 '4' is VL (assume we vectorize here for length 4),
 'u' indicates that the first parameter has the "uniform" property,
 'l' indicates that the second parameter has the "linear" property.

More details (including the name mangling scheme) can be found in
reference [2].
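
As an illustration of the naming scheme, the following C sketch assembles a
mangled name from its parts. The helper name build_vector_name and its
parameters are hypothetical and are not part of Clang or the VectorABI. The
letter order (ISA, mask, VL, parameter kinds) follows the "_ZGVbN4ul_"
attribute strings used in this proposal, and only the letters described above
are assumed.

```c
#include <stdio.h>

/* Build a VectorABI-style name: _ZGV <isa> <mask> <vlen> <params> _ <name>.
 *   isa:    'b' for 128-bit xmm (the only ISA letter described above)
 *   masked: nonzero selects 'M', zero selects 'N'
 *   vlen:   number of lanes (VL)
 *   params: one letter per parameter, e.g. "ul" for uniform, linear
 * Returns 0 on success, -1 if the buffer is too small.
 * Hypothetical helper for illustration only. */
int build_vector_name(char *buf, size_t size, char isa, int masked,
                      unsigned vlen, const char *params, const char *name)
{
    int n = snprintf(buf, size, "_ZGV%c%c%u%s_%s",
                     isa, masked ? 'M' : 'N', vlen, params, name);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For the example in this proposal,
build_vector_name(buf, sizeof buf, 'b', 0, 4, "ul", "dowork") writes
"_ZGVbN4ul_dowork" into buf.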

References
==========

1. OpenMP SIMD language extensions:
http://www.openmp.org/mp-documents/openmp-4.5.pdf

2. VectorABI Documentation:
https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt

[[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface
        mailing list. The discussion was recorded at
        https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]

3. The first paper on SIMD extensions and implementations:
"Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar,
Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages 2349--2358
[[Note: the first implementation and the paper were done before VectorABI was
        finalized with the GCC community and Red Hat. The latest VectorABI
        version for OpenMP 4.5 is ready to be published.]]

Proposed Implementation
=======================
1. The Clang FE parses "#pragma omp declare simd [clauses]" and generates
   mangled-name prefixes that serve as vector signatures. These prefixes are
   recorded as function attributes in an LLVM function attribute group. Note
   that several mangled names may be associated with the same function,
   corresponding to several desired vectorized versions. The Clang FE
   generates function attributes for all vector variants that the back-end
   is expected to generate. E.g.,

   #pragma omp declare simd uniform(a) linear(k)
   float dowork(float *a, int k)
   {
      a[k] = sinf(a[k]) + 9.8f;
      return a[k];
   }

   define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
   ... ...
   attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}

2. A new vector function generation pass is introduced to generate vector
   variants of the original scalar function based on VectorABI (see [2, 3]).
   For example, one vector variant is generated for "_ZGVbN4ul_" attribute
   as follows (pseudo code):

   define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
   {
     #pragma clang loop vectorize(enable)
     for (int %t = %k; %t < %k + 4; %t++) {
       %a[%t] = sinf(%a[%t]) + 9.8f;
     }
     vec_load xmm0, %a[%k:VL]
     return xmm0;
   }

   The body of the function is wrapped inside a loop having VL iterations,
   which correspond to the vector lanes.

   The LLVM LoopVectorizer will vectorize the generated %t loop, expected
   to produce the following vectorized code eliminating the loop (pseudo code):

   define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
   {
     vec_load xmm1,  %a[k: VL]
     xmm2 = call __svml_sinf(xmm1)
     xmm0 = vec_add  xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
     store %a[k:VL], xmm0
     return xmm0;
   }

   [[Note: Vectorizer support for the Short Vector Math Library (SVML)
           functions will be a separate proposal. ]]

3. The LLVM LoopVectorizer is enhanced to
   a) identify loops with calls that have been annotated with
      "#pragma omp declare simd" by checking function attribute groups;
   b) analyze each call instruction and its parameters in the loop, to
      determine if each parameter has the following properties:
        * uniform
        * linear + stride
        * vector
        * aligned
        * called inside a conditional branch or not
          ... ...
      Based on these properties, the signature of the vectorized call is
      generated; and
   c) perform signature matching to obtain a suitable vector variant
      among the signatures available for the called function. If no such
      signature is found, the call cannot be vectorized.

   Note that a similar enhancement can and should be made also to LLVM's
   SLP vectorizer.

   For example:

   #pragma omp declare simd uniform(a) linear(k)
   extern float dowork(float *a, int k);

   ... ...
   #pragma clang loop vectorize(enable)
   for (k = 0; k < 4096; k++) {
     a[k] = k * 0.5;
     a[k] = dowork(a, k);
   }
   ... ...

   Step a: The "dowork" function is marked as a SIMD-enabled function:
           attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}

   Step b: 1) 'a' is uniform, as it is the base address of array 'a'.
           2) 'k' is linear, as 'k' is the induction variable with stride=1.
           3) SIMD "dowork" is called unconditionally in the candidate k loop.
           4) The loop is compiled for SSE4.1 with vector length VL=4.
           Based on these properties, the signature is "_ZGVbN4ul_".

   [[Note: A conditional call in the loop needs masking support; for
           implementation details see references [1][2][3]. ]]

   Step c: Check if the signature "_ZGVbN4ul_" exists in function attribute #0;
           if so, a suitable vectorized version has been found and the loop
           will be linked with it.

   The loop below is expected to be produced by the LoopVectorizer:
   ... ...
   vectorized_for (k = 0; k < 4096; k += 4) {
     a[k:4] = {k, k+1, k+2, k+3} * 0.5;
     a[k:4] = _ZGVbN4ul_dowork(a, k);
   }
   ... ...
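
Step c amounts to a lookup of the computed signature among the callee's
attribute strings. A minimal C sketch, assuming the attributes are available
as an array of strings (find_vector_variant and this representation are
hypothetical and do not reflect LLVM's attribute API):

```c
#include <stddef.h>
#include <string.h>

/* Return the index of the attribute string equal to `wanted`, or -1 if the
 * callee advertises no such vector variant. `attrs` stands in for the
 * string attributes attached to the callee (an assumption of this sketch). */
int find_vector_variant(const char *const *attrs, size_t count,
                        const char *wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(attrs[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

With the attribute list from Step a, looking up "_ZGVbN4ul_" succeeds, while
a signature that is not listed returns -1 and the call stays scalar.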

[[Note: Vectorizer support for the Short Vector Math Library (SVML) functions
        will be a separate proposal. ]]
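
Putting items 2 and 3 together, the effect of the transformation can be
emulated in portable C. This is only a sketch under stated assumptions:
dowork_v4 is a hypothetical stand-in for the "_ZGVbN4ul_" variant and returns
its lanes through an output array because standard C has no vector return
type; the body of dowork uses a plain addition instead of sinf so that the
sketch builds without a math library; and the scalar remainder loop is an
assumption, since the pseudo code above ignores leftover iterations.

```c
#define VL 4  /* vector length assumed in the proposal's example */

/* Scalar function. Stand-in body: the proposal's example computes
 * sinf(a[k]) + 9.8f; a plain addition is used here (assumption). */
float dowork(float *a, int k)
{
    a[k] = a[k] + 9.8f;
    return a[k];
}

/* Hypothetical stand-in for the 4-lane variant: 'a' is uniform and 'k' is
 * linear with stride 1, so lane i works on index k + i. The scalar body is
 * wrapped in a loop with VL iterations, one per vector lane. */
void dowork_v4(float *a, int k, float out[VL])
{
    for (int lane = 0; lane < VL; lane++)
        out[lane] = dowork(a, k + lane);
}

/* Caller loop strip-mined by VL, followed by a scalar remainder loop. */
void run_loop(float *a, int n)
{
    int k = 0;
    for (; k + VL <= n; k += VL) {
        float lanes[VL];
        for (int i = 0; i < VL; i++)
            a[k + i] = (k + i) * 0.5f;   /* a[k:VL] = {k, ..., k+VL-1} * 0.5 */
        dowork_v4(a, k, lanes);          /* one call covers VL iterations */
        for (int i = 0; i < VL; i++)
            a[k + i] = lanes[i];
    }
    for (; k < n; k++) {                 /* leftover iterations stay scalar */
        a[k] = k * 0.5f;
        a[k] = dowork(a, k);
    }
}
```

Because lane i of dowork_v4 handles index k + i, run_loop(a, n) leaves the
same values in a as the original scalar loop would for the same dowork body.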

GCC and ICC Compatibility
=========================
With this proposal, the callee function and the loop containing a call to it
can each be compiled and vectorized by a different compiler, including
Clang+LLVM with its LoopVectorizer as outlined above, GCC, and ICC. The
vectorized loop will then be linked with the vectorized callee function.
Of course, each of these compilers can also be used to compile both the loop
and the callee function.

Current Implementation Status and Plan
======================================
1. The Clang FE work for item 1 of the proposed implementation has been done
   by the Intel Clang FE team. The Clang FE syntax processing patch is
   implemented and under community review (http://reviews.llvm.org/D10599).
   In general, the review feedback from the Clang community is very positive.

2. A new pass for function vectorization has been implemented to support
   item 2 and is being prepared for LLVM community review.

3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
   with user-defined function calls according to item 3.

Call for Action
===============
1. Please review this proposal and provide constructive feedback on its
   direction and key ideas.

2. Feel free to ask any technical questions related to this proposal and
   to read the associated references.

3. Help is also highly welcome and appreciated in the development and
   upstreaming process.