[llvm-dev] Proposal for function vectorization and loop vectorization with function calls

Wed Mar 2 16:07:40 PST 2016

Hi Tian,

> On Mar 2, 2016, at 3:48 PM, Tian, Xinmin <xinmin.tian at intel.com> wrote:
> 
> Hi Michael. Thanks for your feedback and questions/comments. See below.
> 
>>>>>> I think it should be possible to vectorize such a loop even without OpenMP clauses. We just need to gather a vector value from several scalar calls, and the vectorizer already knows how to do that; we just need to not bail out early. Dealing with calls is tricky, but in this case we have the pragma, so we can assume it should be fine. What do you think, would it make sense to start from here?
> 
> Yes, we can vectorize this loop by calling the scalar code VL times to emulate SIMD execution when no SIMD version exists to start with. We need this functionality anyway as a fallback for vectorizing the loop when no SIMD version of dowork exists. E.g.:
> 
> #pragma clang loop vectorize(enable)
> for (k = 0; k < 4096; k++) {
>    a[k] = k * 0.5;
>    a[k] = dowork(a, k);
> }
> 
> ==>
> 
> Vectorized_for (k = 0; k < 4096; k+=VL) {      // assume VL = 4; no vector version of dowork exists
>    a[k:VL] = {k, k+1, k+2, k+3} * 0.5;     // broadcast 0.5 to a SIMD register, vector mul with {k, k+1, k+2, k+3}, vector store to a[k:VL]
>    t0 = dowork(a, k);                      // emulate SIMD execution with scalar calls
>    t1 = dowork(a, k+1);
>    t2 = dowork(a, k+2);
>    t3 = dowork(a, k+3);
>    a[k:VL] = {t0, t1, t2, t3};             // SIMD store
> }
Yes, that’s what I meant.
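
To make the fallback concrete, here is a minimal C sketch of the scalar-call emulation. It is only an illustration under assumptions: the body of dowork is a made-up stand-in (the real function is external and opaque), VL is fixed at 4, and the remainder loop is left out. The emulation matches the scalar loop only when a call for one lane does not depend on stores made by another lane in the same group.

```c
#include <assert.h>

#define VL 4  /* assumed vector length, as in the pseudo code above */

/* Hypothetical stand-in for the external scalar function.
   It only reads a[k], so lanes are independent of each other. */
static float dowork(float *a, int k) { return a[k] * 2.0f + 1.0f; }

/* Reference: the original scalar loop. */
static void scalar_loop(float *a, int n) {
    for (int k = 0; k < n; k++) {
        a[k] = k * 0.5f;
        a[k] = dowork(a, k);
    }
}

/* Emulated SIMD: VL iterations per trip, one scalar call per lane. */
static void emulated_simd_loop(float *a, int n) {
    assert(n % VL == 0);  /* remainder loop omitted in this sketch */
    for (int k = 0; k < n; k += VL) {
        float t[VL];
        for (int lane = 0; lane < VL; lane++)  /* the part that vectorizes */
            a[k + lane] = (k + lane) * 0.5f;
        for (int lane = 0; lane < VL; lane++)  /* scalar calls, gathered */
            t[lane] = dowork(a, k + lane);
        for (int lane = 0; lane < VL; lane++)  /* stands in for the SIMD store */
            a[k + lane] = t[lane];
    }
}
```

Running both loops over separate 4096-element arrays should leave them with identical contents for this stand-in dowork.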

>>>>>> Am I getting it right that you're going to emit declarations for all possible vector types, and then implement only the ones that are used? If not, how does the frontend know which vector width to use? If the dowork function and its caller are in different modules, how does the compiler communicate which vector widths are needed?
> 
> Yes, you are right in general; that is defined by the VectorABI used by GCC and ICC. E.g., GCC generates 7 versions by default for x86: scalar, SSE (mask, nomask), AVX (mask, nomask), and AVX2 (mask, nomask).
How does it play with other architectures? Should it be described in more general terms, like vector/element width? I realize that you might be mostly concerned about x86, but this feature looks pretty generic, so I think it should be kept target-independent.

> There are several options we can use to reduce the number of versions we need to generate, with respect to compile time and code size. We can provide detailed info.
I’ll be interested in looking into this, as I find this part the most challenging in this changeset (other parts look to me like clear improvements of what we have now).

Thanks,
Michael
> 
>>>>>> Loop Vectorizer already supports math functions and math functions libraries. You might need just to expand this support to SVML (i.e. add tables of correspondence between scalar and vector function variants).
> 
> Correct, that is Step 3 in the doc we are working on.
> 
>>>>>> Again, thanks for writing it up. I think this would be a valuable improvement of the vectorizer and I'm looking forward to further discussion and/or patches!
> 
> Thanks for the positive feedback! We are also looking forward to further discussion and sending patches with help from you and other LLVM community members.  
> 
> Thanks,
> Xinmin 
> 
> -----Original Message-----
> From: mzolotukhin at apple.com [mailto:mzolotukhin at apple.com] 
> Sent: Wednesday, March 2, 2016 2:42 PM
> To: Tian, Xinmin <xinmin.tian at intel.com>
> Cc: llvm-dev at lists.llvm.org; Clang Dev <cfe-dev at lists.llvm.org>; llvm-dev-bounces at lists.llvm.org
> Subject: Re: [llvm-dev] Proposal for function vectorization and loop vectorization with function calls
> 
> Hi Tian,
> 
> Thanks for the writeup, it sounds very interesting! Please find some questions/comments inline:
> 
> 
>> On Mar 2, 2016, at 11:49 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> Proposal for function vectorization and loop vectorization with function calls
>> ==============================================================================
>> Intel Corporation (3/2/2016)
>> 
>> This is a proposal for initial work toward a Clang and LLVM
>> implementation of vectorizing a function annotated with OpenMP 4.5's
>> "#pragma omp declare simd" (named SIMD-enabled function) and its
>> associated clauses, based on the VectorABI [2]. On the caller side, we
>> propose to improve LLVM's LoopVectorizer so that code calling a
>> SIMD-enabled function can be vectorized. On the callee side, we propose
>> to add Clang FE support for the "#pragma omp declare simd" syntax and a
>> new pass that transforms the SIMD-enabled function body into a SIMD loop.
>> This newly created loop can then be fed to LLVM's LoopVectorizer (or its
>> future enhancement) for vectorization. This work leverages LLVM's
>> existing LoopVectorizer.
>> 
>> 
>> Problem Statement
>> =================
>> Currently, if a loop calls a user-defined function or a 3rd party
>> library function, the loop can't be vectorized unless the function is
>> inlined. In the example below, the LoopVectorizer fails to vectorize
>> the k loop because of its call to "dowork", which is an external
>> function. Inlining "dowork" may enable vectorization in some cases, but
>> it is not a generally applicable solution, and there may be reasons why
>> the compiler may not (or can't) inline the call. Therefore, there is
>> value in being able to vectorize the loop with the call to "dowork"
>> left in it.
>> 
>> #include<stdio.h>
>> extern float dowork(float *a, int k);
>> 
>> float a[4096];
>> int main()
>> { int k;
>> #pragma clang loop vectorize(enable)
>> for (k = 0; k < 4096; k++) {
>>   a[k] = k * 0.5;
>>   a[k] = dowork(a, k);
>> }
>> printf("passed %f\n", a[1024]);
>> }
> I think it should be possible to vectorize such a loop even without OpenMP clauses. We just need to gather a vector value from several scalar calls, and the vectorizer already knows how to do that; we just need to not bail out early. Dealing with calls is tricky, but in this case we have the pragma, so we can assume it should be fine. What do you think, would it make sense to start from here?
>> 
>> sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
>>                    -Rpass-analysis=loop-vectorize loopvec.c
>> loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
>>     vectorized [-Rpass-analysis]
>>   a[k] = dowork(a, k);
>>          ^
>> loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
>>     for more info (Force=true) [-Rpass-missed=loop-vectorize]
>>   for (k = 0; k < 4096; k++) {
>>   ^
>> loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
>>               loop vectorization [-Wpass-failed]
>> 1 warning generated.
>> 
>> 
>> New functionality of Vectorization
>> ==================================
>> New functionality and enhancements are proposed to address the issues
>> stated above: a) vectorize a function annotated by the programmer
>> using OpenMP* SIMD extensions; b) enhance LLVM's LoopVectorizer to
>> vectorize a loop containing a call to a SIMD-enabled function.
>> 
>> For example, when writing:
>> 
>> #include<stdio.h>
>> 
>> #pragma omp declare simd uniform(a) linear(k)
>> extern float dowork(float *a, int k);
>> 
>> float a[4096];
>> int main()
>> { int k;
>> #pragma clang loop vectorize(enable)
>> for (k = 0; k < 4096; k++) {
>>   a[k] = k * 0.5;
>>   a[k] = dowork(a, k);
>> }
>> printf("passed %f\n", a[1024]);
>> }
>> 
>> the programmer asserts that
>> a) there will be a vector version of "dowork" available for the compiler to
>>    use (link with, with appropriate signature, explained below) when
>>    vectorizing the k loop; and that
>> b) no loop-carried backward dependencies are introduced by the "dowork"
>>    call that prevent the vectorization of the k loop.
>> 
>> The expected vector loop (shown as pseudo code, ignoring leftover 
>> iterations) resulting from LLVM's LoopVectorizer is
>> 
>> ... ...
>> vectorized_for (k = 0; k < 4096; k += VL) {
>>   a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
>>   a[k:VL] = _ZGVbN4ul_dowork(a, k);
>> }
>> ... ...
>> 
>> In this example "_ZGVbN4ul_dowork" is a special mangled name, where:
>> "_ZGV" is a prefix based on the C/C++ name mangling rule suggested by
>>    the GCC community;
>> 'b' indicates "xmm" (assume we vectorize here to 128-bit xmm vector
>>    registers);
>> 'N' indicates that the function is vectorized without a mask ('M' would
>>    indicate that the function is vectorized with a mask);
>> '4' is VL (assume we vectorize here for length 4);
>> 'u' indicates that the first parameter has the "uniform" property; and
>> 'l' indicates that the second parameter has the "linear" property.
>> 
>> More details (including the name mangling scheme) can be found in
>> reference [2].
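
As a rough illustration of how such a name is put together, here is a small C sketch. It assumes the layout `_ZGV<isa><mask><vlen><params>_<name>`, which is the order used by the attribute strings such as "_ZGVbN4ul_" later in the proposal. The helper name mangle_vector_name and its parameters are hypothetical, and it only handles the pieces described in this mail, not the full set of parameter encodings in the VectorABI.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "_ZGV<isa><mask><vlen><params>_<name>" to buf.
   isa    : ISA letter, e.g. 'b' for 128-bit xmm
   masked : nonzero selects 'M', zero selects 'N'
   vlen   : vector length, e.g. 4
   params : one letter per parameter, e.g. "ul" for uniform, linear
   Returns 0 on success, -1 if the result does not fit in buf. */
static int mangle_vector_name(char *buf, size_t size, char isa, int masked,
                              unsigned vlen, const char *params,
                              const char *name) {
    assert(buf != NULL && params != NULL && name != NULL);
    assert(strlen(name) > 0);
    int n = snprintf(buf, size, "_ZGV%c%c%u%s_%s",
                     isa, masked ? 'M' : 'N', vlen, params, name);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With isa 'b', no mask, VL 4 and params "ul", this yields "_ZGVbN4ul_dowork"; switching the mask flag on yields "_ZGVbM4ul_dowork".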
>> 
>> References
>> ==========
>> 
>> 1. OpenMP SIMD language extensions:
>> http://www.openmp.org/mp-documents/openmp-4.5.pdf
>> 
>> 2. VectorABI Documentation:
>> https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
>> https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt
>> 
>> [[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface
>>       mailing list. The discussion was recorded at
>>       https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 
>> ]]
>> 
>> 3. The first paper on SIMD extensions and implementations:
>> "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on 
>> Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind 
>> Girkar, Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, 
>> pages 2349--2358
>> [[Note: the first implementation and the paper were done before VectorABI was
>>       finalized with the GCC community and Red Hat. The latest VectorABI
>>       version for OpenMP 4.5 is ready to be published]]
>> 
>> 
>> Proposed Implementation
>> =======================
>> 1. Clang FE parses "#pragma omp declare simd [clauses]" and generates mangled
>>  names that include these prefixes as vector signatures. The mangled name
>>  prefixes are recorded as function attributes in an LLVM function attribute
>>  group. Note that several mangled names may be associated with the same
>>  function, corresponding to several desired vectorized versions. Clang FE
>>  generates function attributes for all vector variants that the back-end
>>  is expected to generate. E.g.,
>> 
>>  #pragma omp declare simd uniform(a) linear(k)
>>  float dowork(float *a, int k)
>>  {
>>     a[k] = sinf(a[k]) + 9.8f;
>>     return a[k];
>>  }
>> 
>>  define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
>>  ... ...
>>  attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>> 
>> 2. A new vector function generation pass is introduced to generate vector
>>  variants of the original scalar function based on VectorABI (see [2, 3]).
>>  For example, one vector variant is generated for "_ZGVbN4ul_" attribute
>>  as follows (pseudo code):
>> 
>>  define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>>  {
>>    #pragma clang loop vectorize(enable)
>>    for (int %t = %k; %t < %k + 4; %t++) {
>>      %a[%t] = sinf(%a[%t]) + 9.8f;
>>    }
>>    vec_load xmm0, %a[k:VL]
>>    return xmm0;
>>  }
> Am I getting it right that you're going to emit declarations for all possible vector types, and then implement only the ones that are used? If not, how does the frontend know which vector width to use? If the dowork function and its caller are in different modules, how does the compiler communicate which vector widths are needed?
> 
> 
> 
>> 
>>  The body of the function is wrapped inside a loop having VL iterations,
>>  which correspond to the vector lanes.
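
The wrapping step can be sketched in plain C as follows. This is only an approximation of what the proposed pass would do at the IR level: the function name dowork_vl4 and the result array are hypothetical stand-ins for the mangled variant and its vector return value, and the body uses a simple multiply-add instead of sinf so the sketch does not depend on a math library.

```c
#include <assert.h>

#define VL 4  /* assumed: 128-bit xmm register, four float lanes */

/* Stand-in scalar function (not the sinf-based body from the proposal). */
static float dowork(float *a, int k) {
    a[k] = a[k] * 0.5f + 9.8f;
    return a[k];
}

/* Hypothetical variant for uniform(a) linear(k): the scalar body is
   wrapped in a loop of VL iterations, one per vector lane. A loop
   vectorizer can then turn this loop into straight-line vector code. */
static void dowork_vl4(float *a, int k, float result[VL]) {
    assert(k >= 0);
    for (int t = k; t < k + VL; t++) {
        a[t] = a[t] * 0.5f + 9.8f;  /* original body with k replaced by t */
        result[t - k] = a[t];       /* lane (t - k) of the return value */
    }
}
```

Calling dowork_vl4(a, k, r) should have the same effect as four scalar calls dowork(a, k) through dowork(a, k+3), with r holding the four return values.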
>> 
>>  The LLVM LoopVectorizer is expected to vectorize the generated %t loop
>>  and produce the following vectorized code, eliminating the loop (pseudo code):
>> 
>>  define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>>  {
>>    vec_load xmm1, %a[k:VL]
>>    xmm2 = call __svml_sinf(xmm1)
>>    xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
>>    store %a[k:VL], xmm0
>>    return xmm0;
>>  }
>> 
>>  [[Note: Vectorizer support for the Short Vector Math Library (SVML)
>>          functions will be a separate proposal. ]]
> Loop Vectorizer already supports math functions and math functions libraries. You might need just to expand this support to SVML (i.e. add tables of correspondence between scalar and vector function variants).
>> 
>> 3. The LLVM LoopVectorizer is enhanced to
>>  a) identify loops with calls that have been annotated with
>>     "#pragma omp declare simd" by checking function attribute groups;
>>  b) analyze each call instruction and its parameters in the loop, to
>>     determine if each parameter has the following properties:
>>       * uniform
>>       * linear + stride
>>       * vector
>>       * aligned
>>       * called inside a conditional branch or not
>>         ... ...
>>     Based on these properties, the signature of the vectorized call is
>>     generated; and
>>  c) perform signature matching to obtain the suitable vector variant
>>     among the signatures available for the called function. If no such
>>     signature is found, the call cannot be vectorized.
>> 
>>  Note that a similar enhancement can and should be made also to LLVM's
>>  SLP vectorizer.
>> 
>>  For example:
>> 
>>  #pragma omp declare simd uniform(a) linear(k)
>>  extern float dowork(float *a, int k);
>> 
>>  ... ...
>>  #pragma clang loop vectorize(enable)
>>  for (k = 0; k < 4096; k++) {
>>    a[k] = k * 0.5;
>>    a[k] = dowork(a, k);
>>  }
>>  ... ...
>> 
>>  Step a: "dowork" function is marked as SIMD-enabled function
>>          attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>> 
>>  Step b: 1) 'a' is uniform, as it is the base address of array 'a'
>>          2) 'k' is linear, as 'k' is the induction variable with stride=1
>>          3) SIMD "dowork" is called unconditionally in the candidate k loop.
>>          4) it is compiled for SSE4.1 with vector length VL=4.
>>          Based on these properties, the signature is "_ZGVbN4ul_".
>> 
>>  [[Note: A conditional call in the loop needs masking support; for
>>          implementation details see references [1][2][3]. ]]
>> 
>>  Step c: Check if the signature "_ZGVbN4ul_" exists in function attribute #0;
>>          if yes the suitable vectorized version is found and will be linked
>>          with.
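
Step c amounts to a lookup of the computed signature among the strings recorded on the callee. A minimal C sketch of that lookup is below; the function name find_vector_variant and the flat string array are assumptions for illustration and say nothing about how LLVM would actually store or query function attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup: returns the index of `wanted` among the vector
   signatures recorded for a function, or -1 if no variant matches, in
   which case the call cannot be vectorized with a vector variant. */
static int find_vector_variant(const char *const *available, size_t count,
                               const char *wanted) {
    assert(available != NULL && wanted != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(available[i], wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

For the attribute list { "_ZGVbM4ul_", "_ZGVbN4ul_" }, looking up "_ZGVbN4ul_" finds the second entry, while a signature that is not in the list returns -1.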
>> 
>>  The below loop is expected to be produced by the LoopVectorizer:
>>  ... ...
>>  vectorized_for (k = 0; k < 4096; k += 4) {
>>    a[k:4] = {k, k+1, k+2, k+3} * 0.5;
>>    a[k:4] = _ZGVbN4ul_dowork(a, k);
>>  }
>>  ... ...
>> 
>> [[Note: Vectorizer support for the Short Vector Math Library (SVML) functions
>>       will be a separate proposal. ]]
>> 
>> 
>> GCC and ICC Compatibility
>> =========================
>> With this proposal, the callee function and the loop containing a call
>> to it can each be compiled and vectorized by a different compiler,
>> including Clang+LLVM with its LoopVectorizer as outlined above, GCC, and
>> ICC. The vectorized loop will then be linked with the vectorized callee
>> function. Of course, each of these compilers can also be used to compile
>> both the loop and the callee function.
>> 
>> 
>> Current Implementation Status and Plan 
>> ======================================
>> 1. The Clang FE work is done by the Intel Clang FE team according to #1.
>>  Note: the Clang FE syntax processing patch is implemented and under
>>  community review (http://reviews.llvm.org/D10599). In general, the review
>>  feedback from the Clang community is very positive.
>> 
>> 2. A new pass for function vectorization is implemented to support #2 and
>>  is being prepared for LLVM community review.
>> 
>> 3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
>>  with user-defined function calls according to #3.
>> 
>> Call for Action
>> ===============
>> 1. Please review this proposal and provide constructive feedback on its
>>  direction and key ideas.
>> 
>> 2. Feel free to ask any technical questions related to this proposal and
>>  to read the associated references.
>> 
>> 3. Help is also highly welcome and appreciated in the development and
>>  upstreaming process.
>> 
> 
> Again, thanks for writing it up. I think this would be a valuable improvement of the vectorizer and I'm looking forward to further discussion and/or patches!
> 
> Best regards,
> Michael