[cfe-dev] [llvm-dev] Proposal for function vectorization and loop vectorization with function calls
via cfe-dev
cfe-dev at lists.llvm.org
Fri Mar 18 03:03:26 PDT 2016
Pinging David and Richard!
Yours,
Andrey
> On 3 Mar 2016, at 11:27, Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
>
> Hi David [Majnemer], Richard [Smith],
>
> Front-end wise, the biggest change in this proposal is introduction of
> new mangling for vector functions.
>
> May I ask you to look at the mangling part (sections 1 and 2 in the
> "Proposed Implementation" chapter) and review it?
>
> (Obviously, others who are concerned with how mangling is done in
> Clang are welcome to chime in as well!)
>
> Yours,
> Andrey
>
>
> On Wed, Mar 2, 2016 at 10:49 PM, Tian, Xinmin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Proposal for function vectorization and loop vectorization with function calls
>> ==============================================================================
>> Intel Corporation (3/2/2016)
>>
>> This is a proposal for initial work toward a Clang and LLVM implementation of
>> vectorizing functions annotated with OpenMP 4.5's "#pragma omp declare simd"
>> (called SIMD-enabled functions) and its associated clauses, based on the
>> VectorABI [2]. On the caller side, we propose to improve LLVM's LoopVectorizer
>> so that code calling a SIMD-enabled function can be vectorized. On the callee
>> side, we propose to add Clang FE support for the "#pragma omp declare simd"
>> syntax and a new pass that transforms the SIMD-enabled function body into a
>> SIMD loop. This newly created loop can then be fed to LLVM's LoopVectorizer
>> (or its future enhancement) for vectorization. This work leverages LLVM's
>> existing LoopVectorizer.
>>
>>
>> Problem Statement
>> =================
>> Currently, if a loop calls a user-defined function or a 3rd-party library
>> function, the loop cannot be vectorized unless the function is inlined. In
>> the example below, the LoopVectorizer fails to vectorize the k loop because
>> of its call to "dowork", which is an external function. Note that inlining
>> the "dowork" function may enable vectorization in some cases, but that is not
>> a generally applicable solution, and there may be reasons why the compiler
>> may not (or cannot) inline the "dowork" call. Therefore, there is value in
>> being able to vectorize a loop containing a call to "dowork".
>>
>> #include <stdio.h>
>> extern float dowork(float *a, int k);
>>
>> float a[4096];
>> int main()
>> {
>>   int k;
>>   #pragma clang loop vectorize(enable)
>>   for (k = 0; k < 4096; k++) {
>>     a[k] = k * 0.5;
>>     a[k] = dowork(a, k);
>>   }
>>   printf("passed %f\n", a[1024]);
>> }
>>
>> sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
>> -Rpass-analysis=loop-vectorize loopvec.c
>> loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
>> vectorized [-Rpass-analysis]
>> a[k] = dowork(a, k);
>> ^
>> loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
>> for more info (Force=true) [-Rpass-missed=loop-vectorize]
>> for (k = 0; k < 4096; k++) {
>> ^
>> loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
>> loop vectorization [-Wpass-failed]
>> 1 warning generated.
>>
>>
>> New functionality of Vectorization
>> ==================================
>> To address the issues stated above, we propose the following new
>> functionality and enhancements: a) vectorize a function annotated by the
>> programmer using OpenMP* SIMD extensions; b) enhance LLVM's LoopVectorizer
>> to vectorize a loop containing a call to a SIMD-enabled function.
>>
>> For example, when writing:
>>
>> #include <stdio.h>
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> extern float dowork(float *a, int k);
>>
>> float a[4096];
>> int main()
>> {
>>   int k;
>>   #pragma clang loop vectorize(enable)
>>   for (k = 0; k < 4096; k++) {
>>     a[k] = k * 0.5;
>>     a[k] = dowork(a, k);
>>   }
>>   printf("passed %f\n", a[1024]);
>> }
>>
>> the programmer asserts that
>> a) there will be a vector version of "dowork" available for the compiler to
>> use (link with, with appropriate signature, explained below) when
>> vectorizing the k loop; and that
>> b) no loop-carried backward dependencies are introduced by the "dowork"
>> call that prevent the vectorization of the k loop.
>>
>> The expected vector loop (shown as pseudo code, ignoring leftover iterations)
>> resulting from LLVM's LoopVectorizer is
>>
>> ... ...
>> vectorized_for (k = 0; k < 4096; k += VL) {
>>   a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
>>   a[k:VL] = _ZGVbN4ul_dowork(a, k);
>> }
>> ... ...
>>
>> In this example "_ZGVbN4ul_dowork" is a specially mangled name, where:
>> "_ZGV" is a prefix based on the C/C++ name mangling rule suggested by the
>> GCC community,
>> 'b' indicates "xmm" (we assume vectorization for 128-bit xmm vector
>> registers here),
>> 'N' indicates that the function is vectorized without a mask ('M' would
>> indicate that the function is vectorized with a mask),
>> '4' is VL (we assume a vector length of 4 here),
>> 'u' indicates that the first parameter has the "uniform" property,
>> 'l' indicates that the second parameter has the "linear" property.
>>
>> More details (including the name mangling scheme) can be found in
>> reference [2].
>>
>> References
>> ==========
>>
>> 1. OpenMP SIMD language extensions:
>> http://www.openmp.org/mp-documents/openmp-4.5.pdf
>>
>> 2. VectorABI Documentation:
>> https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
>> https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt
>>
>> [[Note: VectorABI was reviewed at X86-64 System V Application Binary Interface
>> mailing list. The discussion was recorded at
>> https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]
>>
>> 3. The first paper on SIMD extensions and implementations:
>> "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
>> Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar,
>> Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages
>> 2349--2358.
>> [[Note: the first implementation and the paper predate the finalization of
>> VectorABI with the GCC community and Red Hat. The latest VectorABI
>> version for OpenMP 4.5 is ready to be published.]]
>>
>>
>> Proposed Implementation
>> =======================
>> 1. The Clang FE parses "#pragma omp declare simd [clauses]" and generates
>> mangled names including these prefixes as vector signatures. These
>> mangled-name prefixes are recorded as function attributes in an LLVM
>> function attribute group. Note that several mangled names may be associated
>> with the same function, corresponding to several desired vectorized
>> versions. The Clang FE generates function attributes for all vector
>> variants expected to be generated by the back end. E.g.,
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> float dowork(float *a, int k)
>> {
>>   a[k] = sinf(a[k]) + 9.8f;
>>   return a[k];
>> }
>>
>> define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
>> ... ...
>> attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>>
>> 2. A new vector function generation pass is introduced to generate vector
>> variants of the original scalar function based on VectorABI (see [2, 3]).
>> For example, one vector variant is generated for "_ZGVbN4ul_" attribute
>> as follows (pseudo code):
>>
>> define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>> {
>>   #pragma clang loop vectorize(enable)
>>   for (int %t = %k; %t < %k + 4; %t++) {
>>     %a[t] = sinf(%a[t]) + 9.8f;
>>   }
>>   vec_load xmm0, %a[k:VL]
>>   return xmm0;
>> }
>>
>> The body of the function is wrapped inside a loop having VL iterations,
>> which correspond to the vector lanes.
>>
>> The LLVM LoopVectorizer will vectorize the generated %t loop and is expected
>> to produce the following vectorized code, eliminating the loop (pseudo code):
>>
>> define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
>> {
>>   vec_load xmm1, %a[k:VL]
>>   xmm2 = call __svml_sinf(xmm1)
>>   xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
>>   store %a[k:VL], xmm0
>>   return xmm0;
>> }
>>
>> [[Note: Vectorizer support for the Short Vector Math Library (SVML)
>> functions will be a separate proposal.]]
>>
>> 3. The LLVM LoopVectorizer is enhanced to
>> a) identify loops with calls that have been annotated with
>> "#pragma omp declare simd" by checking function attribute groups;
>> b) analyze each call instruction and its parameters in the loop, to
>> determine if each parameter has the following properties:
>> * uniform
>> * linear + stride
>> * vector
>> * aligned
>> * called inside a conditional branch or not
>> ... ...
>> Based on these properties, the signature of the vectorized call is
>> generated; and
>> c) perform signature matching to obtain a suitable vector variant
>> among the signatures available for the called function. If no such
>> signature is found, the call cannot be vectorized.
>>
>> Note that a similar enhancement can and should be made also to LLVM's
>> SLP vectorizer.
>>
>> For example:
>>
>> #pragma omp declare simd uniform(a) linear(k)
>> extern float dowork(float *a, int k);
>>
>> ... ...
>> #pragma clang loop vectorize(enable)
>> for (k = 0; k < 4096; k++) {
>> a[k] = k * 0.5;
>> a[k] = dowork(a, k);
>> }
>> ... ...
>>
>> Step a: "dowork" function is marked as SIMD-enabled function
>> attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}
>>
>> Step b: 1) 'a' is uniform, as it is the base address of array 'a';
>> 2) 'k' is linear, as 'k' is the induction variable with stride=1;
>> 3) SIMD "dowork" is called unconditionally in the candidate k loop;
>> 4) the loop is compiled for SSE4.1 with vector length VL=4.
>> Based on these properties, the signature is "_ZGVbN4ul_".
>>
>> [[Note: a conditional call in the loop requires masking support; see
>> references [1][2][3] for implementation details.]]
>>
>> Step c: Check whether the signature "_ZGVbN4ul_" exists in function
>> attribute group #0; if so, a suitable vectorized version exists and
>> will be linked with.
>>
>> The below loop is expected to be produced by the LoopVectorizer:
>> ... ...
>> vectorized_for (k = 0; k < 4096; k += 4) {
>>   a[k:4] = {k, k+1, k+2, k+3} * 0.5;
>>   a[k:4] = _ZGVbN4ul_dowork(a, k);
>> }
>> ... ...
>>
>>
>> GCC and ICC Compatibility
>> =========================
>> With this proposal, the callee function and the loop containing a call to it
>> can each be compiled and vectorized by a different compiler, including
>> Clang+LLVM with its LoopVectorizer as outlined above, GCC, and ICC. The
>> vectorized loop will then be linked with the vectorized callee function.
>> Of course, each of these compilers can also be used to compile both the
>> loop and the callee function.
>>
>>
>> Current Implementation Status and Plan
>> ======================================
>> 1. The Clang FE work per #1 is done by the Intel Clang FE team. Note: the
>> Clang FE syntax-processing patch is implemented and under community review
>> (http://reviews.llvm.org/D10599). In general, feedback from the Clang
>> community has been very positive.
>>
>> 2. A new pass for function vectorization is implemented to support #2 and
>> is being prepared for LLVM community review.
>>
>> 3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
>> with user-defined function calls according to #3.
>>
>> Call for Action
>> ===============
>> 1. Please review this proposal and provide constructive feedback on its
>> direction and key ideas.
>>
>> 2. Feel free to ask any technical questions related to this proposal and
>> to read the associated references.
>>
>> 3. Help is also highly welcome and appreciated in the development and
>> upstreaming process.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev