[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Francesco Petrogalli via llvm-dev llvm-dev at lists.llvm.org
Mon Dec 12 05:44:45 PST 2016

Hi Xinmin,

I have updated the clang patch using the standard name mangling you
suggested - I was not fully aware of the C++ mangling convention “_ZGV”.

I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON
vector symbols look as follows:


Here “Q” means NEON 128-bit and “D” means NEON 64-bit.

Please notice that although I have changed the name mangling in clang [1],
there has been no need to update the related llvm patch [2], as the
vectorisation process is _independent_ of the name mangling.



[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249. The only update was a bug fix in the
copy constructor of the TLII and in the return value of the TLII::mangle()
method. None of the underlying scalar/vector function matching algorithms
has been touched.

On 08/12/2016 18:11, "Tian, Xinmin" <xinmin.tian at intel.com> wrote:

>Hi Francesco, a bit more information.  GCC veclib is implemented based on
>the GCC VectorABI for declare simd as well.
>For name mangling, we have to follow certain rules of C/C++ (e.g. the
>prefix needs to be _ZGV ....).  David Majnemer is the owner and
>stakeholder for approval for Clang and LLVM.  Also, we need to pay
>attention to GCC compatibility.  I would suggest you look into how the
>GCC VectorABI can be extended to support your Arch.
>-----Original Message-----
>From: Odeh, Saher 
>Sent: Thursday, December 8, 2016 3:49 AM
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org;
>Francesco.Petrogalli at arm.com
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; a.bataev at hotmail.com
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>Hi Francesco,
>As you stated in the RFC, when vectorizing a scalar function (e.g. when
>using omp declare simd), one needs to attach attributes to the
>resulting vectorized function.
>These attributes describe a) the behavior of the function, e.g. maskable
>or not, and b) the type of the parameters, e.g. scalar or linear or any
>other option.
>As this list is extensive, it is only logical to use the existing
>infrastructure of the ICC and GCC vector ABI, which already covers all of
>these options, as stated in Xinmin's RFC.
>Moreover, when considering other compilers such as GCC, I do see that the
>resulting assembly actually does incorporate this exact infrastructure.
>So if we wish to link different parts of the program using clang and GCC
>we'll need to adhere to the same name mangling/ABI. Please see the below
>result after compiling an omp declare simd function using GCC.
>Lastly, please note that two of the three components of the
>implementation have already been committed or submitted, and both
>adhere to the name mangling proposed in Xinmin's RFC:
>A) Committed - the FE portion by Alexey
>   [https://reviews.llvm.org/rL264853]. It generates mangled names in the
>   manner described by Xinmin's RFC (see below).
>B) Submitted - the callee side by Matt
>   [https://reviews.llvm.org/D22792]. It uses these mangled names.
>C) The caller side, which is covered by this patch.
>In order to mitigate the needed effort and possible issues when
>implementing, I believe it is best to follow the name mangling proposed
>in Xinmin's RFC. What do you think?
>GCC Example
>Compiler version: GCC 6.1.0
>Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s
>#include <omp.h>
>#pragma omp declare simd
>int dowork(int* a, int idx)
>{
> return a[idx] * a[idx]*7;
>}
>less omp.s | grep @function
>        .type   dowork, @function
>        .type   _ZGVbN4vv_dowork, @function
>        .type   _ZGVbM4vv_dowork, @function
>        .type   _ZGVcN4vv_dowork, @function
>        .type   _ZGVcM4vv_dowork, @function
>        .type   _ZGVdN8vv_dowork, @function
>        .type   _ZGVdM8vv_dowork, @function
>        .type   _ZGVeN16vv_dowork, @function
>        .type   _ZGVeM16vv_dowork, @function
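The symbols in the GCC listing above follow the x86 vector ABI layout `_ZGV<isa><mask><vlen><params>_<scalar name>`, where the ISA letter is b (SSE), c (AVX), d (AVX2) or e (AVX-512), the mask letter is M (masked) or N (unmasked), and each `v` in the parameter string stands for a vector parameter. As a minimal sketch of how such a name can be taken apart, the C++ below splits it into fields. `VariantName` and `parseVariantName` are hypothetical names made up for this illustration and are not part of clang, LLVM or GCC. The sketch also assumes the parameter string contains no underscore.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type: the fields of a "_ZGV" vector-variant name.
struct VariantName {
  char Isa = 0;        // 'b' = SSE, 'c' = AVX, 'd' = AVX2, 'e' = AVX-512
  bool Masked = false; // 'M' = masked variant, 'N' = unmasked variant
  unsigned VLen = 0;   // number of lanes
  std::string Params;  // parameter tokens, e.g. "vv" (v = vector)
  std::string Scalar;  // name of the scalar function
};

// Hypothetical parser. Returns false if S does not look like
// _ZGV<isa><mask><vlen><params>_<scalar name>.
bool parseVariantName(const std::string &S, VariantName &Out) {
  const std::string Prefix = "_ZGV";
  if (S.compare(0, Prefix.size(), Prefix) != 0)
    return false;
  std::size_t I = Prefix.size();
  if (I + 2 > S.size())
    return false;
  VariantName V;
  V.Isa = S[I++];
  const char Mask = S[I++];
  if (Mask != 'M' && Mask != 'N')
    return false;
  V.Masked = (Mask == 'M');
  const std::size_t DigitsBegin = I;
  while (I < S.size() && std::isdigit(static_cast<unsigned char>(S[I])))
    V.VLen = V.VLen * 10 + static_cast<unsigned>(S[I++] - '0');
  if (I == DigitsBegin)
    return false; // no vector length found
  // Assumption of this sketch: the first '_' after the vector length
  // separates the parameter tokens from the scalar name.
  const std::size_t Sep = S.find('_', I);
  if (Sep == std::string::npos || Sep + 1 >= S.size())
    return false;
  V.Params = S.substr(I, Sep - I);
  V.Scalar = S.substr(Sep + 1);
  Out = V;
  return true;
}

// Usage on one of the names from the GCC listing.
inline void exampleParse() {
  VariantName V;
  const bool Ok = parseVariantName("_ZGVdN8vv_dowork", V);
  assert(Ok && V.Isa == 'd' && !V.Masked && V.VLen == 8);
  (void)Ok;
}
```

Read this way, `_ZGVdN8vv_dowork` is the unmasked AVX2 variant of `dowork` with 8 lanes and two vector parameters.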
>Clang on FE using Alexey's patch
>Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all >&
>#pragma omp declare simd
>extern int dowork(int* a, int idx)
>{
>  return a[idx]*7;
>}
>int main() {
>  dowork(0,1);
>}
>attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork" "_ZGVbN4vv_dowork"
>"_ZGVcM8vv_dowork" "_ZGVcN8vv_dowork" "_ZGVdM8vv_dowork"
>"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork" "_ZGVeN16vv_dowork"
>"disable-tail-calls"="false" "less-precise-fpmad"="false"
>"no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
>"no-infs-fp-math"="false" "no-jump-tables"="false"
>"no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
>"no-trapping-math"="false" "stack-protector-buffer-size"="8"
>"target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
>"unsafe-fp-math"="false" "use-soft-float"="false" }
>Thanks Saher
>-----Original Message-----
>From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
>Sent: Tuesday, December 06, 2016 17:22
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; a.bataev at hotmail.com
>Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>Hi Xinmin,
>Thank you for your email.
>I have been catching up with the content of your proposal, and I have
>some questions/remarks below that I'd like to discuss with you - see the
>final section in the proposal.
>I have specifically added Alexey B. to the mail so we can move our
>conversation from phabricator to the mailing list.
>Before we start, I just want to mention that the initial idea of using
>llvm::FunctionType for vector function generation and matching has been
>proposed by a colleague, Paul Walker, when we first tried out supporting
>this on AArch64 on an internal version of llvm. I received some input
>also from Amara Emerson.
>In our case we had a slightly different problem to solve: we wanted to
>support in the vectorizer a rich set of vector math routines provided
>with an external library. We managed to do this by adding the pragma to
>the (scalar) function declaration of the header file provided with the
>library, and as shown by the patches I have submitted, by generating
>vector function signatures that the vectorizer can search in the
>Here is an updated version of the proposal. Please let me know what you
>think, and if you have any solution we could use for the final section.
># RFC for "pragma omp declare simd"
>High-level components:
>A) Global variable generator (clang FE)
>B) Parameter descriptors (as new enumerations in llvm::Attribute)
>C) TLII methods and fields for the multimap (llvm middle-end)
>## Workflow
>Example user input, with a declaration and definition:
>    #pragma omp declare simd
>    #pragma omp declare simd uniform(y)
>    extern double pow(double x, double y);
>    #pragma omp declare simd
>    #pragma omp declare simd linear(x:2)
>    float foo(float x) {....}
>    /// code using both functions
>### Step 1
>The compiler FE processes this declaration and definition and generates a
>list of globals as follows:
>    @prefix_vector_pow1_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x double>,
>                                                          <4 x double>)
>    @prefix_vector_pow2_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x double>,
>                                                          double)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float>,
>                                                         <8 x float>)
>    @prefix_vector_foo2_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float>,
>                                                         <8 x float> #0)
>    ...
>    attribute #0  = {linear = 2}
>Notes about step 1:
>1. The mapping scalar name <-> vector name is in the
>   prefix/midfix/postfix mangled name of the global variable.
>2. The example shows only one set of possible vector functions for a
>   sizeof(<4 x double>) vector extension. If multiple vector extensions
>   live in the same target (e.g. NEON 64-bit and NEON 128-bit, or SSE
>   and AVX512) the front end takes care of generating each of the
>   associated functions (as is done now).
>3. Vector function parameters are rendered using the same
>   Characteristic Data Type (CDT) rule already in the compiler FE.
>4. Uniform parameters are rendered with the original scalar type.
>5. Linear parameters are rendered with vectors using the same
>   CDT-generated vector length, and decorated with proper
>   attributes. I think we could extend the llvm::Attribute enumeration
>   by adding the following:
>   - linear: numeric, specifies the step
>   - linear_var: numeric, specifies the position of the uniform variable
>     holding the step
>   - linear_uval[_var]: numeric as before, but for the "uval" modifier
>     (both constant step and variable step)
>   - linear_val[_var]: numeric as before, but for the "val" modifier
>   - linear_ref[_var]: numeric as before, but for the "ref" modifier
>   For example, "attribute #0 = {linear = 2}" says that the vector of
>   the associated parameter in the function signature has a linear
>   step of 2.
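To make the meaning of a linear step concrete: for a parameter declared `linear(x:2)`, lane i of the vector argument corresponds to the scalar value `x + 2*i`. The sketch below only illustrates that arithmetic. `expandLinear` is a hypothetical helper invented for this note, not something in the patches.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the per-lane values of a linear parameter.
// Lane I holds Base + Step * I, which is what a "linear = Step"
// attribute on a vector parameter is meant to describe.
std::vector<float> expandLinear(float Base, float Step, unsigned VLen) {
  std::vector<float> Lanes(VLen);
  for (unsigned I = 0; I < VLen; ++I)
    Lanes[I] = Base + Step * static_cast<float>(I);
  return Lanes;
}

// Usage: linear(x:2) with x = 1 and four lanes gives 1, 3, 5, 7.
inline void exampleLinear() {
  const std::vector<float> L = expandLinear(1.0f, 2.0f, 4);
  assert(L[0] == 1.0f && L[1] == 3.0f && L[2] == 5.0f && L[3] == 7.0f);
}
```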
>### Step 2
>The compiler FE invokes a TLII method in BackendUtils.cpp that populates a
>multimap in the TLII by checking the globals created in the previous step.
>Each global is processed by demangling its [pre/mid/post]fix name and
>generating a mapping in the TLII as follows:
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>    };
>    std::multimap<std::string, VectorFnInfo> VFInfo;
>For the initial example, the multimap in the TLI is populated as follows:
>    "pow" -> [(vector_pow1, <4 x double>(<4 x double>, <4 x double>)),
>              (vector_pow2, <4 x double>(<4 x double>, double))]
>    "foo" -> [(vector_foo1, <8 x float>(<8 x float>, <8 x float>)),
>              (vector_foo2, <8 x float>(<8 x float>, <8 x float> #0))]
>Notes about step 2:
>Given the fact that the external globals that the FE has generated are
>removed _before_ the vectorizer kicks in, I am not sure if the "attribute
>#0" needed for one of the parameters is still present at this point. If
>not, I think that in this case we could enrich the "VectorFnInfo" as
>follows:
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>       std::map<unsigned, llvm::Attribute> Attrs;
>    };
>The field "Attrs" maps the position of each parameter to the
>corresponding llvm::Attribute present in the global variable.
>I have added this note for the sake of completeness. I *think* that we
>won't be needing this additional Attrs field: I have already shown in the
>llvm patch I submitted that the function type "survives" after the global
>gets removed, so I don't see why the parameter attribute shouldn't
>survive too (famous last words?).
>### Step 3
>This step happens in the LoopVectorizer. The InnerLoopVectorizer queries
>the TargetLibraryInfo looking for a vectorized version of the function by
>scalar name and function signature with the following method:
>    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName,
>                                              FunctionType *FTy);
>This is done in a way similar to what my current llvm patch does: the
>loop vectorizer makes up the function signature it needs and looks for it
>in the TLI. If a match is found, vectorization is possible. Right now the
>compiler is not aware of uniform/linear function attributes, but it still
>can refer to them in a target agnostic way, by using scalar signatures
>for the uniform ones and using llvm::Attributes for the linear ones.
>Notice that the vector name here is not used at all, which is good as any
>architecture can come up with its own name mangling for vector
>functions, without breaking the ability of the vectorizer to vectorize
>the same code with the new name mangling.
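The lookup described in Step 3 can be sketched without LLVM as follows. This is an illustration only: the signature is modelled as a plain string rather than an `llvm::FunctionType *`, the table is passed explicitly instead of living in the TLII, and `VFTable` and `makeExampleTable` are hypothetical names. The point it shows is that the match uses only the scalar name and the requested signature and never inspects the vector name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for the VectorFnInfo of the proposal. Assumption of this
// sketch: the signature is a string instead of an llvm::FunctionType*.
struct VectorFnInfo {
  std::string Name;
  std::string Signature;
};

using VFTable = std::multimap<std::string, VectorFnInfo>;

// Hypothetical counterpart of isFunctionVectorizable: true if some entry
// registered for ScalarName has exactly the requested signature.
bool isFunctionVectorizable(const VFTable &Table,
                            const std::string &ScalarName,
                            const std::string &Signature) {
  const auto Range = Table.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    if (It->second.Signature == Signature)
      return true;
  return false;
}

// The "pow" entries from the example in Step 2.
VFTable makeExampleTable() {
  VFTable Table;
  Table.emplace("pow", VectorFnInfo{"vector_pow1",
                                    "<4 x double>(<4 x double>, <4 x double>)"});
  Table.emplace("pow", VectorFnInfo{"vector_pow2",
                                    "<4 x double>(<4 x double>, double)"});
  return Table;
}

// Usage: the uniform(y) variant is found by its signature alone.
inline void exampleLookup() {
  const VFTable Table = makeExampleTable();
  assert(isFunctionVectorizable(Table, "pow",
                                "<4 x double>(<4 x double>, double)"));
}
```

Renaming `vector_pow1` or `vector_pow2` in the table would not change the result of any query, which is the independence from name mangling argued for above.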
>## External libraries vs user provided code
>The example with "pow" and "foo" I have provided before shows a function
>declaration and a function definition. Although the TLII mechanism I have
>described seems to be valid only for the former case, I think that it is
>valid also for the latter.  In fact, in case of a function definition,
>the compiler would have to generate also the body of the vector function,
>but that external global variable could still be used to inform the TLII
>of such function. The fact that the vector function needed by the
>vectorizer is in some module instead of in an external library doesn't
>seem to make much difference at compile time to me.
># Some final notes (call for ideas!)
>There is one level of target dependence that I still have to sort out,
>and for this I need input from the community and in particular from the
>Intel folks.
>I will start with this example:
>    #pragma omp declare simd
>    float foo(float x);
>In case of NEON, this would generate 2 globals, one for vectors holding 2
>floats, and one for vectors holding 4 floats, corresponding to NEON 64-bit
>and 128-bit respectively. This means that the vectorizer has a unique
>function it can choose from the list the TLI provides.
>This is not the same on Intel, for example when this code generates
>vector names for AVX and AVX2. The register width of these architecture
>extensions is the same, so all the TLI has is a mapping between the scalar
>name and two (vector_name, function_type) pairs that differ only in
>the vector_name string.
>This breaks the target independence of the vectorizer, as it would
>require it to parse the vector_name to be able to choose between the AVX
>or the AVX2 implementation.
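The ambiguity can be reproduced with a small sketch. It assumes, for illustration only, that the AVX and AVX2 variants of `foo` are registered under the x86 vector ABI names `_ZGVcN8v_foo` and `_ZGVdN8v_foo` and that signatures are plain strings. `VariantEntry` and `matchingVariants` are hypothetical names. A query by scalar name and signature then returns two candidates, and nothing except the name tells them apart.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical table entry: vector name plus signature (as a string).
struct VariantEntry {
  std::string VectorName;
  std::string Signature;
};

using VariantTable = std::multimap<std::string, VariantEntry>;

// Collect the names of every variant of ScalarName whose signature
// matches. More than one result means the signature alone cannot pick
// a single implementation.
std::vector<std::string> matchingVariants(const VariantTable &Table,
                                          const std::string &ScalarName,
                                          const std::string &Signature) {
  std::vector<std::string> Names;
  const auto Range = Table.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    if (It->second.Signature == Signature)
      Names.push_back(It->second.VectorName);
  return Names;
}

// Usage: AVX and AVX2 both provide an 8 x float variant of foo.
inline void exampleAmbiguity() {
  VariantTable Table;
  Table.emplace("foo",
                VariantEntry{"_ZGVcN8v_foo", "<8 x float>(<8 x float>)"});
  Table.emplace("foo",
                VariantEntry{"_ZGVdN8v_foo", "<8 x float>(<8 x float>)"});
  assert(matchingVariants(Table, "foo", "<8 x float>(<8 x float>)").size() ==
         2);
}
```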
>Now, to make this work one would have to encode the SSE/SSE2/AVX/AVX2
>information in the VectorFnInfo structure. Does anybody have an idea on
>how best to do it? For the sake of keeping the vectorizer target
>independent, I would like to avoid encoding this piece of information in
>the VectorFnInfo struct. I have seen that in your code you are generating
>SSE/AVX/AVX2/AVX512 vector functions, how do you plan to choose between
>them in the vectorizer? I could not find how you planned to solve this
>problem in your proposal, or have I just missed it?
>Is there a way to do this in the TLII? The function type of the vector
>function could use the "target-feature" attribute of function
>definitions, but how could the vectorizer decide which one to use?
>Anyway, that's it. Your feedback will be much appreciated.
>From: Tian, Xinmin <xinmin.tian at intel.com>
>Sent: 30 November 2016 17:16:12
>To: Francesco Petrogalli; llvm-dev at lists.llvm.org
>Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>Hi Francesco,
>Good to know you are working on the support for this feature. I assume
>you know the RFC below.  The VectorABI mangling we proposed was approved
>by the C++ Clang FE name mangling owner, David M from Google, and the
>Clang FE support was committed to the main trunk by Alexey.
>"Proposal for function vectorization and loop vectorization with function
>calls", March 2, 2016. Intel Corp.
>Matt submitted a patch to generate vector variants for function
>definitions, not just function declarations. You may want to take a look.
>Ayal's RFC will also be needed to support vectorization of function
>bodies in general.
>I agree, we should have an option -fopenmp-simd to enable SIMD only;
>both GCC and ICC have similar options.
>I would suggest we sync up on this work, so we don't duplicate the
>-----Original Message-----
>From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>Francesco Petrogalli via llvm-dev
>Sent: Wednesday, November 30, 2016 7:11 AM
>To: llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>
>Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>Dear all,
>I have just created a couple of differential reviews to enable the
>vectorisation of loops that have function calls to routines marked with
>"#pragma omp declare simd".
>They can be (re)viewed here:
>* https://reviews.llvm.org/D27249
>* https://reviews.llvm.org/D27250
>The current implementation allows the loop vectorizer to generate vector
>code for a source file such as:
>  #pragma omp declare simd
>  double f(double x);
>  void aaa(double *x, double *y, int N) {
>    for (int i = 0; i < N; ++i) {
>      x[i] = f(y[i]);
>    }
>  }
>by invoking clang with arguments:
>  $> clang -fopenmp -c -O3 file.c [...]
>Such functionality should provide a nice interface for vector library
>developers, who can use it to inform the loop vectorizer of the
>availability of an external library with the vector implementation of the
>scalar functions in the loops. For this, all that is needed is to mark
>the function declaration in the header file of the library with
>"#pragma omp declare simd" and to generate the associated symbols in the
>object file of the library according to the name scheme of the vector ABI
>(see notes below).
>I am interested in any feedback/suggestion/review the community might
>have regarding this behaviour.
>Below you find a description of the implementation and some notes.
>The functionality is implemented as follows:
>1. Clang CodeGen generates a set of global external variables for each of
>the function declarations marked with the OpenMP pragma. Each such
>global is named according to a mangling that is generated by
>llvm::TargetLibraryInfoImpl (TLII), and holds the vector signature of the
>associated vector function. (See examples in the tests of the clang patch.
>Each scalar function can generate multiple vector functions depending on
>the clauses of the declare simd directives.)
>2. When clang creates the TLII, it processes the llvm::Module and finds
>out which of the globals of the module have the correct mangling and type,
>so that they can be added to the TLII as a list of vector functions that
>can be associated with the original scalar one.
>3. The LoopVectorizer looks for the available vector functions through
>the TLII not by scalar name and vectorisation factor but by scalar name
>and vector function signature, thus enabling the vectorizer to
>distinguish a "vector vpow1(vector x, vector y)" from a "vector
>vpow2(vector x, scalar y)". (The second one corresponds to a "declare
>simd uniform(y)" for a "scalar pow(scalar x, scalar y)" declaration.)
>(Notice that the changes in the loop vectorizer are minimal.)
>Notes:
>1. To enable SIMD only for OpenMP, leaving all the multithread/target
>behaviour behind, we should enable this also with a new option:
>2. The AArch64 vector ABI in the code is essentially the same as the
>Intel one (apart from the prefix and the masking argument), and it is
>based on the clauses associated with "declare simd" in OpenMP 4.0. For
>OpenMP 4.5, the parameters section of the mangled name should be updated.
>This update will not change the vectorizer behaviour as all the
>vectorizer needs to detect a vectorizable function is the original scalar
>name and a compatible vector function signature. Of course, any
>changes/updates in the ABI will have to be reflected in the symbols of
>the binary file of the library.
>3. While this works only for function declarations, the same
>functionality can be used when (if) clang implements the declare simd
>OpenMP pragma for function definitions.
>4. I have enabled this for any loop that invokes the scalar function
>call, not just for those annotated with "#pragma omp for simd". I don't
>have any preference here, but at the same time I don't see any reason why
>this shouldn't be enabled by default for non annotated loops. Let me know
>if you disagree, I'd happily change the functionality if there are sound
>reasons behind that.