[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Tian, Xinmin via llvm-dev llvm-dev at lists.llvm.org
Mon Dec 12 09:32:57 PST 2016


Francesco, thanks for updating the patch.

GCC used b, c, d; you used Q for ARM 128-bit, which seems fine. For D (64-bit), do you have to use it, or can you find another letter to avoid a future conflict/confusion if they ever need D vs. d?  Is the GCC community OK with these letters for ARM compatibility?  

Thanks,
Xinmin

-----Original Message-----
From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com] 
Sent: Monday, December 12, 2016 5:45 AM
To: Tian, Xinmin <xinmin.tian at intel.com>; Odeh, Saher <saher.odeh at intel.com>; llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; a.bataev at hotmail.com; David Majnemer <david.majnemer at gmail.com>; Renato Golin <renato.golin at linaro.org>
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Hi Xinmin,

I have updated the clang patch using the standard name mangling you suggested - I was not fully aware of the C++ mangling convention “_ZGV”.

I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON vector symbols look as follows:

_ZGVQN2v__Z1fd
_ZGVDN2v__Z1ff
_ZGVQN4v__Z1ff


Here “Q” means NEON 128-bit and “D” means NEON 64-bit.
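
For reference, here is a minimal sketch (assuming the mangling above and the Itanium C++ mangling of the scalar names; it is not taken from the patch) of declarations that would yield those symbols:

    // Minimal sketch: "declare simd" on scalar overloads whose Itanium
    // manglings are _Z1fd and _Z1ff.
    #pragma omp declare simd
    double f(double x);   // 128-bit NEON only (2 x double): _ZGVQN2v__Z1fd

    #pragma omp declare simd
    float f(float x);     // 64-bit (2 x float) and 128-bit (4 x float) NEON:
                          //   _ZGVDN2v__Z1ff and _ZGVQN4v__Z1ff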

Please note that although I have changed the name mangling in clang [1], there has been no need to update the related llvm patch [2], as the vectorisation process is _independent_ of the name mangling.

Regards,

Francesco

[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249. The only update was a bug fix in the copy constructor of the TLII and in the return value of the TLII::mangle() method. None of the underlying scalar/vector function matching algorithms has been touched.


On 08/12/2016 18:11, "Tian, Xinmin" <xinmin.tian at intel.com> wrote:

>Hi Francesco, a bit more information.  GCC's veclib is implemented
>based on the GCC VectorABI for declare simd as well.
>
>For name mangling, we have to follow certain rules of C/C++ (e.g. the
>prefix needs to be _ZGV ....).  David Majnemer is the owner and
>stakeholder for approval of name mangling in Clang and LLVM.  Also, we
>need to pay attention to GCC compatibility.  I would suggest you look
>into how the GCC VectorABI can be extended to support your architecture.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: Odeh, Saher
>Sent: Thursday, December 8, 2016 3:49 AM
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org; 
>Francesco.Petrogalli at arm.com
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel 
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; 
>a.bataev at hotmail.com
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the 
>LoopVectorizer
>
>Hi Francesco,
>
>As you stated in the RFC, when vectorizing a scalar function (e.g. when
>using omp declare simd), one needs to attach attributes to the
>resulting vectorized function.
>These attributes describe a) the behavior of the function, e.g.
>maskable or not, and b) the kind of the parameters, e.g. scalar or
>linear or any other option.
>
>As this list is extensive, it is only logical to reuse the existing
>ICC and GCC VectorABI infrastructure, which already covers all of
>these options as stated in Xinmin's RFC
>[http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
>Moreover, when considering other compilers such as GCC, I see that
>the resulting assembly does follow this exact scheme.
>So if we wish to link different parts of a program built with clang and
>GCC, we will need to adhere to the same name mangling/ABI. Please see
>the result below, obtained by compiling an omp declare simd function with GCC.
>Lastly, please note that two out of the three components of the
>implementation have already been committed or submitted, and both
>adhere to the name mangling proposed in Xinmin's RFC:
>A) Committed - the FE portion by Alexey [https://reviews.llvm.org/rL264853];
>it generates mangled names in the manner described by Xinmin's RFC (see below).
>B) Submitted - the callee side by Matt [https://reviews.llvm.org/D22792];
>it uses these mangled names.
>C) The caller side, which is covered by this patch.
>
>In order to reduce the implementation effort and avoid possible issues,
>I believe it is best to follow the name mangling proposed in Xinmin's
>RFC. What do you think?
>
>GCC Example
>----------------
>Compiler version: GCC 6.1.0
>Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s
>
>omp.c
>#include <omp.h>
>
>#pragma omp declare simd
>int dowork(int* a, int idx)
>{
> return a[idx] * a[idx]*7;
>}
>
>less omp.s | grep @function
>        .type   dowork, @function
>        .type   _ZGVbN4vv_dowork, @function
>        .type   _ZGVbM4vv_dowork, @function
>        .type   _ZGVcN4vv_dowork, @function
>        .type   _ZGVcM4vv_dowork, @function
>        .type   _ZGVdN8vv_dowork, @function
>        .type   _ZGVdM8vv_dowork, @function
>        .type   _ZGVeN16vv_dowork, @function
>        .type   _ZGVeM16vv_dowork, @function
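>
>(Reading the first of these symbols as a worked example of the mangling
>described in Xinmin's RFC:
>
>    _ZGVbN4vv_dowork
>      _ZGV   : prefix
>      b      : ISA class ('b' = SSE, 'c' = AVX, 'd' = AVX2, 'e' = AVX512)
>      N      : non-masked variant ('M' = masked)
>      4      : simdlen, the number of lanes
>      vv     : both parameters are passed as vectors
>      dowork : the original scalar name)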
>
>Clang on FE using Alexey's patch
>---------------------------------------
>Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all 
>>& out
>
>#pragma omp declare simd
>extern int dowork(int* a, int idx)
>{
>  return a[idx]*7;
>}
>
>
>int main() {
>  dowork(0,1);
>}
>
>attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork" "_ZGVbN4vv_dowork"
>"_ZGVcM8vv_dowork" "_ZGVcN8vv_dowork" "_ZGVdM8vv_dowork"
>"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork" "_ZGVeN16vv_dowork"
>"correctly-rounded-divide-sqrt-fp-math"="false"
>"disable-tail-calls"="false" "less-precise-fpmad"="false"
>"no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
>"no-infs-fp-math"="false" "no-jump-tables"="false"
>"no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
>"no-trapping-math"="false" "stack-protector-buffer-size"="8"
>"target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
>"unsafe-fp-math"="false" "use-soft-float"="false" }
>
>
>Thanks Saher
>
>-----Original Message-----
>From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
>Sent: Tuesday, December 06, 2016 17:22
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel 
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; 
>a.bataev at hotmail.com
>Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the 
>LoopVectorizer
>
>Hi Xinmin,
>
>Thank you for your email.
>
>I have been catching up with the content of your proposal, and I have 
>some questions/remarks below that I'd like to discuss with you - see 
>the final section in the proposal.
>
>I have specifically added Alexey B. to the mail so we can move our 
>conversation from phabricator to the mailing list.
>
>Before we start, I just want to mention that the initial idea of using
>llvm::FunctionType for vector function generation and matching was
>proposed by a colleague, Paul Walker, when we first tried out
>supporting this on AArch64 in an internal version of llvm. I also
>received input from Amara Emerson.
>
>In our case we had a slightly different problem to solve: we wanted to
>support in the vectorizer a rich set of vector math routines provided
>by an external library. We managed to do this by adding the pragma to
>the (scalar) function declarations in the header file provided with the
>library and, as shown by the patches I have submitted, by generating
>vector function signatures that the vectorizer can look up in the
>TargetLibraryInfo.
>
>Here is an updated version of the proposal. Please let me know what you 
>think, and if you have any solution we could use for the final section.
>
># RFC for "pragma omp declare simd"
>
>High-level components:
>
>A) Global variable generator (clang FE)
>B) Parameter descriptors (as new enumerations in llvm::Attribute)
>C) TLII methods and fields for the multimap (llvm middle-end)
>
>## Workflow
>
>Example user input, with a declaration and definition:
>
>    #pragma omp declare simd
>    #pragma omp declare simd uniform(y)
>    extern double pow(double x, double y);
>
>    #pragma omp declare simd
>    #pragma omp declare simd linear(x:2)
>    float foo(float x) {....}
>
>    /// code using both functions
>
>### Step 1
>
>
>The compiler FE processes this declaration and definition and generates
>a list of globals as follows:
>
>    @prefix_vector_pow1_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x double>,
>                                                          <4 x double>)
>    @prefix_vector_pow2_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x double>,
>                                                          double)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float>,
>                                                         <8 x float>)
>    @prefix_vector_foo2_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float>,
>                                                         <8 x float> #0)
>    ...
>    attribute #0  = {linear = 2}
>
>
>Notes about step 1:
>
>1. The mapping scalar name <-> vector name is in the
>   prefix/midfix/postfix mangled name of the global variable.
>2. The example shows only one set of possible vector functions, for a
>   sizeof(<4 x double>) vector extension. If multiple vector extensions
>   live in the same target (e.g. NEON 64-bit and NEON 128-bit, or SSE
>   and AVX512), the front end takes care of generating each of the
>   associated functions (as it is done now).
>3. Vector function parameters are rendered using the same
>   Characteristic Data Type (CDT) rule already in the compiler FE.
>4. Uniform parameters are rendered with the original scalar type.
>5. Linear parameters are rendered with vectors using the same
>   CDT-generated vector length, and decorated with proper
>   attributes. I think we could extend the llvm::Attribute enumeration
>   by adding the following:
>   - linear : numeric, specify the step
>   - linear_var : numeric, specify the position of the uniform
>     variable holding the step
>   - linear_uval[_var] : numeric as before, but for the "uval" modifier
>     (both constant step and variable step)
>   - linear_val[_var] : numeric, as before, but for the "val" modifier
>   - linear_ref[_var] : numeric, for the "ref" modifier.
>
>   For example, "attribute #0 = {linear = 2}" says that the vector of
>   the associated parameter in the function signature has a linear
>   step of 2.
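>
>   As a concrete sketch of the rendering rules above (the function name
>   and attribute spelling here are purely illustrative), combining a
>   constant-step linear clause with a uniform one:
>
>       #pragma omp declare simd linear(x:2) uniform(n)
>       float baz(float x, int n);
>
>       // would be rendered, for an 8-lane CDT, roughly as:
>       //   @prefix_vector_baz1_midfix_baz_postfix = external global
>       //                              <8 x float>(<8 x float> #0, i32)
>       //   attribute #0 = {linear = 2}  ; step 2 on parameter x
>       // while the uniform parameter n keeps its scalar type i32.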
>
>### Step 2
>
>The compiler FE invokes a TLII method in BackendUtils.cpp that populates
>a multimap in the TLII by inspecting the globals created in the previous step.
>
>Each global is processed, demangling the [pre/mid/post]fix name and
>generating a mapping in the TLII as follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>    };
>    std::multimap<std::string, VectorFnInfo> VFInfo;
>
>
>For the initial example, the multimap in the TLI is populated as follows:
>
>    "pow" -> [(vector_pow1, <4 x double>(<4 x double>, <4 x double>)),
>              (vector_pow2, <4 x double>(<4 x double>, double))]
>
>    "foo" -> [(vector_foo1, <8 x float>(<8 x float>, <8 x float>)),
>              (vector_foo2, <8 x float>(<8 x float>, <8 x float> #0))]
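>
>Just to make the demangling step concrete, here is a rough sketch (the
>prefix/midfix/postfix spellings and the helper name are placeholders,
>not the actual patch):
>
>    // Walk the module's globals, demangle the [pre/mid/post]fix name
>    // and record the (vector name, signature) pair under the scalar name.
>    #include "llvm/IR/DerivedTypes.h"
>    #include "llvm/IR/Module.h"
>    #include <map>
>    #include <string>
>
>    struct VectorFnInfo {
>      std::string Name;              // vector function name
>      llvm::FunctionType *Signature; // vector function type
>    };
>
>    static void populateVFInfo(const llvm::Module &M,
>                               std::multimap<std::string, VectorFnInfo> &VFInfo) {
>      for (const llvm::GlobalVariable &GV : M.globals()) {
>        llvm::StringRef N = GV.getName();
>        if (!N.consume_front("prefix_") || !N.consume_back("_postfix"))
>          continue; // not one of the globals generated in Step 1
>        // N is now "<vector_name>_midfix_<scalar_name>".
>        auto Parts = N.split("_midfix_");
>        if (auto *FTy = llvm::dyn_cast<llvm::FunctionType>(GV.getValueType()))
>          VFInfo.insert({Parts.second.str(),
>                         VectorFnInfo{Parts.first.str(), FTy}});
>      }
>    }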
>
>Notes about step 2:
>
>Given that the external globals the FE has generated are removed
>_before_ the vectorizer kicks in, I am not sure whether the
>"attribute #0" needed for one of the parameters is still present at
>this point. If not, I think that in this case we could enrich
>"VectorFnInfo" as follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>       std::map<unsigned, llvm::Attribute> Attrs;
>    };
>
>The field "Attrs" maps the position of a parameter to the
>corresponding llvm::Attribute present in the global variable.
>
>I have added this note for the sake of completeness. I *think* that we
>won't be needing this additional Attrs field: I have already shown in
>the llvm patch I submitted that the function type "survives" after the
>global gets removed, so I don't see why the parameter attribute
>shouldn't survive too (famous last words?).
>
>### Step 3
>
>This step happens in the LoopVectorizer. The InnerLoopVectorizer 
>queries the TargetLibraryInfo looking for a vectorized version of the 
>function by scalar name and function signature with the following method:
>
>    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName,
>                                              FunctionType *FTy);
>
>This is done in a way similar to what my current llvm patch does: the
>loop vectorizer makes up the function signature it needs and looks for
>it in the TLI. If a match is found, vectorization is possible. Right
>now the compiler is not aware of uniform/linear function attributes,
>but it can still refer to them in a target-agnostic way, by keeping
>scalar types in the signature for the uniform ones and using
>llvm::Attributes for the linear ones.
>
>Notice that the vector name here is not used at all, which is good, as
>any architecture can come up with its own name mangling for vector
>functions without breaking the ability of the vectorizer to vectorize
>the same code with the new name mangling.
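>
>To illustrate the query, here is a minimal sketch that assumes the
>proposed isFunctionVectorizable(name, signature) overload above (it is
>not what the TLI offers today, and the helper name is hypothetical):
>
>    // Widen the scalar call's return and argument types by VF and look
>    // the resulting signature up under the original scalar name.
>    #include "llvm/ADT/SmallVector.h"
>    #include "llvm/Analysis/TargetLibraryInfo.h"
>    #include "llvm/IR/DerivedTypes.h"
>    #include "llvm/IR/Instructions.h"
>
>    static bool hasVectorVariant(const llvm::CallInst &CI, unsigned VF,
>                                 const llvm::TargetLibraryInfo &TLI) {
>      llvm::Function *F = CI.getCalledFunction();
>      if (!F || CI.getType()->isVoidTy())
>        return false;
>      llvm::SmallVector<llvm::Type *, 4> Params;
>      for (llvm::Value *Arg : CI.arg_operands())
>        Params.push_back(llvm::VectorType::get(Arg->getType(), VF));
>      llvm::FunctionType *VecFTy = llvm::FunctionType::get(
>          llvm::VectorType::get(CI.getType(), VF), Params, /*isVarArg=*/false);
>      // Proposed query: original scalar name plus the vector signature.
>      return TLI.isFunctionVectorizable(F->getName().str(), VecFTy);
>    }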
>
>## External libraries vs user provided code
>
>The example with "pow" and "foo" I have provided before shows a
>function declaration and a function definition. Although the TLII
>mechanism I have described seems to be valid only for the former case,
>I think it is also valid for the latter.  In fact, in the case of a
>function definition, the compiler would also have to generate the body
>of the vector function, but the external global variable could still be
>used to inform the TLII of such a function. The fact that the vector
>function needed by the vectorizer is in some module instead of in an
>external library doesn't seem to make all that much difference at compile time to me.
>
># Some final notes (call for ideas!)
>
>There is one level of target dependence that I still have to sort out, 
>and for this I need input from the community and in particular from the 
>Intel folks.
>
>I will start with this example:
>
>    #pragma omp declare simd
>    float foo(float x);
>
>In the case of NEON, this would generate 2 globals, one for vectors
>holding 2 floats and one for vectors holding 4 floats, corresponding to
>NEON 64-bit and 128-bit respectively. This means that the vectorizer
>has a unique function it can choose from the list the TLI provides.
>
>This is not the same on Intel, for example when this code generates
>vector names for AVX and AVX2. The register widths for these
>architecture extensions are the same, so all the TLI has is a mapping
>between the scalar name and (vector_name, function_type) pairs whose
>two elements differ only in the vector_name string.
>
>This breaks the target independence of the vectorizer, as it would
>require it to parse the vector_name to be able to choose between the
>AVX and the AVX2 implementations.
>
>Now, to make this work one would have to encode the SSE/SSE2/AVX/AVX2
>information in the VectorFnInfo structure. Does anybody have an idea of
>how best to do it? For the sake of keeping the vectorizer target
>independent, I would like to avoid encoding this piece of information
>in the VectorFnInfo struct. I have seen that in your code you are
>generating SSE/AVX/AVX2/AVX512 vector functions; how do you plan to
>choose between them in the vectorizer? I could not find how you planned
>to solve this problem in your proposal, or have I just missed it?
>
>Is there a way to do this in the TLII? The function type of the vector
>function could use the "target-feature" attribute of function
>definitions, but how could the vectorizer decide which one to use?
>
>Anyway, that's it. Your feedback will be much appreciated.
>
>Cheers,
>Francesco
>
>________________________________________
>From: Tian, Xinmin <xinmin.tian at intel.com>
>Sent: 30 November 2016 17:16:12
>To: Francesco Petrogalli; llvm-dev at lists.llvm.org
>Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the 
>LoopVectorizer
>
>Hi Francesco,
>
>Good to know you are working on the support for this feature. I assume
>you know the RFC below.  The VectorABI mangling we proposed was
>approved by the C++ Clang FE name mangling owner, David M. from Google,
>and the Clang FE support was committed to the main trunk by Alexey.
>
>"Proposal for function vectorization and loop vectorization with 
>function calls", March 2, 2016. Intel Corp.
>http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
>
>Matt submitted a patch to generate vector variants for function
>definitions, not just function declarations. You may want to take a look.
> Ayal's RFC will also be needed to support vectorization of function
>bodies in general.
>
>I agree, we should have an option -fopenmp-simd to enable SIMD only;
>both GCC and ICC have similar options.
>
>I would suggest we sync up on this work, so we don't duplicate
>the effort.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of 
>Francesco Petrogalli via llvm-dev
>Sent: Wednesday, November 30, 2016 7:11 AM
>To: llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>
>Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the 
>LoopVectorizer
>
>Dear all,
>
>I have just created a couple of differential reviews to enable the 
>vectorisation of loops that have function calls to routines marked with 
>"#pragma omp declare simd".
>
>They can be (re)viewed here:
>
>* https://reviews.llvm.org/D27249
>
>* https://reviews.llvm.org/D27250
>
>The current implementation allows the loop vectorizer to generate
>vector code for a source file such as:
>
>  #pragma omp declare simd
>  double f(double x);
>
>  void aaa(double *x, double *y, int N) {
>    for (int i = 0; i < N; ++i) {
>      x[i] = f(y[i]);
>    }
>  }
>
>
>by invoking clang with arguments:
>
>  $> clang -fopenmp -c -O3 file.c [...]
>
>
>Such functionality should provide a nice interface for vector library
>developers: it can be used to inform the loop vectorizer of the
>availability of an external library providing vector implementations of
>the scalar functions used in the loops. For this, all that is needed is
>to mark the function declarations in the header file of the library
>with "#pragma omp declare simd" and to generate the associated symbols
>in the object file of the library according to the naming scheme of the
>vector ABI (see notes below).
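>
>As a small illustration (library and function names are hypothetical)
>of what this looks like from a library author's point of view:
>
>    // vecmath.h, shipped with the library:
>    #pragma omp declare simd
>    double vm_exp(double x);
>
>    // The library's binary then provides, alongside vm_exp, the vector
>    // symbols mangled according to the vector ABI (e.g. the 2-lane and
>    // 4-lane variants for the target's vector extensions), which the
>    // vectorizer can call when vectorizing loops that use vm_exp.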
>
>I am interested in any feedback/suggestion/review the community might 
>have regarding this behaviour.
>
>Below you find a description of the implementation and some notes.
>
>Thanks,
>
>Francesco
>
>-----------
>
>The functionality is implemented as follows:
>
>1. Clang CodeGen generates a set of global external variables for each
>of the function declarations marked with the OpenMP pragma. Each such
>global is named according to a mangling generated by
>llvm::TargetLibraryInfoImpl (TLII), and holds the vector signature of
>the associated vector function. (See the examples in the tests of the
>clang patch. Each scalar function can generate multiple vector
>functions, depending on the clauses of the declare simd directives.)
>2. When clang creates the TLII, it processes the llvm::Module and finds
>out which of the globals of the module have the correct mangling and
>type, so that they can be added to the TLII as a list of vector
>functions that can be associated with the original scalar one.
>3. The LoopVectorizer looks for the available vector functions through
>the TLII not by scalar name and vectorisation factor, but by scalar
>name and vector function signature, thus enabling the vectorizer to
>distinguish a "vector vpow1(vector x, vector y)" from a "vector
>vpow2(vector x, scalar y)". (The second one corresponds to a "declare
>simd uniform(y)" on a "scalar pow(scalar x, scalar y)" declaration.)
>(Notice that the changes in the loop vectorizer are minimal.)
>
>
>Notes:
>
>1. To enable SIMD only for OpenMP, leaving all the multithread/target
>behaviour behind, we should also enable this with a new option:
>-fopenmp-simd
>2. The AArch64 vector ABI in the code is essentially the same as for 
>the Intel one (apart from the prefix and the masking argument), and it 
>is based on the clauses associated with "declare simd" in OpenMP 4.0. For 
>OpenMP 4.5, the parameters section of the mangled name should be updated.
>This update will not change the vectorizer behaviour, as all the 
>vectorizer needs in order to detect a vectorizable function is the original 
>scalar name and a compatible vector function signature. Of course, any 
>changes/updates in the ABI will have to be reflected in the symbols of 
>the binary file of the library.
>3. While this currently works only for function declarations, the same 
>functionality can be used when (if) clang implements the declare 
>simd OpenMP pragma for function definitions.
>4. I have enabled this for any loop that invokes the scalar function, 
>not just for those annotated with "#pragma omp for simd". I don't 
>have any preference here, but at the same time I don't see any reason 
>why this shouldn't be enabled by default for non-annotated loops. Let 
>me know if you disagree; I'd happily change the functionality if there 
>are sound reasons behind that.
>
>_______________________________________________
>LLVM Developers mailing list
>llvm-dev at lists.llvm.org
>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


