[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer
Tian, Xinmin via llvm-dev
llvm-dev at lists.llvm.org
Mon Dec 12 09:32:57 PST 2016
Francesco, thanks for updating the patch.
GCC used b, c, and d; you used Q for ARM 128-bit, which seems fine. For D (64-bit), do you have to use it, or can you find another letter to avoid future conflict/confusion if they ever need to distinguish D from d? Is the GCC community OK with these letters for ARM compatibility?
Thanks,
Xinmin
-----Original Message-----
From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
Sent: Monday, December 12, 2016 5:45 AM
To: Tian, Xinmin <xinmin.tian at intel.com>; Odeh, Saher <saher.odeh at intel.com>; llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; a.bataev at hotmail.com; David Majnemer <david.majnemer at gmail.com>; Renato Golin <renato.golin at linaro.org>
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer
Hi Xinmin,
I have updated the clang patch using the standard name mangling you suggested - I was not fully aware of the C++ mangling convention “_ZGV”.
I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON vector symbols look as follows:
_ZGVQN2v__Z1fd
_ZGVDN2v__Z1ff
_ZGVQN4v__Z1ff
Here “Q” means NEON 128-bit and “D” means NEON 64-bit.
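For reference, the remaining fields follow the VectorABI scheme: “N” means the variant is not masked, the digit is the number of lanes, each “v” marks a parameter passed as a vector, and the scalar symbol follows the trailing underscore. So _ZGVQN2v__Z1fd is the unmasked, 2-lane, 128-bit variant of “_Z1fd” (i.e. double f(double)), and _ZGVQN4v__Z1ff is the unmasked, 4-lane variant of “_Z1ff” (i.e. float f(float)).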
Please notice that although I have changed the name mangling in clang [1], there has been no need to update the related llvm patch [2], as the vectorisation process is _independent_ of the name mangling.
Regards,
Francesco
[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249. The only update was a bug fix in the copy constructor of the TLII and in the return value of the TLII::mangle() method. None of the underlying scalar/vector function matching algorithms have been touched.
On 08/12/2016 18:11, "Tian, Xinmin" <xinmin.tian at intel.com> wrote:
>Hi Francesco, a bit more information. The GCC veclib is also implemented
>based on the GCC VectorABI for declare simd.
>
>For name mangling, we have to follow certain rules of C/C++ (e.g. the
>prefix needs to be _ZGV ...). David Majnemer is the owner and
>stakeholder for approval for Clang and LLVM. Also, we need to pay
>attention to GCC compatibility. I would suggest you look into how the
>GCC VectorABI can be extended to support your architecture.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: Odeh, Saher
>Sent: Thursday, December 8, 2016 3:49 AM
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org;
>Francesco.Petrogalli at arm.com
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
>a.bataev at hotmail.com
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Hi Francesco,
>
>As you stated in the RFC, when vectorizing a scalar function (e.g. when
>using omp declare simd), one needs to attach attributes to the
>resulting vectorized function.
>These attributes describe a) the behavior of the function, e.g. whether
>it is maskable or not, and b) the kind of each parameter, e.g. scalar or
>linear or any other option.
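>
>For example (an illustrative declaration, not taken from any of the
>patches), a directive such as
>
>  #pragma omp declare simd uniform(a) linear(idx:1) notinbranch
>  int dowork(int* a, int idx);
>
>tells the compiler that the vector variant needs no mask (notinbranch),
>that "a" stays scalar across lanes (uniform), and that "idx" advances by
>a step of 1 per lane (linear).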
>
>As this list is extensive, it is only logical to reuse the existing
>infrastructure of the ICC and GCC VectorABI, which already covers all of
>these options, as stated in Xinmin's RFC
>[http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
>Moreover, when looking at other compilers such as GCC, I see that the
>generated assembly does follow this exact scheme.
>So if we wish to link different parts of a program built with clang and
>GCC, we will need to adhere to the same name mangling/ABI. Please see
>below the result of compiling an omp declare simd function with GCC.
>Lastly, please note that two out of the three components of the
>implementation have already been committed or submitted, and both adhere
>to the name mangling proposed in Xinmin's RFC:
>A) committed - the FE portion by Alexey [https://reviews.llvm.org/rL264853];
>   it generates mangled names in the manner described by Xinmin's RFC (see below).
>B) submitted - the callee side by Matt [https://reviews.llvm.org/D22792];
>   it uses these mangled names.
>C) the caller side, which is covered by this patch.
>
>To reduce the implementation effort and possible issues, I believe it is
>best to follow the name mangling proposed in Xinmin's RFC. What do you
>think?
>
>GCC Example
>----------------
>Compiler version: GCC 6.1.0
>Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s
>
>omp.c
>#include <omp.h>
>
>#pragma omp declare simd
>int dowork(int* a, int idx)
>{
> return a[idx] * a[idx]*7;
>}
>
>less omp.s | grep @function
> .type dowork, @function
> .type _ZGVbN4vv_dowork, @function
> .type _ZGVbM4vv_dowork, @function
> .type _ZGVcN4vv_dowork, @function
> .type _ZGVcM4vv_dowork, @function
> .type _ZGVdN8vv_dowork, @function
> .type _ZGVdM8vv_dowork, @function
> .type _ZGVeN16vv_dowork, @function
> .type _ZGVeM16vv_dowork, @function
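>
>(Decoding these names according to the VectorABI in Xinmin's RFC: the
>prefix is _ZGV, the next letter selects the x86 ISA (b = SSE, c = AVX,
>d = AVX2, e = AVX512), N/M means unmasked/masked, the number is the
>vector length, each "v" marks a parameter passed as a vector, and the
>scalar name follows the final underscore. So _ZGVdM8vv_dowork is the
>masked AVX2 variant of dowork, 8 lanes wide, with both parameters
>passed as vectors.)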
>
>Clang FE output using Alexey's patch
>---------------------------------------
>Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all >& out
>
>#pragma omp declare simd
>extern int dowork(int* a, int idx)
>{
> return a[idx]*7;
>}
>
>
>int main() {
> dowork(0,1);
>}
>
>attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork" "_ZGVbN4vv_dowork"
>"_ZGVcM8vv_dowork" "_ZGVcN8vv_dowork" "_ZGVdM8vv_dowork"
>"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork" "_ZGVeN16vv_dowork"
>"correctly-rounded-divide-sqrt-fp-math"="false"
>"disable-tail-calls"="false" "less-precise-fpmad"="false"
>"no-frame-pointer-elim"="true" "no-frame-pointer-elim-non-leaf"
>"no-infs-fp-math"="false" "no-jump-tables"="false"
>"no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
>"no-trapping-math"="false" "stack-protector-buffer-size"="8"
>"target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
>"unsafe-fp-math"="false" "use-soft-float"="false" }
>
>
>Thanks,
>Saher
>
>-----Original Message-----
>From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
>Sent: Tuesday, December 06, 2016 17:22
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
>a.bataev at hotmail.com
>Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Hi Xinmin,
>
>Thank you for your email.
>
>I have been catching up with the content of your proposal, and I have
>some questions/remarks below that I'd like to discuss with you - see
>the final section in the proposal.
>
>I have specifically added Alexey B. to the mail so we can move our
>conversation from phabricator to the mailing list.
>
>Before we start, I just want to mention that the initial idea of using
>llvm::FunctionType for vector function generation and matching was
>proposed by a colleague, Paul Walker, when we first tried to support
>this on AArch64 in an internal version of llvm. I also received some
>input from Amara Emerson.
>
>In our case we had a slightly different problem to solve: we wanted the
>vectorizer to support a rich set of vector math routines provided by an
>external library. We managed to do this by adding the pragma to the
>(scalar) function declarations in the header file provided with the
>library and, as shown by the patches I have submitted, by generating
>vector function signatures that the vectorizer can look up in the
>TargetLibraryInfo.
>
>Here is an updated version of the proposal. Please let me know what you
>think, and if you have any solution we could use for the final section.
>
># RFC for "pragma omp declare simd"
>
>High-level components:
>
>A) Global variable generator (clang FE)
>B) Parameter descriptors (as new enumerations in llvm::Attribute)
>C) TLII methods and fields for the multimap (llvm middle-end)
>
>## Workflow
>
>Example user input, with a declaration and definition:
>
> #pragma omp declare simd
> #pragma omp declare simd uniform(y)
> extern double pow(double x, double y);
>
> #pragma omp declare simd
> #pragma omp declare simd linear(x:2)
> float foo(float x) {....}
>
> /// code using both functions
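>  /// for example (an illustrative use, not part of the patches):
>  void use(double *x, double *y, float *z, int N) {
>    for (int i = 0; i < N; ++i) {
>      x[i] = pow(x[i], y[i]);  /* candidate for a vector pow */
>      z[i] = foo(z[i]);        /* candidate for a vector foo */
>    }
>  }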
>
>### Step 1
>
>
>The compiler FE processes these declarations and definitions and
>generates a list of globals as follows:
>
> @prefix_vector_pow1_midfix_pow_postfix = external global
> <4 x double>(<4 x double>,
> <4 x double>)
> @prefix_vector_pow2_midfix_pow_postfix = external global
> <4 x double>(<4 x double>,
> double)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float>)
>    @prefix_vector_foo2_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x float> #0)
> ...
> attribute #0 = {linear = 2}
>
>
>Notes about step 1:
>
>1. The mapping scalar name <-> vector name is in the
> prefix/midfix/postfix mangled name of the global variable.
>2. The example shows only the set of possible vector functions for a
>   vector extension of sizeof(<4 x double>). If multiple vector
>   extensions live in the same target (e.g. NEON 64-bit and NEON
>   128-bit, or SSE and AVX512), the front end takes care of generating
>   each of the associated functions (as it is done now).
>3. Vector function parameters are rendered using the same
>   Characteristic Data Type (CDT) rules already implemented in the
>   compiler FE.
>4. Uniform parameters are rendered with the original scalar type.
>5. Linear parameters are rendered as vectors using the same
>   CDT-generated vector length, and decorated with proper attributes. I
>   think we could extend the llvm::Attribute enumeration by adding the
>   following:
>   - linear : numeric, specifies the step
>   - linear_var : numeric, specifies the position of the uniform
>     variable holding the step
>   - linear_uval[_var] : numeric, as before, but for the "uval" modifier
>     (with either a constant or a variable step)
>   - linear_val[_var] : numeric, as before, but for the "val" modifier
>   - linear_ref[_var] : numeric, for the "ref" modifier.
>
> For example, "attribute #0 = {linear = 2}" says that the vector of
> the associated parameter in the function signature has a linear
> step of 2.
>
>### Step 2
>
>The compiler FE invokes a TLII method in BackendUtils.cpp that populates
>a multimap in the TLII by checking the globals created in the previous step.
>
>Each global is processed by demangling its [pre/mid/post]fix name and
>generating a mapping in the TLII as follows:
>
> struct VectorFnInfo {
> std::string Name;
> FunctionType *Signature;
> };
>    std::multimap<std::string, VectorFnInfo> VFInfo;
>
>
>For the initial example, the multimap in the TLI is populated as follows:
>
> "pow" -> [(vector_pow1, <4 x double>(<4 x double>, <4 x double>)),
> (vector_pow2, <4 x double>(<4 x double>, double))]
>
>    "foo" -> [(vector_foo1, <8 x float>(<8 x float>)),
>              (vector_foo2, <8 x float>(<8 x float> #0))]
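>
>A minimal sketch of how this population step could look (the
>demangleVectorGlobal helper is hypothetical and stands for the
>[pre/mid/post]fix demangling described above):
>
>    #include "llvm/IR/DerivedTypes.h"
>    #include "llvm/IR/Module.h"
>    #include <map>
>    #include <string>
>    using namespace llvm;
>
>    // Hypothetical helper: splits the mangled global name into the
>    // scalar name and the vector name; returns false if the global
>    // does not follow the mangling scheme.
>    static bool demangleVectorGlobal(StringRef GlobalName,
>                                     std::string &ScalarName,
>                                     std::string &VectorName);
>
>    void populateVFInfo(const Module &M,
>                        std::multimap<std::string, VectorFnInfo> &VFInfo) {
>      for (const GlobalVariable &GV : M.globals()) {
>        std::string ScalarName, VectorName;
>        if (!demangleVectorGlobal(GV.getName(), ScalarName, VectorName))
>          continue;
>        // The global was declared with the vector function type, so its
>        // value type gives the vector signature directly.
>        if (auto *FTy = dyn_cast<FunctionType>(GV.getValueType())) {
>          VectorFnInfo Info;
>          Info.Name = VectorName;
>          Info.Signature = FTy;
>          VFInfo.insert(std::make_pair(ScalarName, Info));
>        }
>      }
>    }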
>
>Notes about step 2:
>
>Given that the external globals the FE has generated are removed
>_before_ the vectorizer kicks in, I am not sure whether the "attribute
>#0" needed for one of the parameters is still present at that point. If
>not, I think we could enrich "VectorFnInfo" as
>follows:
>
> struct VectorFnInfo {
> std::string Name;
> FunctionType *Signature;
>      std::map<unsigned, llvm::Attribute> Attrs;
> };
>
>The field "Attrs" maps the position of the parameter with the
>correspondent llvm::Attribute present in the global variable.
>
>I have added this note for the sake of completeness. I *think* we won't
>be needing this additional Attrs field: I have already shown in the llvm
>patch I submitted that the function type "survives" after the global
>gets removed, so I don't see why the parameter attributes shouldn't
>survive too (famous last words?).
>
>### Step 3
>
>This step happens in the LoopVectorizer. The InnerLoopVectorizer
>queries the TargetLibraryInfo looking for a vectorized version of the
>function by scalar name and function signature with the following method:
>
> TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName,
>FunctionType *FTy);
>
>This is done in a way similar to what my current llvm patch does: the
>loop vectorizer makes up the function signature it needs and looks for
>it in the TLI. If a match is found, vectorization is possible. Right now
>the compiler is not aware of uniform/linear parameter attributes, but it
>can still refer to them in a target-agnostic way, by using scalar types
>for the uniform parameters and llvm::Attributes for the linear ones.
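>
>As a concrete sketch of such a query (Ctx and TLI stand for the
>LLVMContext and TargetLibraryInfo already available in the vectorizer;
>the isFunctionVectorizable overload is the one proposed above):
>
>    // Build the signature the vectorizer needs for VF = 4, i.e.
>    // <4 x double> pow(<4 x double>, <4 x double>), and ask the TLI.
>    Type *DblTy = Type::getDoubleTy(Ctx);
>    Type *VecTy = VectorType::get(DblTy, 4);
>    FunctionType *FTy =
>        FunctionType::get(VecTy, {VecTy, VecTy}, /*isVarArg=*/false);
>    if (TLI->isFunctionVectorizable("pow", FTy)) {
>      // A vector variant with this exact signature exists, so the
>      // scalar call can be widened to it.
>    }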
>
>Notice that the vector name here is not used at all, which is good, as
>any architecture can come up with its own name mangling for vector
>functions without breaking the ability of the vectorizer to vectorize
>the same code under the new name mangling.
>
>## External libraries vs user provided code
>
>The example with "pow" and "foo" I have provided before shows a
>function declaration and a function definition. Although the TLII
>mechanism I have described seems to be valid only for the former case,
>I think that it is valid also for the latter. In fact, in case of a
>function definition, the compiler would have to generate also the body
>of the vector function, but that external global variable could still
>be used to inform the TLII of such function. The fact that the vector
>function needed by the vectorizer is in some module instead of in an
>external library doesn't seems to make all that difference at compile time to me.
>
># Some final notes (call for ideas!)
>
>There is one level of target dependence that I still have to sort out,
>and for this I need input from the community and in particular from the
>Intel folks.
>
>I will start with this example:
>
> #pragma omp declare simd
> float foo(float x);
>
>In case of NEON, this would generate 2 globals, one for vectors holding
>2 floats and one for vectors holding 4 floats, corresponding to NEON
>64-bit and 128-bit respectively. This means that the vectorizer has a
>unique function it can choose from the list the TLI provides, as shown
>below.
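>
>For instance, reusing the notation of Step 1, the two NEON globals for
>"float foo(float x)" would be (illustrative placeholder names):
>
>    @prefix_foo_neon64_midfix_foo_postfix  = external global
>                                             <2 x float>(<2 x float>)
>    @prefix_foo_neon128_midfix_foo_postfix = external global
>                                             <4 x float>(<4 x float>)
>
>and the two function types <2 x float>(<2 x float>) and
><4 x float>(<4 x float>) are enough for the vectorizer to pick the right
>entry for a given vectorization factor.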
>
>This is not the same on Intel, for example when this code generates
>vector names for AVX and AVX2. The register width for these architecture
>extensions is the same, so all the TLI has is a mapping between the
>scalar name and (vector_name, function_type) pairs whose two elements
>differ only in the vector_name string.
>
>This breaks the target independence of the vectorizer, as it would
>require it to parse the vector_name in order to choose between the AVX
>and the AVX2 implementations.
>
>Now, to make this work one would have to encode the SSE/SSE2/AVX/AVX2
>information in the VectorFnInfo structure. Does anybody have an idea of
>how best to do it? For the sake of keeping the vectorizer target
>independent, I would like to avoid encoding this piece of information
>in the VectorFnInfo struct. I have seen that in your code you are
>generating SSE/AVX/AVX2/AVX512 vector functions; how do you plan to
>choose between them in the vectorizer? I could not find how you planned
>to solve this problem in your proposal, or have I just missed it?
>
>Is there a way to do this in the TLII? The function type of the vector
>function could use the "target-feature" attribute of function
>definitions, but how could the vectorizer decide which one to use?
>
>Anyway, that's it. Your feedback will be much appreciated.
>
>Cheers,
>Francesco
>
>________________________________________
>From: Tian, Xinmin <xinmin.tian at intel.com>
>Sent: 30 November 2016 17:16:12
>To: Francesco Petrogalli; llvm-dev at lists.llvm.org
>Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Hi Francesco,
>
>Good to know that you are working on support for this feature. I assume
>you know the RFC below. The VectorABI mangling we proposed was approved
>by the C++ Clang FE name mangling owner, David M. from Google, and the
>Clang FE support was committed to the main trunk by Alexey.
>
>"Proposal for function vectorization and loop vectorization with
>function calls", March 2, 2016. Intel Corp.
>http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
>
>Matt submitted a patch to generate vector variants for function
>definitions, not just function declarations. You may want to take a
>look. Ayal's RFC will also be needed to support vectorization of
>function bodies in general.
>
>I agree, we should have an option -fopenmp-simd to enable SIMD only;
>both GCC and ICC have similar options.
>
>I would suggest we sync up on this work so we don't duplicate the
>effort.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>Francesco Petrogalli via llvm-dev
>Sent: Wednesday, November 30, 2016 7:11 AM
>To: llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>
>Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Dear all,
>
>I have just created a couple of differential reviews to enable the
>vectorisation of loops that have function calls to routines marked with
>"#pragma omp declare simd".
>
>They can be (re)viewed here:
>
>* https://reviews.llvm.org/D27249
>
>* https://reviews.llvm.org/D27250
>
>The current implementation allows the loop vectorizer to generate
>vector code for a source file such as:
>
> #pragma omp declare simd
> double f(double x);
>
> void aaa(double *x, double *y, int N) {
> for (int i = 0; i < N; ++i) {
> x[i] = f(y[i]);
> }
> }
>
>
>by invoking clang with arguments:
>
> $> clang -fopenmp -c -O3 file.c [...]
>
>
>Such functionality should provide a nice interface for vector library
>developers, who can use it to inform the loop vectorizer of the
>availability of an external library with vector implementations of the
>scalar functions used in loops. For this, all that is needed is to mark
>the function declarations in the library's header file with "#pragma
>omp declare simd" and to generate the associated symbols in the
>library's object file according to the naming scheme of the vector ABI
>(see notes below).
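>
>For example (a hypothetical library header; the names are made up):
>
>    /* libvecmath.h - scalar declarations shipped with the library */
>    #pragma omp declare simd
>    double fast_exp(double x);
>
>    #pragma omp declare simd uniform(n)
>    double fast_pown(double x, int n);
>
>As long as the library's object file also provides the corresponding
>vector symbols mangled according to the vector ABI, the vectorizer can
>use them without ever seeing the vector function bodies.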
>
>I am interested in any feedback/suggestion/review the community might
>have regarding this behaviour.
>
>Below you find a description of the implementation and some notes.
>
>Thanks,
>
>Francesco
>
>-----------
>
>The functionality is implemented as follows:
>
>1. Clang CodeGen generates a set of global external variables for each
>of the function declarations marked with the OpenMP pragma. Each such
>global is named according to a mangling generated by
>llvm::TargetLibraryInfoImpl (TLII) and holds the signature of the
>associated vector function. (See the examples in the tests of the clang
>patch; each scalar function can generate multiple vector functions,
>depending on the clauses of the declare simd directives.)
>2. When clang creates the TLII, it processes the llvm::Module and finds
>out which of the globals of the module have the correct mangling and
>type, so that they can be added to the TLII as the list of vector
>functions associated with the original scalar one.
>3. The LoopVectorizer looks up the available vector functions through
>the TLII not by scalar name and vectorisation factor but by scalar name
>and vector function signature, thus enabling the vectorizer to
>distinguish a "vector vpow1(vector x, vector y)" from a "vector
>vpow2(vector x, scalar y)". (The second one corresponds to a "declare
>simd uniform(y)" for a "scalar pow(scalar x, scalar y)" declaration.)
>(Notice that the changes in the loop vectorizer are minimal.)
>
>
>Notes:
>
>1. To enable only the SIMD part of OpenMP, leaving all the
>multithread/target behaviour out, we should also enable this with a new
>option: -fopenmp-simd
>2. The AArch64 vector ABI in the code is essentially the same as the
>Intel one (apart from the prefix and the masking argument), and it is
>based on the clauses associated with "declare simd" in OpenMP 4.0. For
>OpenMP 4.5, the parameters section of the mangled name should be
>updated. This update will not change the vectorizer behaviour, as all
>the vectorizer needs to detect a vectorizable function are the original
>scalar name and a compatible vector function signature. Of course, any
>changes/updates in the ABI will have to be reflected in the symbols of
>the binary file of the library.
>3. While this works only for function declarations, the same
>functionality can be used when (if) clang implements the declare simd
>OpenMP pragma for function definitions.
>4. I have enabled this for any loop that invokes the scalar function
>call, not just for those annotated with "#pragma omp for simd". I don't
>have any preference here, but at the same time I don't see any reason
>why this shouldn't be enabled by default for non-annotated loops. Let me
>know if you disagree; I'd happily change the functionality if there are
>sound reasons to do so.
>
>_______________________________________________
>LLVM Developers mailing list
>llvm-dev at lists.llvm.org
>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev