[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Francesco Petrogalli via llvm-dev llvm-dev at lists.llvm.org
Tue Dec 6 07:21:54 PST 2016


Hi Xinmin,

Thank you for your email.

I have been catching up with the content of your proposal, and I have
some questions/remarks below that I’d like to discuss with you - see
the final section in the proposal.

I have specifically added Alexey B. to the mail so we can move our
conversation from phabricator to the mailing list.

Before we start, I just want to mention that the initial idea of using
llvm::FunctionType for vector function generation and matching has
been proposed by a colleague, Paul Walker, when we first tried out
supporting this on AArch64 on an internal version of llvm. I received
some input also from Amara Emerson.

In our case we had a slightly different problem to solve: we wanted to
support in the vectorizer a rich set of vector math routines provided
with an external library. We managed to do this by adding the pragma
to the (scalar) function declaration of the header file provided with
the library, and as shown by the patches I have submitted, by
generating vector function signatures that the vectorizer can search
in the TargetLibraryInfo.

Here is an updated version of the proposal. Please let me know what
you think, and if you have any solution we could use for the final
section.

# RFC for "pragma omp declare simd"

High-level components:

A) Global variable generator (clang FE)
B) Parameter descriptors (as new enumerations in llvm::Attribute)
C) TLII methods and fields for the multimap (llvm middle-end)

## Workflow

Example user input, with a declaration and definition:

    #pragma omp declare simd
    #pragma omp declare simd uniform(y)
    extern double pow(double x, double y);

    #pragma omp declare simd
    #pragma omp declare simd linear(x:2)
    float foo(float x) {....}

    /// code using both functions

### Step 1


The compiler FE processes these definitions and declarations and
generates a list of globals as follows:

    @prefix_vector_pow1_midfix_pow_postfix = external global
                                             <4 x double>(<4 x double>,
                                                          <4 x double>)
    @prefix_vector_pow2_midfix_pow_postfix = external global
                                             <4 x double>(<4 x double>,
                                                          double)
    @prefix_vector_foo1_midfix_foo_postfix = external global
                                             <8 x float>(<8 x float>,
                                                         <8 x float>)
    @prefix_vector_foo2_midfix_foo_postfix = external global
                                             <8 x float>(<8 x float>,
                                                         <8 x float> #0)
    ...
    attribute #0  = {linear = 2}


Notes about step 1:

1. The mapping scalar name <-> vector name is in the
   prefix/midfix/postfix mangled name of the global variable.
2. The example shows only one set of possible vector functions, for a
   sizeof(<4 x double>) vector extension. If multiple vector extensions
   live in the same target (e.g. NEON 64-bit and NEON 128-bit, or SSE
   and AVX512) the front end takes care of generating each of the
   associated functions (as is done now).
3. Vector function parameters are rendered using the same
   Characteristic Data Type (CDT) rule already in the compiler FE.
4. Uniform parameters are rendered with the original scalar type.
5. Linear parameters are rendered with vectors using the same
   CDT-generated vector length, and decorated with the proper
   attributes. I think we could extend the llvm::Attribute enumeration
   by adding the following:
   - linear: numeric, specifies the step
   - linear_var: numeric, specifies the position of the uniform
     variable holding the step
   - linear_uval[_var]: numeric as before, but for the "uval" modifier
     (both constant step and variable step)
   - linear_val[_var]: numeric, as before, but for the "val" modifier
   - linear_ref[_var]: numeric, for the "ref" modifier

   For example, "attribute #0 = {linear = 2}" says that the vector of
   the associated parameter in the function signature has a linear
   step of 2.
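
To make the naming concrete: the scalar/vector association of step 1
could be recovered from a global's name with a routine along these
lines. This is a self-contained sketch of the hypothetical
prefix/midfix/postfix scheme used in the example above (the actual
mangling is up to the vector ABI, and the real code would live next to
the TLII demangler):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical [pre/mid/post]fix scheme from the example above:
//   prefix_<vector_name>_midfix_<scalar_name>_postfix
// Extracts (vector_name, scalar_name) from a global's name; returns a
// pair of empty strings if the name does not follow the scheme.
std::pair<std::string, std::string>
demangleGlobal(const std::string &GlobalName) {
  const std::string Pre = "prefix_", Mid = "_midfix_", Post = "_postfix";
  if (GlobalName.compare(0, Pre.size(), Pre) != 0)
    return {};
  std::size_t MidPos = GlobalName.find(Mid, Pre.size());
  if (MidPos == std::string::npos)
    return {};
  std::size_t PostPos = GlobalName.rfind(Post);
  if (PostPos == std::string::npos || PostPos < MidPos + Mid.size() ||
      PostPos + Post.size() != GlobalName.size())
    return {};
  std::string VectorName = GlobalName.substr(Pre.size(), MidPos - Pre.size());
  std::string ScalarName =
      GlobalName.substr(MidPos + Mid.size(), PostPos - MidPos - Mid.size());
  return {VectorName, ScalarName};
}
```

For "prefix_vector_pow1_midfix_pow_postfix" this yields
("vector_pow1", "pow"), which is exactly the pair step 2 needs to
populate the multimap.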

### Step 2

The compiler FE invokes a TLII method in BackendUtils.cpp that
populates a multimap in the TLII by checking the globals created in
the previous step.

Each global is processed, demangling the [pre/mid/post]fix name and
generating a mapping in the TLII as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
    };
    std::multimap<std::string, VectorFnInfo> VFInfo;


For the initial example, the multimap in the TLI is populated as follows:

    "pow" -> [(vector_pow1, <4 x double>(<4 x double>, <4 x double>)),
              (vector_pow2, <4 x double>(<4 x double>, double))]

    "foo" -> [(vector_foo1, <8 x float>(<8 x float>, <8 x float>)),
              (vector_foo2, <8 x float>(<8 x float>, <8 x float> #0))]
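
The population and retrieval side of this table can be sketched with a
plain std::multimap. In this self-contained illustration the signature
is a string standing in for the llvm::FunctionType* of the real
proposal:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the proposed TLII mapping.
struct VectorFnInfo {
  std::string Name;
  std::string Signature; // stands in for FunctionType *
};

using VFMap = std::multimap<std::string, VectorFnInfo>;

// Register one vector variant under its scalar name.
void addVectorVariant(VFMap &M, const std::string &ScalarName,
                      VectorFnInfo Info) {
  M.emplace(ScalarName, std::move(Info));
}

// All vector variants registered for a scalar function name.
std::vector<VectorFnInfo> getVariants(const VFMap &M,
                                      const std::string &ScalarName) {
  std::vector<VectorFnInfo> Out;
  auto Range = M.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    Out.push_back(It->second);
  return Out;
}
```

With the "pow" example above, two addVectorVariant calls under the key
"pow" give a getVariants("pow") result of size two, one entry per
vector signature.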

Notes about step 2:

Given that the external globals the FE has generated are removed
_before_ the vectorizer kicks in, I am not sure if the "attribute #0"
needed for one of the parameters is still present at this point. IF
NOT, I think we could enrich "VectorFnInfo" as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
       std::map<unsigned, llvm::Attribute> Attrs;
    };

The field "Attrs" maps the position of the parameter to the
corresponding llvm::Attribute present in the global variable.

I have added this note for the sake of completeness. I *think* we
won't need this additional Attrs field: I have already shown in the
llvm patch I submitted that the function type "survives" after the
global gets removed, so I don't see why the parameter attributes
shouldn't survive too (famous last words?).

### Step 3

This step happens in the LoopVectorizer. The InnerLoopVectorizer
queries the TargetLibraryInfo looking for a vectorized version of the
function by scalar name and function signature with the following method:

    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName, FunctionType *FTy);

This is done in a way similar to what my current llvm patch does: the
loop vectorizer builds the function signature it needs and looks for
it in the TLI. If a match is found, vectorization is possible. Right
now the compiler is not aware of uniform/linear function attributes,
but it can still refer to them in a target-agnostic way, by using
scalar signatures for the uniform parameters and llvm::Attributes for
the linear ones.

Notice that the vector name here is not used at all, which is good:
any architecture can come up with its own name mangling for vector
functions without breaking the ability of the vectorizer to vectorize
the same code under the new name mangling.
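
The query of step 3 then reduces to "does any variant under this
scalar name have exactly this signature?". A self-contained sketch,
again using strings in place of llvm::FunctionType*:

```cpp
#include <map>
#include <string>

// Sketch of the proposed lookup: the vectorizer builds the signature
// it needs and asks whether any registered variant matches scalar
// name AND signature. The vector name plays no role in the query.
struct VectorFnInfo {
  std::string Name;
  std::string Signature; // stands in for FunctionType *
};
using VFMap = std::multimap<std::string, VectorFnInfo>;

bool isFunctionVectorizable(const VFMap &M, const std::string &ScalarName,
                            const std::string &Sig) {
  auto Range = M.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    if (It->second.Signature == Sig)
      return true;
  return false;
}
```

A query for "pow" with the all-vector signature matches vector_pow1,
while a query with the (vector, scalar) signature of the uniform(y)
variant matches vector_pow2; a signature registered for neither
variant simply fails, and the loop stays scalar.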

## External libraries vs user provided code

The example with "pow" and "foo" I have provided before shows a
function declaration and a function definition. Although the TLII
mechanism I have described seems valid only for the former case, I
think it is valid for the latter too. In the case of a function
definition, the compiler would also have to generate the body of the
vector function, but the external global variable could still be used
to inform the TLII of such a function. The fact that the vector
function needed by the vectorizer is in some module instead of an
external library doesn't seem to make much difference at compile time
to me.

# Some final notes (call for ideas!)

There is one level of target dependence that I still have to sort out,
and for this I need input from the community and in particular from
the Intel folks.

I will start with this example:

    #pragma omp declare simd
    float foo(float x);

In the case of NEON, this would generate 2 globals, one for vectors
holding 2 floats and one for vectors holding 4 floats, corresponding
to NEON 64-bit and 128-bit respectively. This means that, for each
vectorization factor, the vectorizer has a unique function it can
choose from the list the TLI provides.

This is not the case on Intel, for example when this code generates
vector names for AVX and AVX2. The register width for these
architecture extensions is the same, so all the TLI has is a mapping
between the scalar name and (vector_name, function_type) pairs whose
two elements differ only in the vector_name string.

This breaks the target independence of the vectorizer, as it would
require it to parse the vector_name to be able to choose between the
AVX or the AVX2 implementation.
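
The contrast between the two cases can be shown on the same
multimap-of-strings sketch used earlier (strings again standing in for
llvm::FunctionType*): counting the variants that match a
(scalar name, signature) query gives exactly one hit in the NEON case,
but two in the AVX/AVX2 case, where only the vector_name tells the
candidates apart.

```cpp
#include <map>
#include <string>

// Illustration of the AVX/AVX2 ambiguity: both variants map the same
// scalar name to the *same* function type, so a (name, signature)
// query returns more than one candidate.
struct VectorFnInfo { std::string Name; std::string Signature; };
using VFMap = std::multimap<std::string, VectorFnInfo>;

unsigned countMatches(const VFMap &M, const std::string &ScalarName,
                      const std::string &Sig) {
  unsigned N = 0;
  auto Range = M.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    if (It->second.Signature == Sig)
      ++N;
  return N;
}
```

In the NEON case the two "foo" variants have distinct signatures
("<2 x float>(<2 x float>)" vs "<4 x float>(<4 x float>)"), so each
query counts one match; in the AVX/AVX2 case both variants carry
"<8 x float>(<8 x float>)" and the count is two, which is precisely
the ambiguity described above.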

Now, to make this work one would have to encode the SSE/SSE2/AVX/AVX2
information somewhere. Does anybody have an idea of how best to do it?
For the sake of keeping the vectorizer target independent, I would
like to avoid encoding this piece of information in the VectorFnInfo
struct. I have seen that in your code you are generating
SSE/AVX/AVX2/AVX512 vector functions; how do you plan to choose
between them in the vectorizer? I could not find how you planned to
solve this problem in your proposal, or have I just missed it?

Is there a way to do this in the TLII? The function type of the vector
function could use the "target-feature" attribute of function
definitions, but how could the vectorizer decide which one to use?

Anyway, that's it. Your feedback will be much appreciated.

Cheers,
Francesco

________________________________________
From: Tian, Xinmin <xinmin.tian at intel.com>
Sent: 30 November 2016 17:16:12
To: Francesco Petrogalli; llvm-dev at lists.llvm.org
Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the  LoopVectorizer

Hi Francesco,

Good to know you are working on the support for this feature. I assume you knew about the RFC below. The VectorABI mangling we proposed was approved by the C++ Clang FE name mangling owner, David M from Google, and the Clang FE support was committed to the main trunk by Alexey.

“Proposal for function vectorization and loop vectorization with function calls”, March 2, 2016. Intel Corp.  http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.

Matt submitted a patch to generate vector variants for function definitions, not just function declarations. You may want to take a look. Ayal's RFC will also be needed to support vectorization of function bodies in general.

I agree, we should have an option -fopenmp-simd to enable SIMD only; both GCC and ICC have similar options.

I would suggest we sync up on this work, so we don't duplicate the effort.

Thanks,
Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Francesco Petrogalli via llvm-dev
Sent: Wednesday, November 30, 2016 7:11 AM
To: llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>
Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Dear all,

I have just created a couple of differential reviews to enable the vectorisation of loops that have function calls to routines marked with “#pragma omp declare simd”.

They can be (re)viewed here:

* https://reviews.llvm.org/D27249

* https://reviews.llvm.org/D27250

The current implementation allows the loop vectorizer to generate vector code for a source file such as:

  #pragma omp declare simd
  double f(double x);

  void aaa(double *x, double *y, int N) {
    for (int i = 0; i < N; ++i) {
      x[i] = f(y[i]);
    }
  }


by invoking clang with arguments:

  $> clang -fopenmp -c -O3 file.c […]


Such functionality should provide a nice interface for vector library developers: it can be used to inform the loop vectorizer of the availability of an external library with vector implementations of the scalar functions in the loops. For this, all that is needed is to mark the function declarations in the header file of the library with "#pragma omp declare simd", and to generate the associated symbols in the object file of the library according to the naming scheme of the vector ABI (see notes below).

I am interested in any feedback/suggestion/review the community might have regarding this behaviour.

Below you find a description of the implementation and some notes.

Thanks,

Francesco

-----------

The functionality is implemented as follows:

1. Clang CodeGen generates a set of external global variables for each of the function declarations marked with the OpenMP pragma. Each such global is named according to a mangling generated by llvm::TargetLibraryInfoImpl (TLII), and holds the vector signature of the associated vector function. (See the examples in the tests of the clang patch. Each scalar function can generate multiple vector functions depending on the clauses of the declare simd directives.)
2. When clang creates the TLII, it processes the llvm::Module and finds out which of the globals of the module have the correct mangling and type, so that they can be added to the TLII as a list of vector functions associated with the original scalar one.
3. The LoopVectorizer looks for the available vector functions through the TLII not by scalar name and vectorization factor but by scalar name and vector function signature, thus enabling the vectorizer to distinguish a "vector vpow1(vector x, vector y)" from a "vector vpow2(vector x, scalar y)". (The second one corresponds to a "declare simd uniform(y)" for a "scalar pow(scalar x, scalar y)" declaration.) (Notice that the changes in the loop vectorizer are minimal.)


Notes:

1. To enable SIMD only for OpenMP, leaving all the multithread/target behaviour out, we should also enable this via a new option: -fopenmp-simd
2. The AArch64 vector ABI in the code is essentially the same as the Intel one (apart from the prefix and the masking argument), and it is based on the clauses associated with "declare simd" in OpenMP 4.0. For OpenMP 4.5, the parameters section of the mangled name should be updated.
This update will not change the vectorizer behaviour, as all the vectorizer needs to detect a vectorizable function is the original scalar name and a compatible vector function signature. Of course, any changes/updates in the ABI will have to be reflected in the symbols of the binary file of the library.
3. While this works only for function declarations, the same functionality can be used when (if) clang implements the declare simd OpenMP pragma for function definitions.
4. I have enabled this for any loop that invokes the scalar function call, not just for those annotated with "#pragma omp for simd". I don't have any preference here, but at the same time I don't see any reason why this shouldn't be enabled by default for non-annotated loops. Let me know if you disagree; I'd happily change the functionality if there are sound reasons to do so.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
