[cfe-dev] [llvm-dev] [RFC] Expose user provided vector function for auto-vectorization.

Tue Jun 4 13:48:55 PDT 2019

> On Jun 4, 2019, at 7:27 AM, Simon Moll <moll at cs.uni-saarland.de> wrote:
> 
> Hi,
> 

Hi Simon!

> I think this is going in the right direction.

Good!

> Please do not tie the vector-variant mechanism too closely to either VectorABI or OpenMP. We already know that there is more we could do beyond "map"-like SIMD functions.

I think I understand what you mean here (because of the example below on reductions), but this is the first time I have heard the term “map”-like (shame on me?). Can you define what that means?

> Besides, I guess it makes sense to compile a list of use cases to validate the current design and look ahead. It is easy to lose track of the requirements if it's just a trail of emails.
> 

I will add the examples in my next iteration of the RFC. Is that OK?

> On 6/3/19 7:59 PM, Francesco Petrogalli via cfe-dev wrote:
>> Hi All,
>> 
>> The original intent of this thread is to "Expose user provided vector function for auto-vectorization."
>> 
>> I originally proposed to use OpenMP `declare variant` for the sake of using something that is defined by a standard. The RFC itself is not about fully implementing the `declare variant` directive. In fact, given the amount of complication it is bringing, I would like to move the discussion away from `declare variant`. Therefore, I kindly ask to move any further discussion about `declare variant` to a separate thread.
>> 
>> I believe that to "Expose user provided vector function for auto-vectorization" we need three components.
>> 
>> 1. The main component is the IR representation we want to give to this information. My proposal is to use the `vector-variant` attribute with custom symbol redirection.
>> 
>> 	vector-variant = {"_ZGVnN2v_f(custom_vector_f_2), _ZGVnN4v_f(custom_vector_f_4)"}
>> 
>> The names here are made of the Vector Function ABI mangled name, plus custom symbol redirection in parentheses. I believe that names mangled according to the Vector Function ABI have all the information needed to build the signature of the vector function and the properties of its parameters (linear, uniform, aligned…). This format will cover most (if not all) of the cases that are needed for auto-vectorization. I am not aware of any situation in which this information might not be sufficient. Please provide such an example if you know of any.
> 
> Does the vector variant inherit all function and parameter attributes from the scalar function?

Yes.

> This should work ok for map-like SIMD arithmetic.

Yes.

> However, in light of functions beyond SIMD arithmetic, I think the RFC should specify clearly what we may assume about a vector-variant given its name and scalar function declaration.
> 

Would renaming the IR attribute from `vector-variant` to `declare-simd` be less controversial here? The behavior of SIMD functions as described in `declare simd` of OpenMP 4.0+ is well defined, and well represented by the Vector Function ABI mangling scheme. At this point in the discussion, the latter name seems a better choice to me.

> By building the vector-variant mechanism around the current VectorABI/OpenMP, we are also inheriting their limitations, such as:
> 
> 1. A vector variant may return a scalar value with a property (linear, uniform, aligned, ..). For example, this may be the case for custom reduction functions (my_reduction_operator(<8 x double> %v) --> double).
> 

Gotcha. This made me take a look at `declare reduction` in the OpenMP standard. Maybe there is a way to handle this information in the Vector Function ABI.

Do you have other examples of vector functions that could return a scalar, other than reductions, and most importantly, whose scalar output parameter would require a `linear` or `aligned` clause? As for the `uniform` property… Does it even make sense to mark an output as uniform?

> 2. User-defined vector variants may take the mask at a different parameter position than required by VectorABI.
> LLVM-VP solves this by introducing the "mask" attribute for function parameters (https://reviews.llvm.org/D57504).

Does this mean that you might have more than one mask in input, one for each vector (data) parameter and one for the output?

> 
> 3. Upcoming Vector/SIMD ISAs such as the V-extension and NEC SX-Aurora have an active vector length besides just the mask. Whatever solution comes out of this RFC should accommodate that. Just as for the mask, LLVM-VP provides a parameter attribute for the vector length, "vlen".

This is something that LLVM definitely needs to handle. Mask and vlen attributes for function parameters seem like a good way to represent such signatures in IR.

Can a user use `declare simd` to generate such a signature? Alternatively, what would a user need to write in C code to be able to represent those signatures?
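
To make the question concrete, here is a minimal sketch of the kind of signature under discussion, written in plain C with arrays standing in for vector registers. Everything in it is an assumption for illustration only: the name `f_masked_vlen`, the `MAX_LANES` bound, the parameter order, and the scalar function f(x) = x + 1. It does not show anything that `declare simd` or LLVM-VP generates today.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on the number of lanes in one vector. */
enum { MAX_LANES = 8 };

/* Hypothetical vector variant of a scalar f(x) = x + 1 that takes both an
 * explicit mask and an active vector length. */
static void f_masked_vlen(const double in[MAX_LANES], double out[MAX_LANES],
                          const unsigned char mask[MAX_LANES], size_t vlen) {
    assert(vlen <= MAX_LANES);
    for (size_t i = 0; i < vlen; ++i) { /* lanes at or beyond vlen are inactive */
        if (mask[i])                    /* masked-off lanes are left untouched */
            out[i] = in[i] + 1.0;
    }
}
```

In this sketch the mask and the vector length are ordinary trailing parameters; an IR-level representation would presumably tag them with the "mask" and "vlen" parameter attributes mentioned above instead of relying on position.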

> 
> 4. For SIMD functions beyond "map", the behavior of a SIMD function may significantly depend on the mask. In this case already the scalar function would need to be marked as "convergent" (but only if the code is actually going to be vectorized..). Eg, memory accesses (store_f64 -> store_v8f64(<8 x i1> %M)) or a function that simply returns the mask.
> 

I am not sure I understand this item. Can you write a specific example?

> This is the same issue that the GPU folks are discussing for thread-group semantics: http://lists.llvm.org/pipermail/llvm-dev/2018-December/128662.html
> 

OK, but… SIMD != target? Or am I missing something? Here we are representing only SIMD.

> ISPC (http://ispc.github.io/) and the more general Region Vectorizer (https://github.com/cdl-saarland/rv) are examples of frameworks that actually implement thread-group semantics for vectorization, including "wavefront" intrinsics, etc.
> 
>> We can attach the IR attribute to call instructions (preferred for avoiding conflicts when merging modules that don’t see the same attributes) or to function declarations, or both.
>> 
>> 2. The second component is a tool that other parts of LLVM (for example, the loop vectorizer) can use to query the availability of the vector function, the SVFS I have described in the original post of the RFC, which is based on interpreting the `vector-variant` attribute.
> 
> The SVFS seems similar to the function resolver API in RV (https://github.com/cdl-saarland/rv/blob/master/include/rv/resolver/resolver.h). To clarify, RV's resolver API is all about flexibility, eg we use it to implement inter-procedural vectorization, OpenMP declare simd and SLEEF vector math. 

That seems to be a good thing! The SVFS has a limited scope at the moment, but nothing prevents extending it.
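
As a rough illustration of the kind of query such a tool answers, here is a minimal sketch in C. It is not the SVFS interface from the RFC nor RV's resolver API: the `variant_record` type, the `query_variant` function, and the table layout are all hypothetical, and a real implementation would build its records by interpreting the IR attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One available vector variant of a scalar function (hypothetical record). */
typedef struct {
    const char *scalar_name; /* name of the scalar function, e.g. "f" */
    unsigned lanes;          /* vectorization factor of the variant */
    int masked;              /* 1 if the variant takes a mask, 0 otherwise */
    const char *vector_name; /* symbol to call instead of the scalar function */
} variant_record;

/* Returns the vector symbol matching the request, or NULL if none is available. */
static const char *query_variant(const variant_record *table, size_t count,
                                 const char *scalar_name, unsigned lanes,
                                 int masked) {
    assert(table != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (strcmp(table[i].scalar_name, scalar_name) == 0 &&
            table[i].lanes == lanes && table[i].masked == masked)
            return table[i].vector_name;
    }
    return NULL;
}
```

A vectorizer would ask for a given scalar name, vectorization factor, and masking requirement, and fall back to other strategies when the answer is NULL.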

> However, it does not commit to a specific order/prioritization of vector variants.
> 

For `declare simd` functions redirected via `declare variant`, the order/prioritization is defined by the OpenMP standard. Are you saying that you are ignoring those rules?

> You also mentioned splitting vector functions when no vector variant for the full vectorization factor is available.

I meant joining, not splitting, as in wrapping a 2-lane vector version twice to perform 4-lane vectorization. But this is just an idea; we definitely need to develop a cost model for that.
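
To illustrate the idea only, here is a minimal sketch in C with plain arrays standing in for vector registers. The names `custom_vector_f_2` and `joined_f_4` and the scalar function f(x) = 2 * x + 1 are assumptions made up for the example, and no cost model is implied.

```c
#include <assert.h>

/* Hypothetical user-provided 2-lane variant of the scalar f(x) = 2 * x + 1. */
static void custom_vector_f_2(const double in[2], double out[2]) {
    for (int i = 0; i < 2; ++i)
        out[i] = 2.0 * in[i] + 1.0;
}

/* 4-lane call assembled from two calls to the 2-lane variant,
 * one per half of the input. */
static void joined_f_4(const double in[4], double out[4]) {
    assert(in && out);
    custom_vector_f_2(in, out);         /* lanes 0..1 */
    custom_vector_f_2(in + 2, out + 2); /* lanes 2..3 */
}
```

Whether the two calls are visible to the vectorizer or hidden inside a wrapper like `joined_f_4` is exactly the design question here.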

> I suggest to not hide this split call in an opaque wrapper function.

OK. 

> In particular the cost model of the SLP vectorizer would benefit from this information..

Sorry, which information would be beneficial for the SLP vectorizer? I am missing the context here. Do you mean that the SLP vectorizer would benefit from knowing that the 4-lane version is not a “pure” 4-lane version, but made of two invocations of the 2-lane version?

> and by extension also future versions of the loop/function vectorizer.
> 
>> The final component is the one that seems to have generated most of the controversies discussed in the thread, and for which I decided to move away from `declare variant`.
>> 
>> 3. The third component is a set of descriptors that can be attached to the scalar function declaration / definition in the C/C++ source file, to inform the compiler about the availability of associated vector functions that can be used when / if needed.
>> 
>> As someone has suggested, we should use a custom attribute. Because the mangling scheme of the Vector Function ABI provides all the information about the shape and properties of the vector function, I propose the approach exemplified in the following code:
>> 
>> 
>> ```
>> // AArch64 Advanced SIMD compilation
>> double foo(double) __attribute__((simd_variant("nN2v", "neon_foo")));
>> float64x2_t neon_foo(float64x2_t x) {…}
>> 
>> // x86 SSE compilation
>> double foo(double) __attribute__((simd_variant("aN2v", "sse_foo")));
>> __m128d sse_foo(__m128d x) {…}
>> ```
>> 
>> The attribute would use the “core” tokens of the mangled names (without _ZGV prefix and the scalar function name postfix) to describe the vector function provided in the redirection.
> Since this attribute implies the "_ZGV" prefix, shouldn't it rather be called "vectorabi_variant"?

Sure. Although, for the sake of renaming, given that `declare simd` maps directly to the Vector Function ABI mangling scheme of the target, as I already mentioned, I think we should opt for naming the attribute `declare-simd`.
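
As a small illustration of that mapping, the sketch below expands the core tokens of the attribute into one entry of the IR attribute string, following the "_ZGV" prefix plus scalar name plus redirection layout shown earlier in the thread. The helper name `build_variant_entry` is hypothetical, and the code is only a sketch of the string manipulation, not of any existing Clang or LLVM code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "_ZGV" + core + "_" + scalar name +
 * "(" + custom vector name + ")" into `out`.
 * Returns the number of characters written, or -1 if `out` is too small. */
static int build_variant_entry(char *out, size_t out_size, const char *core,
                               const char *scalar_name,
                               const char *vector_name) {
    assert(out && core && scalar_name && vector_name);
    int n = snprintf(out, out_size, "_ZGV%s_%s(%s)", core, scalar_name,
                     vector_name);
    if (n < 0 || (size_t)n >= out_size)
        return -1; /* formatting error or truncated output */
    return n;
}
```

With the AArch64 example from the proposal, the core tokens "nN2v", the scalar name "foo", and the redirection "neon_foo" would give the entry `_ZGVnN2v_foo(neon_foo)`.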

>> Formal syntax:
>> 
>> ```
>> __attribute__((simd_variant("<isa><mask><VLEN><par_type_list>", "custom_vector_name")))
>> 
>> <isa> := "a" (SSE), "b" (AVX), …, "n" (NEON), "s" (SVE) (from the Vector Function ABI specifications of each of the targets that support this, for now AArch64 and x86)
>> 
>> <mask> := "N" for no mask, or "M" for masking
>> 
>> <VLEN> := number of lanes in a vector | "x" for scalable vectorization (defined in the AArch64 Vector Function ABI).
>> 
>> <par_type_list> := "v" | "l" | … all these tokens are defined in the Vector Function ABI of the target (which gets selected by the <isa>). FWIW, they are the same for x86 and AArch64.
>> ```
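
The grammar above can be sketched as a small parser in C. This is only an illustration under stated assumptions: the `variant_core` type and `parse_variant_core` function are hypothetical, the ISA letter is stored without being validated against any target, and the parameter tokens are kept as an uninterpreted string rather than decoded.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Decoded form of "<isa><mask><VLEN><par_type_list>" (hypothetical type). */
typedef struct {
    char isa;            /* ISA letter, to be interpreted per target ABI */
    int masked;          /* 1 for "M", 0 for "N" */
    int scalable;        /* 1 when VLEN is "x" */
    unsigned long lanes; /* number of lanes; 0 when scalable */
    const char *params;  /* remaining parameter tokens, e.g. "v" or "vl" */
} variant_core;

/* Returns 0 on success, -1 if the string does not match the grammar. */
static int parse_variant_core(const char *s, variant_core *out) {
    assert(s && out);
    if (strlen(s) < 4)
        return -1; /* need isa, mask, VLEN, and at least one parameter token */
    out->isa = s[0];
    if (s[1] == 'N')
        out->masked = 0;
    else if (s[1] == 'M')
        out->masked = 1;
    else
        return -1;
    const char *p = s + 2;
    if (*p == 'x') { /* scalable vectorization */
        out->scalable = 1;
        out->lanes = 0;
        ++p;
    } else if (isdigit((unsigned char)*p)) { /* fixed number of lanes */
        char *end;
        out->scalable = 0;
        out->lanes = strtoul(p, &end, 10);
        p = end;
    } else {
        return -1;
    }
    if (*p == '\0')
        return -1; /* empty parameter list */
    out->params = p;
    return 0;
}
```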
>> 
>> Please let me know what you think about this proposal. I will rework the proposal if it makes it easier to follow and submit a new RFC about it, but before getting into rewriting everything I want to have some feedback on this change.
>> 
>> Kind regards,
>> 
>> Francesco
>> 
>>> On May 31, 2019, at 8:17 PM, Doerfert, Johannes <jdoerfert at anl.gov> wrote:
>>> 
>>> On 06/01, Saito, Hideki wrote:
>>>> Page 22 of OpenMP 5.0 specification (Lines 13/14):
>>>> 
>>>> 	When any thread encounters a simd construct, the iterations of the loop associated with the
>>>> 	construct may be executed concurrently using the SIMD lanes that are available to the thread
>>>> 
>>>> This is the Execution Model. The word here is "may" i.e., not "must".
> 
> As long as this reads "may" and there is no clear semantics for "concurrent execution using the SIMD lanes", "pragma omp simd" is precluded from advancing from "vectorize this loop" to a SPMD-like programming model for vectorization, as is commonplace in the GPU domain.
> 
> Thanks!
> 

Thank you!

Francesco

> Simon