[llvm-dev] [cfe-dev] [RFC] Expose user provided vector function for auto-vectorization.
Simon Moll via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 6 01:31:45 PDT 2019
On 6/4/19 10:48 PM, Francesco Petrogalli wrote:
>
>> Please do not tie the vector-variant mechanism too closely to either VectorABI or OpenMP. We already know that there is more we could do beyond "map"-like SIMD functions.
> I think I understand what you mean here (because of the example below on reductions), but it is the first time I hear the term "map"-like (shame on me?). Can you define what that means?
By "map"-like I mean that the same function is applied to all lanes
individually, as in std::for_each.
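A minimal sketch of what "map"-like means, using a hypothetical scalar function f and a hand-written 2-lane variant (the names are illustrative, not ABI-mangled symbols):

```cpp
#include <array>

// Hypothetical scalar function (not from the RFC).
double f(double x) { return x * x + 1.0; }

// A "map"-like 2-lane vector variant: each lane is processed
// independently, exactly as if std::for_each applied f per element.
std::array<double, 2> f_v2(std::array<double, 2> v) {
    for (double &x : v)
        x = f(x);
    return v;
}
```

Anything beyond this lanewise-independent pattern (reductions, mask introspection, cross-lane shuffles) is what "beyond map" refers to below.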
>> Besides, i guess it makes sense to compile a list of use cases to validate the current design and look ahead. It is easy to lose track of the requirements if it's just a trail of emails.
>>
> I will add the examples in my next iteration of the RFC. Is that OK?
Sounds good!
>> On 6/3/19 7:59 PM, Francesco Petrogalli via cfe-dev wrote:
>>> Hi All,
>>>
>>> The original intent of this thread is to "Expose user provided vector function for auto-vectorization".
>>>
>>> I originally proposed to use OpenMP `declare variant` for the sake of using something that is defined by a standard. The RFC itself is not about fully implementing the `declare variant` directive. In fact, given the amount of complication it is bringing, I would like to move the discussion away from `declare variant`. Therefore, I kindly ask to move any further discussion about `declare variant` to a separate thread.
>>>
>>> I believe that to "Expose user provided vector function for auto-vectorization” we need three components.
>>>
>>> 1. The main component is the IR representation we want to give to this information. My proposal is to use the `vector-variant` attribute with custom symbol redirection.
>>>
>>> vector-variant = {"_ZGVnN2v_f(custom_vector_f_2), _ZGVnN4v_f(custom_vector_f_4)"}
>>>
>>> The names here are made of the Vector Function ABI mangled name, plus a custom symbol redirection in parentheses. I believe that names mangled according to the Vector Function ABI have all the information needed to build the signature of the vector function and the properties of its parameters (linear, uniform, aligned…). This format will cover most (if not all) of the cases that are needed for auto-vectorization. I am not aware of any situation in which this information might not be sufficient. Please provide such an example if you know of any.
>> Does the vector variant inherit all function and parameter attributes from the scalar function?
> Yes.
>
>> This should work ok for map-like SIMD arithmetic.
> Yes.
>
>> However, in light of functions beyond SIMD arithmetic, i think the RFC should specify clearly what we may assume about a vector-variant given its name and scalar function declaration.
>>
> Would renaming the IR attribute from `vector-variant` to `declare-simd` be less controversial here? The behavior of SIMD functions as described in `declare simd` of OpenMP 4.0+ is well defined, and well represented by the Vector Function ABI mangling scheme. At this point in the discussion, the latter name seems a better choice to me.
It's not about the name. Does Vector ABI itself provide any guarantees
on the semantics of vector functions or do these all carry over from the
OpenMP standard?
>> By building the vector-variant mechanism around the current VectorABI/OpenMP, we are also inheriting their limitations, such as:
>>
>> 1. A vector variant may return a scalar value with a property (linear, uniform, aligned, ...). For example, this may be the case for custom reduction functions (my_reduction_operator(<8 x double> %v) --> double).
>>
> Gotcha. This made me take a look at `declare reduction` in the OpenMP standard. Maybe there is a way to handle this information in the Vector Function ABI.
>
> Do you have other examples of vector functions that could return a scalar, other than reductions, and most importantly, whose scalar output would require a `linear` or `aligned` clause? As for the `uniform` property… Does it even make sense to mark an output as uniform?
There could be a function that takes a vector and broadcasts one of its
elements to all SIMD lanes. If the output is uniform, it means that it
is the same for all SIMD lanes, e.g. exactly what you have in a reduction.
I don't have a convincing example for linear/aligned in the OpenMP
context... but how about a vector malloc? That one would return an
aligned, linear pointer. Btw, in RV we also admit "aligned" on integer
parameters and return values. In that case, the alignment is a divisor
of the value.
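To pin down that reading of "aligned"/"linear" on a return value, here is a small scalar emulation (hypothetical helpers, just to make the arithmetic concrete): an aligned integer value is one the alignment divides, and a linear pointer return gives lane i the value base + i * stride:

```cpp
#include <cstdint>

// "aligned" on an integer value: the alignment is a divisor of the value.
bool is_aligned_int(std::uint64_t value, std::uint64_t align) {
    return value % align == 0;
}

// Sketch of a "vector malloc" style return (hypothetical): the return
// value is linear in the lane index, and the lane-0 value is aligned.
std::uint64_t lane_addr(std::uint64_t base, std::uint64_t stride, unsigned lane) {
    return base + stride * lane;
}
```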
>> 2. User-defined vector variants may take the mask at a different parameter position than required by VectorABI.
>> LLVM-VP solves this by introducing the "mask" attribute for function parameters (https://reviews.llvm.org/D57504).
> Does this mean that you might have more than one mask as input, one for each vector (data) parameter and one for the output?
There is still just one mask for the whole function but the position of
the mask parameter in the function signature does not conform to VectorABI.
>> 3. Upcoming vector/SIMD ISAs such as the RISC-V V-extension and NEC SX-Aurora have an active vector length besides just the mask. Whatever solution comes out of this RFC should accommodate that. Just as for the mask, LLVM-VP provides a parameter attribute for the vector length, "vlen".
> This is something that LLVM definitely needs to handle. Mask and vlen attributes for function parameters seem a good idea for representing such signatures in IR.
>
> Can a user use `declare simd` to generate such signatures? Alternatively, what would a user need to write in C code to represent those signatures?
Well, you can argue that for AVL ISAs the mask is simply defined as a
pair, consisting of the mask and the vlen parameter. The mask in terms
of OpenMP is then understood as that pair. If you have "pragma declare
SIMD" functions for AVL targets, this implies the existence of both a
mask and an AVL parameter. I am not aware of any implementation that
does that.
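A sketch of that pairing, assuming an emulated 8-lane variant in which the OpenMP "mask" is understood as the pair (bit mask, active vector length); both the name and the signature are hypothetical:

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t W = 8;

// Hypothetical AVL-style vector variant: a lane participates only if its
// mask bit is set AND its index is below the active vector length (avl).
std::array<double, W> add_v8_masked_avl(std::array<double, W> a,
                                        std::array<double, W> b,
                                        std::array<bool, W> mask,
                                        std::size_t avl) {
    std::array<double, W> r = a; // inactive lanes pass through a
    for (std::size_t i = 0; i < W && i < avl; ++i)
        if (mask[i])
            r[i] = a[i] + b[i];
    return r;
}
```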
>> 4. For SIMD functions beyond "map", the behavior of a SIMD function may depend significantly on the mask. In this case, even the scalar function would need to be marked as "convergent" (but only if the code is actually going to be vectorized...). E.g., memory accesses (store_f64 -> store_v8f64(<8 x i1> %M)) or a function that simply returns the mask.
>>
> I am not sure I understand this item. Can you write a specific example?
```
#pragma omp simd
for (i = 0; i < 8; ++i) {
  if (i % 3) {
    auto v = foo(); // <-- no side effect, but must not be hoisted, or the value of "v" will change.
    if (i % 5) use(A, v);
  }
}
```
where
```
foo() { ret i1 1; }
  --> _ZGV*M8_foo(v8i1 %M) { ret i1 %M }
use(A, i1 %V) { A[sext i32 %V] = 42.0; }
  --> _ZGV*M8uv_use(double* %A, v8i1 %v, v8i1 %M) { masked_store(A[i32 popcnt %v], 42.0, %M); }
```
Contrived? Yes! But, e.g., SIMD raytracing codes introspect the mask in
similar ways.
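The mask dependence can be emulated in plain C++ (hypothetical stand-ins for the ABI symbols above): the vector variant of foo returns the mask itself, so its per-lane result depends on which lanes are active at the call site, and hoisting the call above the `if (i % 3)` guard would change it:

```cpp
#include <array>

constexpr int W = 8;
using Mask = std::array<bool, W>;

// Emulation of the _ZGV*M8_foo example: scalar foo() always returns true,
// but the vector variant returns the mask, making the result call-site
// dependent. This is why the scalar function needs "convergent".
Mask foo_v8(Mask m) { return m; }

// Mask as seen under "if (i % 3)": lane i is active iff i % 3 != 0.
// Hoisting the call to the loop-body top would pass an all-true mask
// instead and produce different per-lane values.
Mask guard_mod3() {
    Mask m{};
    for (int i = 0; i < W; ++i)
        m[i] = (i % 3 != 0);
    return m;
}
```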
>
>
>
>> This is the same issue that the GPU folks are discussing for thread-group semantics: http://lists.llvm.org/pipermail/llvm-dev/2018-December/128662.html
>>
> OK, but… SIMD != target? Or am I missing something? Here we are representing only SIMD.
SIMD-izing parallel iterations/threads for CPU is largely the same as
generating code for GPUs. It's warp-level code gen on GPU versus SIMD
code gen on CPU. You are bound to run into similar issues with anything
that is more complex than a simple loop nest.
This might not be what OpenMP says but the function mapping mechanism
you propose here should be general enough to work outside of OpenMP as
well (eg mapping custom reduction operators, bringing ISPC/RV style SIMD
to LLVM proper).
>> ISPC (http://ispc.github.io/) and the more general Region Vectorizer (https://github.com/cdl-saarland/rv) are examples of frameworks that actually implement thread-group semantics for vectorization, including "wavefront" intrinsics, etc.
>>
>>> We can attach the IR attribute to call instructions (preferred for avoiding conflicts when merging modules that don't see the same attributes), to the function declaration, or both.
>>>
>>> 2. The second component is a tool that other parts of LLVM (for example, the loop vectorizer) can use to query the availability of the vector function, the SVFS I have described in the original post of the RFC, which is based on interpreting the `vector-variant` attribute.
>> The SVFS seems similar to the function resolver API in RV (https://github.com/cdl-saarland/rv/blob/master/include/rv/resolver/resolver.h). To clarify, RV's resolver API is all about flexibility, eg we use it to implement inter-procedural vectorization, OpenMP declare simd and SLEEF vector math.
> That seems to be a good thing! The SVFS has a limited scope at the moment, but nothing prevents extending it.
>
>> However, it does not commit to a specific order/prioritization of vector variants.
>>
> For `declare simd` functions redirected via `declare variant`, the order/prioritization is defined by the OpenMP standard. Are you saying that you are ignoring those rules?
You mean 2.3.3 of the OpenMP standard? We don't implement that. RV
simply picks up the stray Vector ABI strings in scalar function
attributes, auto-generates all those vector functions and registers the
definitions with the RV resolver API (~SVFS).
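To illustrate the kind of query an SVFS/resolver answers, here is a hypothetical sketch (not the actual LLVM or RV API) that parses one `vector-variant` entry in the format proposed above and maps (scalar name, VF) to the redirected vector symbol; it ignores scalable VLEN ("x") for simplicity:

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

struct VecFnMap {
    std::map<std::pair<std::string, unsigned>, std::string> table;

    // Parse one attribute entry, e.g. "_ZGVnN2v_f(custom_vector_f_2)".
    // Layout (per the RFC): "_ZGV" <isa:1> <mask:1> <VLEN> <params> "_" <scalar>,
    // with the redirected symbol in parentheses.
    void addVariant(const std::string &entry) {
        std::size_t open = entry.find('(');
        std::size_t close = entry.find(')');
        if (open == std::string::npos || close == std::string::npos || open < 7)
            return;
        std::string mangled = entry.substr(0, open);
        std::string vector = entry.substr(open + 1, close - open - 1);
        if (!std::isdigit(static_cast<unsigned char>(mangled[6])))
            return; // scalable VLEN ("x") not handled in this sketch
        unsigned vf = std::stoul(mangled.substr(6));
        std::string scalar = mangled.substr(mangled.rfind('_') + 1);
        table[{scalar, vf}] = vector;
    }

    // SVFS-style query: is a vector variant available for (scalar, VF)?
    std::string lookup(const std::string &scalar, unsigned vf) const {
        auto it = table.find({scalar, vf});
        return it == table.end() ? std::string() : it->second;
    }
};
```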
>> You also mentioned splitting vector functions when no vector variant for the full vectorization factor is available.
> I meant joining, not splitting, as in wrapping a 2-lane vector version twice to perform 4-lane vectorization. But this is just an idea; we definitely need to develop a cost model for that.
+1 (we mean the same thing btw, I just explained it in the opposite
direction).
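A sketch of that joining idea (the names echo the RFC's custom_vector_f_* redirections; the 2-lane body is a stand-in): a 4-lane variant synthesized from two calls to the existing 2-lane one, on the low and high halves:

```cpp
#include <array>

// Stand-in body for the user-provided 2-lane variant.
std::array<double, 2> custom_vector_f_2(std::array<double, 2> v) {
    return {v[0] * 2.0, v[1] * 2.0};
}

// Hypothetical "joined" 4-lane variant: two 2-lane invocations,
// not a native 4-lane implementation. Keeping this split visible
// (rather than hiding it in an opaque wrapper) is what lets cost
// models such as SLP's reason about the real instruction mix.
std::array<double, 4> joined_f_4(std::array<double, 4> v) {
    std::array<double, 2> lo = custom_vector_f_2({v[0], v[1]});
    std::array<double, 2> hi = custom_vector_f_2({v[2], v[3]});
    return {lo[0], lo[1], hi[0], hi[1]};
}
```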
>> I suggest to not hide this split call in an opaque wrapper function.
> OK.
>
>> In particular the cost model of the SLP vectorizer would benefit from this information..
> Sorry, which information would be beneficial for the SLP vectorizer? I am missing the context here. Do you mean that the SLP vectorizer would benefit from knowing that the 4-lane version is not a "pure" 4-lane version, but made of two invocations of the 2-lane version?
Exactly. This may lead to different instruction pairings in SLP.
>> and by extension also future versions of the loop/function vectorizer.
>>
>>> The final component is the one that seems to have generated most of the controversies discussed in the thread, and for which I decided to move away from `declare variant`.
>>>
>>> 3. The third component is a set of descriptors that can be attached to the scalar function declaration / definition in the C/C++ source file, to inform about the availability of an associated vector function that can be used when/if needed.
>>>
>>> As someone has suggested, we should use a custom attribute. Because the mangling scheme of the Vector Function ABI provides all the information about the shape and properties of the vector function, I propose the approach exemplified in the following code:
>>>
>>>
>>> ```
>>> // AArch64 Advanced SIMD compilation
>>> double foo(double) __attribute__((simd_variant("nN2v", "neon_foo")));
>>> float64x2_t neon_foo(float64x2_t x) {…}
>>>
>>> // x86 SSE compilation
>>> double foo(double) __attribute__((simd_variant("aN2v", "sse_foo")));
>>> __m128d sse_foo(__m128d x) {…}
>>> ```
>>>
>>> The attribute would use the “core” tokens of the mangled names (without _ZGV prefix and the scalar function name postfix) to describe the vector function provided in the redirection.
>> Since this attribute implies the "_ZGV" prefix, shouldn't it rather be called "vectorabi_variant”?
> Sure. Although, for the sake of renaming, given that `declare variant` maps directly to the Vector Function ABI mangling scheme of the target, as I already mentioned, I think we should opt for naming the attribute as `declare-simd`.
Ok... how would you extend the attribute to also cover reduction
functions? I just want to avoid that the vector function APIs in LLVM
are cut down to fit the current Vector ABI too tightly, as if the
Vector ABI were the ultimate solution in vector function mapping.
>>> Formal syntax:
>>>
>>> ```
>>> __attribute__((simd_variant("<isa><mask><VLEN><par_type_list>", "custom_vector_name")))
>>>
>>> <isa> := "a" (SSE), "b" (AVX), …, "n" (NEON), "s" (SVE) (from the Vector Function ABI specifications of each of the targets that support this, for now AArch64 and x86)
>>>
>>> <mask> := "N" for no mask, or "M" for masking
>>>
>>> <VLEN> := number of lanes in a vector | "x" for scalable vectorization (defined in the AArch64 Vector Function ABI)
>>>
>>> <par_type_list> := "v" | "l" | … all these tokens are defined in the Vector Function ABI of the target (which gets selected by the <isa>). FWIW, they are the same for x86 and AArch64.
>>> ```
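For concreteness, a hypothetical parser for that token string (illustrative only, not a proposed implementation): it splits e.g. "nN2v" into the isa, mask, VLEN and parameter-token components:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Decomposed form of the attribute's first argument, e.g. "nN2v".
struct SimdVariantTokens {
    char isa;           // one character, e.g. 'n' (NEON), 'a' (SSE)
    bool masked;        // 'M' (masked) vs 'N' (unmasked)
    std::string vlen;   // digits, or "x" for scalable vectorization
    std::string params; // remaining parameter token list, e.g. "v", "l"
};

bool parseTokens(const std::string &s, SimdVariantTokens &out) {
    if (s.size() < 3)
        return false;
    out.isa = s[0];
    if (s[1] != 'N' && s[1] != 'M')
        return false;
    out.masked = (s[1] == 'M');
    std::size_t i = 2;
    if (s[i] == 'x') {
        out.vlen = "x";
        ++i;
    } else {
        out.vlen.clear();
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i])))
            out.vlen += s[i++];
        if (out.vlen.empty())
            return false;
    }
    out.params = s.substr(i);
    return true;
}
```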
>>>
>>> Please let me know what you think about this proposal. I will rework it and submit a new RFC if that makes it easier to follow, but before rewriting everything I want some feedback on this change.
>>>
>>> Kind regards,
>>>
>>> Francesco
>>>
>>>> On May 31, 2019, at 8:17 PM, Doerfert, Johannes<jdoerfert at anl.gov> wrote:
>>>>
>>>> On 06/01, Saito, Hideki wrote:
>>>>> Page 22 of OpenMP 5.0 specification (Lines 13/14):
>>>>>
>>>>> When any thread encounters a simd construct, the iterations of the loop associated with the
>>>>> construct may be executed concurrently using the SIMD lanes that are available to the thread
>>>>>
>>>>> This is the Execution Model. The word here is "may" i.e., not "must".
>> As long as this reads "may" and there is no clear semantics for "concurrent execution using the SIMD lanes", "pragma omp simd" is precluded from advancing from "vectorize this loop" to a SPMD-like programming model for vectorization as it is common place in the GPU domain.
>>
>> Thanks!
>>
> Thank you!
>
> Francesco
Some of the things I am bringing up here may seem hypothetical if you
"just" want to map SIMD arithmetic in the current VectorABI/OpenMP. Fancy
function mappings certainly do not have to be addressed by this first
implementation. We should nevertheless be mindful to keep the mapping
infrastructure extensible in those directions. Otherwise, the next time
vector function mappings come up, we won't be dealing with just OpenMP,
VectorABI and the new stuff, but also with our legacy vector mapping
infrastructure that turned out to be incompatible with the new requirements.
- Simon
>
>> Simon
>>
>> --
>>
>> Simon Moll
>> Researcher / PhD Student
>>
>> Compiler Design Lab (Prof. Hack)
>> Saarland University, Computer Science
>> Building E1.3, Room 4.31
>>
>> Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
>> Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll
>>
--
Simon Moll
Researcher / PhD Student
Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://compilers.cs.uni-saarland.de/people/moll