[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Graham Hunter via llvm-dev llvm-dev at lists.llvm.org
Wed Jul 4 06:13:36 PDT 2018


Hi Simon,

Replies inline.

-Graham

> I am the main author of RV, the Region Vectorizer (github.com/cdl-saarland/rv). I want to share our standpoint as potential users of the proposed vector-length agnostic IR (RISC-V, ARM SVE).
> -- support for `llvm.experimental.vector.reduce.*` intrinsics --
> RV relies heavily on predicate reductions (`or` and `and` reduction) to tame divergent loops and provide a vector-length agnostic programming model on LLVM IR. I'd really like to see these adopted early on in the new VLA backends so we can fully support these targets from the start. Without these generic intrinsics, we would either need to emit target specific ones or go through the painful process of VLA-style reduction trees with loops or the like.

The vector reduction intrinsics were originally created to support SVE in order to avoid loops, so we'll definitely be using them.
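
For illustration, here's a minimal sketch of such a predicate reduction in IR, assuming the intrinsic is extended to the proposed scalable types (the exact name mangling below is my assumption, not settled syntax):

  declare i1 @llvm.experimental.vector.reduce.or.nxv4i1(<scalable 4 x i1>)

  ; an "any lane active?" test over a scalable predicate vector,
  ; with no loop or reduction tree required
  define i1 @any_active(<scalable 4 x i1> %mask) {
    %any = call i1 @llvm.experimental.vector.reduce.or.nxv4i1(<scalable 4 x i1> %mask)
    ret i1 %any
  }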

> -- setting the vector length (MVL) --
> I really like the idea of the `inherits_vlen` attribute. Absence of this attribute in a callee means we can safely stop tracking the vector length across the call boundary.
> However, I think there are some issues with the `vlen token` approach.
> * Why do you need an explicit vlen token if there is a 1:1 (or 1:0) correspondence between functions and vlen tokens?

I think there's a bit of a mix-up here... my proposal doesn't feature tokens. Robin's proposal earlier in the year did, but I think we've reached a consensus that they aren't necessary.

We do need to decide where to place the checks that determine which function an instruction belongs to before allowing it to be copied between functions:

1. Solely within the passes that perform cross-function optimizations. Lightweight, but easy to get wrong.

2. Within generic methods that insert instructions into blocks. Probably more code changes than option 1. May run into problems if an instruction is cloned first (and therefore has no parent to check -- looking at operands/uses may suffice, though).

3. Within size queries. Probably insufficient in places where entire blocks are copied without examining the types of each individual instruction, and it also suffers from the same cloning problem.

My current idea is to proceed with option 2 with some additional checks where needed.
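
To make the hazard concrete, here's a hedged sketch (function and value names are illustrative): if @f and @g may execute with different vscale values, then moving the add below from @f into @g silently changes how much data it operates on, even though its IR type is unchanged. That is the condition option 2 would check at insertion time.

  define <scalable 4 x i32> @f(<scalable 4 x i32> %a, <scalable 4 x i32> %b) {
    ; operates on vscale(@f) * 4 i32 elements; hoisting or inlining it
    ; into a function with a different vscale changes that amount
    %sum = add <scalable 4 x i32> %a, %b
    ret <scalable 4 x i32> %sum
  }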

> * My main concern is that you are navigating towards a local optimum here. All is well as long as there is only one vector length per function. However, if the architecture supports changing the vector length at any point but you explicitly forbid it, programmers will complain; well, I will for one ;-) Once you give in to that demand, you are facing the situation that multiple vector length tokens are live within the same function. This means you have to stop transformations from mixing vector operations with different vector lengths: these would otherwise incur an expensive state change at every vlen transition. However, there is no natural way to express that two SSA values (vlen tokens) must not be live at the same program point.

So I think we've agreed that the notion of vscale is consistent within a function, so that all size comparisons and stack allocations use the maximum size for that function.
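
As a small concrete example (using the proposed scalable type syntax), a stack slot for a scalable vector is sized for the function's maximum vscale, regardless of any shorter effective length used by the operations that fill it:

  ; reserves vscale * 4 * 32 bits of stack; vscale is fixed for the
  ; duration of this function
  %slot = alloca <scalable 4 x i32>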

However, use of setvl or predication changes the effective length inside the function. This is already the case for masked loads and stores -- although an AVX512 vector is 512 bits in size, a different amount of data can be transferred to/from memory. 
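
For reference, the existing fixed-width form already separates the type's size from the amount of memory touched; the mask operand controls the latter:

  ; loads only the lanes whose mask bit is set; inactive lanes take
  ; their values from the passthru operand
  declare <8 x i64> @llvm.masked.load.v8i64.p0v8i64(<8 x i64>*, i32, <8 x i1>, <8 x i64>)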

Robin will be working on the best way to represent setvl, whereas SVE will just use <scalable n x i1> predicate vectors to control length.
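
The scalable equivalent would take the same shape, assuming the masked intrinsics are extended to scalable types (again, the mangling is my assumption):

  ; the effective length is whatever the <scalable 4 x i1> predicate allows
  declare <scalable 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<scalable 4 x i32>*, i32, <scalable 4 x i1>, <scalable 4 x i32>)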

> On 06/11/2018 05:47 PM, Robin Kruppe via llvm-dev wrote:
>> There are some operations that use vl for things other than simple
>> masking. To give one example, "speculative" loads (which silence
>> some exceptions to safely permit vectorization of some loops with
>> data-dependent exits, such as strlen) can shrink vl as a side effect.
>> I believe this can be handled by modelling all relevant operations
>> (including setvl itself) as intrinsics that have side effects or
>> read/write inaccessible memory. However, if you want to have the
>> "current" vl (or equivalent mask) around as SSA value, you need to
>> "reload" it after any operation that updates vl. That seems like it
>> could get a bit complex if you want to do it efficiently (in the
>> limit, it seems equivalent to SSA construction).
>> 
> I think modeling the vector length as state isn't as bad as it may sound at first. In fact, how about modeling the "hard" vector length as a thread_local global variable? That way there is exactly one valid vector length value at every point (defined by the value of the thread_local global variable of that exact name). There is no need for a "demanded vlen" analysis: the global variable yields the value immediately. The RISC-V backend can map the global directly to the vlen register. If a target does not support a re-configurable vector length (SVE), it is safe to run SSA construction during legalization and use explicit predication instead. You'd perform SSA construction only at the backend/legalization phase.
> Vice versa, coming from IR targeted at LLVM SVE, you can go the other way: run a demanded vlen analysis and encode the vector length explicitly in the program. Changes to vlen are expensive and should be rare anyway.

This was in response to my suggestion to model setvl with predicates; I've withdrawn the idea. The vscale intrinsic is enough to represent 'maxvl', and based on the IR samples I've seen for RVV, a setvl intrinsic would return the dynamic length in order to correctly update offset/induction variables.
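
A hedged sketch of how that could look in a strip-mined loop (the intrinsic name and signature are illustrative, not a settled API): setvl returns the number of elements actually processed this iteration, and that value advances the induction variable:

  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %rem = sub i64 %n, %i
    ; request up to %rem elements; the target may grant fewer
    %vl = call i64 @llvm.vla.setvl.i64(i64 %rem)
    ; ... vector body operating on %vl elements ...
    %i.next = add i64 %i, %vl
    %again = icmp ult i64 %i.next, %n
    br i1 %again, label %loop, label %exit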

> ; explicit vlen_state modelling in RV could look like this:
> 
> @vlen_state = thread_local global token ; this gives AA a fixed point to constrain vlen-dependent operations
> 
> llvm.vla.setvl(i32 %n)                  ; implicitly writes-only %vlen_state
> i32 llvm.vla.getvl()                    ; implicitly reads-only %vlen_state
> 
> llvm.vla.fadd.f64(f64, f64)           ; implicitly reads-only %vlen_state
> llvm.vla.fdiv.f64(f64, f64)           ; .. same
> 
> ; this implements the "speculative" load mentioned in the quote above (writes %vlen_state. I suppose it also reads it first?)
> <scalable 1 x f64> llvm.riscv.probe.f64(%ptr)

Having separate getvl and setvl intrinsics may work nicely, but I'll leave that to Robin to decide.
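
One benefit of the split, as a speculative example reusing the intrinsic names from your sketch: a pass could save and restore the active length around a region that shrinks it, without having to reconstruct it:

  %saved = call i32 @llvm.vla.getvl()   ; capture the current length
  call void @llvm.vla.setvl(i32 %n)     ; shrink it for a tail region
  ; ... operations at the reduced length ...
  call void @llvm.vla.setvl(i32 %saved) ; restore on exit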

> By relying on memory dependence, this also implies that arithmetic operations can be re-ordered freely as long as vlen_state does not change between them (SLP, "loop mix (CGO16)", ..).
> Regarding function calls, if the callee does not have the 'inherits_vlen' attribute, the target can use a default value at function entry (max width or "undef"). Otherwise, the vector length needs to be communicated from caller to callee. However, the `vlen_state` variable already achieves that for a first implementation.

I got the impression that the RVV team wanted to be able to reconfigure registers for each function (and therefore potentially change the maximum vector length and the number of available registers). If a function must be called from inside a vectorized loop, then I think maxvl/vscale has to match between caller and callee, and the callee must not reconfigure registers. I suspect there will be a complicated cost model to decide whether to change configuration or stick with a default of all registers enabled.
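
A hedged sketch of how the proposed attribute could spell out that contract (exact spelling and semantics still to be decided):

  ; the callee promises not to reconfigure the vector unit, so the
  ; caller's maxvl/vscale remains valid across the call
  define void @vec_helper(<scalable 4 x i32> %v) inherits_vlen {
    ret void
  }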

> Last but not least, thank you all for working on this! I am really looking forward to playing around with vla architectures in LLVM.

Glad to hear it; there is an SVE emulator[1] available so once we've managed to get some code committed you'll be able to try some of this out, at least on one of the architectures.

[1] https://developer.arm.com/products/software-development-tools/hpc/arm-instruction-emulator


