[llvm-dev] [RFC] Vector Predication

Tue Feb 5 04:28:30 PST 2019

On 2/5/19 12:06 PM, Bruce Hoult wrote:
> On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
>> I think this is the usual mixup of AVL and MVL.
>>
>> AVL: is part of the predicate and can change between vector operations
>> just like a mask can (light weight).
>>
>> MVL: Is the physical vector register length and can be re-configured per
>> function (RVV only atm) - (heavy weight, stop-the-world instruction).
>>
>> The vectorlen parameter in EVL intrinsics is for the AVL.
> Unless I misunderstand, this doesn't describe RVV correctly, although
> this is understandable as the spec has moved around a bit in the last
> six or twelve months as it's gotten closer to being set in stone.
>
> The way it has ended up (very unlikely to change now) is:
>
> - any given RVV vector unit has 32 registers, each with the same
> fixed length in bits.
>
> - the vector unit is configured by the VSETVL[I] instruction which has
> two arguments: 1) the requested AVL, and 2) the vtype (vector type).
>
> - The vtype is an integer with several small fields, of which two are
> currently defined (the other bits must be zero). The fields are the
> Standard Element Width and VLMul. SEW can be any power of 2 from 8
> bits up to some implementation-defined maximum (1024 bits absolute
> maximum). VLMul says that you don't actually need 32 distinct vector
> variables in your current loop/function and you're willing to trade
> number of registers for a larger MVL. So, you can gang together each
> even/odd register pair into 16 longer registers (named 0,2,4...30), or
> you can gang together groups of four or at most eight registers.
>
> - the current MVL -- the maximum number of elements in a vector
> register -- is the hardware register length, multiplied by the VLMul
> field in vtype, divided by the SEW field in vtype.
>
> - the AVL is normally the smaller of MVL and the requested AVL (but see [1]).
>
> - only two things can change AVL: the VSETVL[I] instruction, and a
> special kind of memory load: "Unit-stride First-Fault Loads" if the
> load crosses a protection boundary and the tail of the vector is
> inaccessible. This kind of load is relatively uncommon and exists so
> you can vectorise things where the end of the application vector is
> data-dependent rather than counted. The canonical example is
> strlen()/strcpy(). For most code you can ignore it and say the AVL
> changes only when you execute VSETVL[I].
>
> - any time the program uses VSETVL[I] *both* the MVL and the AVL can change.
>
> - the common case is a loop with the vtype in an immediate VSETVLI at
> the head of the loop. In this case, the AVL potentially changes in
> every iteration of the loop (but usually only in the last one or two
> iterations). As the vtype is in an immediate it can't change from
> iteration to iteration. But it's common for two loops in the same
> function to use different vtype, and so different MVL, because the
> loops might either operate on different data types, or need a
> different number of vector variables in the loop, or both.
>
> - VSETVL[I] is *not* heavyweight, even if it changes the MVL. It's
> quite ok to execute it as much as you want -- even before every vector
> instruction if you want. That would be pretty unusual, and I think
> falls more into the "clever hand-written code" area than into anything
> a compiler is likely to want to generate from C loops, although it's
> certainly possible.
>
> Here's an example:
>
> void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
>    for (size_t i=0; i<n; ++i)
>      dst[i] += a[i] * b[i];
> }
>
> If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you
> might want to compile this to:
>
> # args n in a0, dst in a1, a in a2, b in a3, AVL in a4
> foo:
>      vsetvli a4, a0, vsew32,vlmul4  # vtype = 32-bit integer vectors, AVL in a4
>      vlw.v v0, (a2)          # Get 32b vector a into v0-v3
>      vlw.v v4, (a3)          # Get 32b vector b into v4-v7
>      slli a5, a4, 2             # multiply AVL by element size 4 bytes
>      add a2, a2, a5        # Bump pointer a
>      add a3, a3, a5        # Bump pointer b
>      vwmul.vv v8, v0, v4   # 64b result in v8-v15
>
>      vsetvli zero, a0, vsew64,vlmul8  # Operate on 64b values, discard new AVL as it's the same
>      vld.v v16, (a1)         # Get 64b vector dst into v16-v23
>      vadd.vv v16, v16, v8 # add 64b elements in v8-v15 to v16-v23
>      vsd.v v16, (a1)          # Store vector of 64b
>      slli a5, a4, 3               # multiply AVL by element size 8 bytes
>      add a1, a1, a5        # Bump pointer dst
>      sub a0, a0, a4        # subtract AVL from n to get remaining count
>      bnez a0, foo         # Any more?
>      ret
>
> The alternative of course is to set up for 64 bit elements at the
> outset, let the two vlw.v's for a and b widen the 32 bit loads into 64
> bit elements, then do 64x64->64 multiplies. The code would be two
> instructions shorter, saving one of the vsetvli (4 bytes) and one of
> the shifts (2 bytes).
>
> Assuming for the moment a 512 bit (64 byte) vector register size
> (total vector register file 2 KB), this function initially sets the
> MVL to 64 (2048 bits divided into 32-bit elements). The widening
> multiply produces 64 64-bit elements. The second half of the loop then
> sets the element size to 64 bits and doubles the vlmul, so the MVL is
> still 64 (4096 bits divided into 64-bit elements). The load, add, and
> store of dst then takes place using 64 bit calculations.
>
> Except on the last iteration [1] the AVL will be the same as the MVL.
> Both will change (in bits, not in number of elements in this case)
> twice in each loop iteration.
>
> [1] if on the 2nd to last iteration there are, say, 72 elements left,
> the vsetvli instruction might choose to return an AVL of 36 elements,
> leaving 36 for the last iteration, rather than doing 64 and then
> leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and
> 32 depending on what suits that particular hardware. Or maybe it will
> equalise the last three or four or more iterations. The main rule is
> the AVL must decrease monotonically.
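
The MVL/AVL arithmetic described above can be sketched in C. This is only
an illustrative model: vconfig, mvl, vsetvl_min and vsetvl_balanced are
hypothetical names, not part of RVV or LLVM, and the balancing policy is
just one of the choices footnote [1] allows hardware to make.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only; these names are not part of RVV or LLVM. */
typedef struct {
    size_t vlen_bits; /* hardware length of one vector register, in bits */
    size_t sew_bits;  /* vtype SEW field: element width in bits */
    size_t vlmul;     /* vtype VLMul field: 1, 2, 4 or 8 registers ganged */
} vconfig;

/* MVL = hardware register length * VLMul / SEW. */
static size_t mvl(vconfig c) {
    assert(c.sew_bits != 0);
    return c.vlen_bits * c.vlmul / c.sew_bits;
}

/* Simplest choice: AVL = min(requested AVL, MVL). */
static size_t vsetvl_min(vconfig c, size_t requested) {
    size_t m = mvl(c);
    return requested < m ? requested : m;
}

/* One possible balancing policy in the spirit of footnote [1]:
   split a tail of between MVL and 2*MVL elements into two halves. */
static size_t vsetvl_balanced(vconfig c, size_t requested) {
    size_t m = mvl(c);
    if (requested <= m) return requested;
    if (requested < 2 * m) return (requested + 1) / 2;
    return m;
}
```

With 512-bit registers, the configurations {512, 32, 4} and {512, 64, 8}
both give an MVL of 64 elements, which matches the example: the register
group doubles in bits while the element count stays the same.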

Thank you for the detailed explanation! I wasn't aware of the current 
state of RVV in that regard.

This seems to imply that restricting MVL changes to function boundaries
(as discussed in
https://lists.llvm.org/pipermail/llvm-dev/2018-April/122517.html)
is now moot.
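
To make the consequence concrete, here is a scalar C model of the quoted
loop in which the vector configuration is switched twice per iteration.
It is a sketch under assumptions: model_vsetvl and foo_model are
hypothetical names, the register length is fixed at 512 bits as in the
example, and model_vsetvl always returns min(requested, MVL).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for vsetvli: returns min(requested, MVL). */
static size_t model_vsetvl(size_t requested, size_t vlen_bits,
                           size_t sew_bits, size_t vlmul) {
    size_t mvl = vlen_bits * vlmul / sew_bits;
    return requested < mvl ? requested : mvl;
}

/* Scalar model of the strip-mined loop; 512-bit registers are assumed. */
static void foo_model(size_t n, int64_t *dst,
                      const int32_t *a, const int32_t *b) {
    enum { VLEN_BITS = 512 };
    while (n != 0) {
        /* First configuration: SEW=32, VLMul=4, so MVL = 64. */
        size_t avl = model_vsetvl(n, VLEN_BITS, 32, 4);
        int64_t prod[64]; /* one register group of 64-bit products */
        for (size_t i = 0; i < avl; ++i)
            prod[i] = (int64_t)a[i] * (int64_t)b[i]; /* widening multiply */

        /* Second configuration: SEW=64, VLMul=8, MVL is still 64. */
        size_t avl64 = model_vsetvl(n, VLEN_BITS, 64, 8);
        assert(avl64 == avl); /* same element count, twice the bits */
        for (size_t i = 0; i < avl64; ++i)
            dst[i] += prod[i];

        /* Bump the pointers and subtract the AVL from the remaining count. */
        a += avl;
        b += avl;
        dst += avl;
        n -= avl;
    }
}
```

The register group size in bits changes inside the loop body, so a model
that fixes MVL once per function could not express this loop.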

-- 

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://compilers.cs.uni-saarland.de/people/moll