[llvm-dev] [RFC] Vector Predication

Tue Feb 5 03:06:17 PST 2019

On Tue, Feb 5, 2019 at 1:23 AM Simon Moll <moll at cs.uni-saarland.de> wrote:
> I think this is the usual mixup of AVL and MVL.
>
> AVL: is part of the predicate and can change between vector operations
> just like a mask can (light weight).
>
> MVL: Is the physical vector register length and can be re-configured per
> function (RVV only atm) - (heavy weight, stop-the-world instruction).
>
> The vectorlen parameter in EVL intrinsics is for the AVL.

Unless I misunderstand, this doesn't describe RVV correctly, although
this is understandable as the spec has moved around a bit in the last
six or twelve months as it's gotten closer to being set in stone.

The way it has ended up (very unlikely to change now) is:

- any given RVV vector unit has 32 registers, all with the same fixed
length in bits.

- the vector unit is configured by the VSETVL[I] instruction which has
two arguments: 1) the requested AVL, and 2) the vtype (vector type).

- The vtype is an integer with several small fields, of which two are
currently defined (the other bits must be zero). The fields are the
Standard Element Width (SEW) and VLMul. SEW can be any power of 2 from 8
bits up to some implementation-defined maximum (1024 bits absolute
maximum). VLMul says that you don't actually need 32 distinct vector
variables in your current loop/function and you're willing to trade
number of registers for a larger MVL. So, you can gang together each
even/odd register pair into 16 longer registers (named 0,2,4...30), or
you can gang together groups of four or at most eight registers.

- the current MVL -- the maximum number of elements in a vector
register -- is the hardware register length, multiplied by the VLMul
field in vtype, divided by the SEW field in vtype.

- the AVL is the smaller of MVL and the requested AVL.

- only two things can change AVL: the VSETVL[I] instruction, and a
special kind of memory load: "Unit-stride First-Fault Loads" if the
load crosses a protection boundary and the tail of the vector is
inaccessible. This kind of load is relatively uncommon and exists so
you can vectorise things where the end of the application vector is
data-dependent rather than counted. The canonical example is
strlen()/strcpy(). For most code you can ignore it and say the AVL
changes only when you execute VSETVL[I].

- any time the program uses VSETVL[I] *both* the MVL and the AVL can change.

- the common case is a loop with the vtype in an immediate VSETVLI at
the head of the loop. In this case, the AVL potentially changes in
every iteration of the loop (but usually only in the last one or two
iterations). As the vtype is in an immediate it can't change from
iteration to iteration. But it's common for two loops in the same
function to use different vtype, and so different MVL, because the
loops might either operate on different data types, or need a
different number of vector variables in the loop, or both.

- VSETVL[I] is *not* heavyweight, even if it changes the MVL. It's
quite ok to execute it as much as you want -- even before every vector
instruction if you want. That would be pretty unusual, and I think
falls more into the "clever hand-written code" area than into anything
a compiler is likely to want to generate from C loops, although it's
certainly possible.
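
The MVL and AVL rules in the list above can be restated as a short C
sketch. The function names, and the idea of passing the register length
in as a parameter, are mine and purely illustrative; real hardware does
this inside VSETVL[I], and an implementation is allowed to grant fewer
elements than the plain minimum shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maximum number of elements per (grouped) register.
 *   vlen_bits: hardware register length in bits
 *   vlmul:     register grouping factor (1, 2, 4 or 8)
 *   sew_bits:  standard element width in bits (a power of 2, at least 8) */
size_t mvl_elems(size_t vlen_bits, size_t vlmul, size_t sew_bits)
{
    assert(sew_bits >= 8 && (sew_bits & (sew_bits - 1)) == 0);
    assert(vlmul == 1 || vlmul == 2 || vlmul == 4 || vlmul == 8);
    return vlen_bits * vlmul / sew_bits;
}

/* Simplest policy: the granted AVL is the smaller of MVL and the request. */
size_t granted_avl(size_t requested, size_t mvl)
{
    return requested < mvl ? requested : mvl;
}
```

With a 512-bit register, mvl_elems(512, 4, 32) and mvl_elems(512, 8, 64)
both come out at 64 elements.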

Here's an example:

void foo(size_t n, int64_t *dst, int32_t *a, int32_t *b){
  for (size_t i=0; i<n; ++i)
    dst[i] += a[i] * b[i];
}

If 32x32->64 multiplies are cheaper than 64x64->64 multiplies then you
might want to compile this to:

# args n in a0, dst in a1, a in a2, b in a3, AVL in a4
foo:
    vsetvli a4, a0, vsew32,vlmul4  # vtype = 32-bit integer vectors, AVL in a4
    vlw.v v0, (a2)          # Get 32b vector a into v0-v3
    vlw.v v4, (a3)          # Get 32b vector b into v4-v7
    slli a5, a4, 2             # multiply AVL by element size 4 bytes
    add a2, a2, a5        # Bump pointer a
    add a3, a3, a5        # Bump pointer b
    vwmul.vv v8, v0, v4   # 64b result in v8-v15

    vsetvli zero, a0, vsew64,vlmul8  # Operate on 64b values, discard new AVL as it's the same
    vld.v v16, (a1)         # Get 64b vector dst into v16-v23
    vadd.vv v16, v16, v8 # add 64b elements in v8-v15 to v16-v23
    vsd.v v16, (a1)          # Store vector of 64b
    slli a5, a4, 3               # multiply AVL by element size 8 bytes
    add a1, a1, a5        # Bump pointer dst
    sub a0, a0, a4        # subtract AVL from n to get remaining count
    bnez a0, foo         # Any more?
    ret
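
For anyone less used to reading the assembly, here is a scalar C model
of what the loop above does, one strip at a time. It is only a sketch:
foo_model and its mvl parameter are hypothetical (hardware derives MVL
from vtype), and the strip length uses the plain "smaller of MVL and
request" rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the strip-mined loop; mvl stands in for the hardware MVL. */
void foo_model(size_t n, int64_t *dst, const int32_t *a, const int32_t *b,
               size_t mvl)
{
    assert(mvl != 0);
    while (n != 0) {
        size_t avl = n < mvl ? n : mvl;      /* vsetvli a4, a0, ...          */
        for (size_t i = 0; i < avl; ++i)     /* one vector strip             */
            /* the cast mirrors vwmul.vv (32x32->64), then vadd.vv into dst */
            dst[i] += (int64_t)a[i] * b[i];
        a += avl;                            /* bump pointer a               */
        b += avl;                            /* bump pointer b               */
        dst += avl;                          /* bump pointer dst             */
        n -= avl;                            /* remaining count              */
    }
}
```

The result is the same for any nonzero mvl; only the number of strips
changes.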

The alternative of course is to set up for 64 bit elements at the
outset, let the two vlw.v's for a and b widen the 32 bit loads into 64
bit elements, then do 64x64->64 multiplies. The code would be two
instructions shorter, saving one of the vsetvli (4 bytes) and one of
the shifts (2 bytes).

Assuming for the moment a 512 bit (64 byte) vector register size
(total vector register file 2 KB), this function initially sets the
MVL to 64 (2048 bits divided into 32-bit elements). The widening
multiply produces 64 64-bit elements. The second half of the loop then
sets the element size to 64 bits and doubles the vlmul, so the MVL is
still 64 (4096 bits divided into 64-bit elements). The load, add, and
store of dst then takes place using 64 bit calculations.

Except on the last iteration [1] the AVL will be the same as the MVL.
Both will change (in bits, not in number of elements in this case)
twice in each loop iteration, once at each vsetvli.

[1] if on the 2nd to last iteration there are, say, 72 elements left,
the vsetvli instruction might choose to return an AVL of 36 elements,
leaving 36 for the last iteration, rather than doing 64 and then
leaving only 8 for the last iteration. Or maybe 48 and 24, or 40 and
32 depending on what suits that particular hardware. Or maybe it will
equalise the last three or four or more iterations. The main rule is
the AVL must decrease monotonically.
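
One way to picture the even split described in [1] is the following C
sketch. The function balanced_avl is hypothetical and shows just one
policy a vsetvli implementation might choose; as noted above, actual
hardware may split the tail differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vsetvli policy: split the last two strips evenly. */
size_t balanced_avl(size_t remaining, size_t mvl)
{
    assert(mvl != 0);
    if (remaining <= mvl)
        return remaining;            /* final strip: take what is left      */
    if (remaining < 2 * mvl)
        return (remaining + 1) / 2;  /* second to last: half, rounded up    */
    return mvl;                      /* plenty left: take a full strip      */
}
```

With an MVL of 64 and 72 elements left, this returns 36 and then 36, and
the granted AVL never increases from one iteration to the next.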