[llvm-dev] [RFC] Vector Predication

Fri Feb 1 02:58:48 PST 2019

On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
N could be as small as 1 though support for smaller than i8 is
optional. (no distinction is drawn between int and float in the vector
configuration -- that's up to the operations performed)

> We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions, allowing most future or current standard/non-standard extensions to be vectorized, rather than having to wait for additional extensions to have vector versions added to the V extension (one reason we are not using the V extension instead), such as the B extension.

Do you mean instructions following the standard 48-bit encoding
scheme, that happen to contain a standard 32 bit instruction as a
payload?

>Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.

No real difference. The standard RISC-V Vector extension is intended
to allow exactly those changes to the vector configuration every few
instructions. It's mostly the microcontroller people coming from
DSP/SIMD who want to do that, so it's up to them to make that
efficient on their cores -- they might even do macro-op fusion on it.
Big OoO/Supercomputer style code compiled from C/FORTRAN in general
doesn't want to do that kind of thing.

Example code that changes the configuration within a loop to do 16 bit
loads, 16x16->32 multiply, then 32 bit shift and store:

# Example: Load 16-bit values, widen multiply to 32b, shift 32b result
# right by 3, store 32b values.
loop:
    vsetvli a3, a0, vsew16,vlmul4  # vtype = 16-bit integer vectors
    vlh.v v4, (a1)          # Get 16b vector
      slli t1, a3, 1
      add a1, a1, t1        # Bump pointer
    vwmul.vs v8, v4, v1     # 32b in <v8--v15>

    vsetvli x0, a0, vsew32,vlmul8  # Operate on 32b values
    vsrl.vi v8, v8, 3
    vsw.v v8, (a2)          # Store vector of 32b
      slli t1, t1, 2
      add a2, a2, t1        # Bump pointer
      sub a0, a0, a3        # Decrement count
      bnez a0, loop         # Any more?

(this example is probably only useful if 16x16->32 mul is
significantly faster than 32x32->32, otherwise you'd just load and
sign extend the 16 bit data into 32 bit elements)

A note on vector register numbering. There are registers 0..31. If you
specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid
register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are
valid.