[llvm-dev] [RFC] Vector Predication

Bruce Hoult via llvm-dev llvm-dev at lists.llvm.org
Fri Feb 1 02:58:48 PST 2019


On Fri, Feb 1, 2019 at 2:09 AM Jacob Lifshay <programmerjake at gmail.com> wrote:
> Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
N could be as small as 1, though support for elements smaller than i8
is optional. (No distinction is drawn between int and float in the
vector configuration -- that's up to the operations performed.)

> We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions, allowing most future or current standard/non-standard extensions to be vectorized, rather than having to wait for additional extensions to have vector versions added to the V extension (one reason we are not using the V extension instead), such as the B extension.

Do you mean instructions following the standard 48-bit encoding
scheme, that happen to contain a standard 32 bit instruction as a
payload?
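
For reference, the length of a standard RISC-V instruction is determined
by the low bits of its first 16-bit parcel, so a 48-bit instruction is
one whose low six bits are 011111. A minimal Python sketch of that check
(the function name is hypothetical, and formats longer than 64 bits are
deliberately not handled):

```python
def insn_length_bytes(first_parcel):
    """Return the instruction length in bytes, given the lowest 16-bit parcel.

    Hypothetical helper covering only the 16/32/48/64-bit cases of the
    standard RISC-V length encoding.
    """
    if first_parcel & 0b11 != 0b11:
        return 2  # compressed 16-bit instruction
    if first_parcel & 0b11100 != 0b11100:
        return 4  # standard 32-bit instruction
    if first_parcel & 0b111111 == 0b011111:
        return 6  # 48-bit instruction
    if first_parcel & 0b1111111 == 0b0111111:
        return 8  # 64-bit instruction
    raise ValueError("longer encodings are not handled in this sketch")
```

A 48-bit instruction carrying a 32-bit payload would still have to start
with a parcel that this check classifies as 6 bytes long.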

> Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.

No real difference. The standard RISC-V Vector extension is intended
to allow exactly those changes to the vector configuration every few
instructions. It's mostly the microcontroller people coming from
DSP/SIMD who want to do that, so it's up to them to make that
efficient on their cores -- they might even do macro-op fusion on it.
Big OoO/Supercomputer style code compiled from C/FORTRAN in general
doesn't want to do that kind of thing.

Example code that changes the configuration within a loop to do 16 bit
loads, 16x16->32 multiply, then 32 bit shift and store:

# Example: Load 16-bit values, widen multiply to 32b, shift 32b result
# right by 3, store 32b values.
loop:
    vsetvli a3, a0, vsew16,vlmul4  # vtype = 16-bit integer vectors
    vlh.v v4, (a1)          # Get 16b vector
      slli t1, a3, 1
      add a1, a1, t1        # Bump pointer
    vwmul.vs v8, v4, v1     # 32b in <v8--v15>

    vsetvli x0, a0, vsew32,vlmul8  # Operate on 32b values
    vsrl.vi v8, v8, 3
    vsw.v v8, (a2)          # Store vector of 32b
      slli t1, a3, 2        # Bytes of 32b output (4 per element)
      add a2, a2, t1        # Bump pointer
      sub a0, a0, a3        # Decrement count
      bnez a0, loop         # Any more?

(This example is probably only useful if a 16x16->32 multiply is
significantly faster than 32x32->32; otherwise you'd just load and
sign-extend the 16-bit data into 32-bit elements.)
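
To make the data flow concrete, here is a rough scalar model of the loop
in Python. It is only a sketch under assumptions: VLMAX is a made-up
strip length standing in for whatever vsetvli returns, the multiplier is
assumed to be a single signed 16-bit scalar (the .vs form), data is
little-endian, and the function name is hypothetical.

```python
import struct

VLMAX = 8  # hypothetical elements per strip; real hardware decides this


def widen_mul_shift(src, scalar, vlmax=VLMAX):
    """Model: load i16, widening multiply to 32b, logical shift right 3, store 32b."""
    n = len(src) // 2          # a0: number of 16-bit elements remaining
    out = bytearray(4 * n)     # output buffer of 32-bit elements
    in_off = 0                 # a1: input byte offset
    out_off = 0                # a2: output byte offset
    while n > 0:
        vl = min(n, vlmax)     # a3: elements handled in this strip
        elems = struct.unpack_from("<%dh" % vl, src, in_off)  # 16-bit loads
        in_off += 2 * vl       # input advances 2 bytes per element
        # Signed 16x16->32 product kept as a 32-bit pattern, then logical shift.
        results = [((x * scalar) & 0xFFFFFFFF) >> 3 for x in elems]
        struct.pack_into("<%dI" % vl, out, out_off, *results)  # 32-bit stores
        out_off += 4 * vl      # output advances 4 bytes per element
        n -= vl                # decrement count
    return bytes(out)
```

The input offset advances by 2 bytes per element and the output offset by
4 bytes per element, since the same number of elements is processed at
both widths in each strip.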

A note on vector register numbering. There are registers 0..31. If you
specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid
register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are
valid.
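
Put another way, with a register group of size LMUL the valid register
numbers are the multiples of LMUL, and each group covers LMUL consecutive
registers. A small Python sketch of that rule; the helper names are
hypothetical, and extending the rule to vlmul1 and vlmul2 is an
assumption rather than something stated above:

```python
NUM_VREGS = 32  # vector registers v0..v31


def valid_vregs(lmul):
    """Register numbers usable as operands for a given LMUL."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("LMUL must be 1, 2, 4 or 8 in this sketch")
    return list(range(0, NUM_VREGS, lmul))


def group_members(base, lmul):
    """Registers occupied by the group whose base register is `base`."""
    if base not in valid_vregs(lmul):
        raise ValueError("v%d is not a valid base register for LMUL=%d" % (base, lmul))
    return list(range(base, base + lmul))
```

For example, group_members(8, 8) covers v8 through v15, which matches
the "32b in <v8--v15>" comment in the loop.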

