[llvm-dev] [RFC] Code generation for RISC-V V-extension

Fri Oct 16 01:02:18 PDT 2020

Hi all,

(I uploaded a PDF rendered of this email just in case some of its sections
are difficult to read
https://pm.bsc.es/~rferrer/tmp/rfc-vector-code-generation.pdf )

This is a RFC on the strategy we have been developing at the Barcelona
Supercomputing Center (BSC) to generate code for the RISC-V V-extension.

We (BSC and SiFive) have prepared a very first patch showing the strategy
explained in this email, found here

- https://reviews.llvm.org/D89449

Feedback is very welcome as we see this change as a foundation on which to
build the rest of the V-extension codegen support.

## Introduction

The V-extension of RISC-V brings vector computation capabilities to the
RISC-V ecosystem. The full details of the ISA specification are found here:
https://github.com/riscv/riscv-v-spec This proposal is based on draft
version 0.9 found at https://github.com/riscv/riscv-v-spec/releases/tag/0.9.

The V-extension is rather flexible so it is difficult to summarize it in an
reasonably sized email. Please refer to the above specification for all the
details. I'll try to describe here all the features that are relevant for
code generation.

The ISA provides 32 vector registers `v0`,`v1`, ..., `v31`. The size in
bits of each vector register is an implementation-specific parameter called
`VLEN` and must be a power of two. Vector registers are partitioned (i.e.
densely packed) in elements whose size in bits is a power of two, ranging
from 8 to `ELEN`. `ELEN` is also a power of two and `ELEN`<=`VLEN`.

Due to encoding constraints, not all the operands of a vector operation are
encoded in the instructions themselves. Two CSR (control and status
registers) are used instead:

- `vl`: the number of elements being operated, called the vector length. A
vector instruction will operate the elements `0` to `vl-1`
- `vtype`: the vector type. This register encodes the element size of the
operation, called the standard element width (`sew`) and a vector grouping
mechanism called the length multiplier (`lmul`)

The length multiplier (`lmul`) can take values 1, 2, 4, 8, 1/2, 1/4, 1/8.

- When `lmul=1` the vector instructions operate on the (32) vector
registers.
- When `lmul>1` the vector instructions operate on vector groups (see
below) encoded in the instruction using the lowest numbered vector register
of the group.
- When `lmul<1` the vector instructions operate on the lowest half, quarter
or eighth of a vector register.

A vector group of size `lmul>1` is the set of consecutive vector registers
`v{lmul*i}`, `v{lmul*i+1}`, ... , `v{lmul*(i + 1) - 1}`. So

- `lmul=2` has 16 groups: `v0`, `v2`, `v4`, `v6`, …, `v24`, `v26`, `v28`,
`v30`
- `lmul=4` has 8 groups: `v0`, `v4`, `v8`, `v12`, `v16`, `v20`, `v24`, `v28`
- `lmul=8` has 4 groups: `v0`, `v8`, `v16`, `v24`

For instance, under `lmul=4`, a vector group `v4` operand includes vector
registers `v4`, `v5,` `v6` and `v7` as if they had been concatenated as a
four times larger vector register. `lmul` is useful to align the number of
elements in vector codes whose element sizes are different (say when
combining vectors of 32- and 64-bit elements) or when doing *widenings*
(zero, sign or fp extensions) or *narrowings* (truncations).

A program must ensure that both `vl` and `vtype` have the correct values
for a vector operation before executing a vector instruction. This is done
using the `vsetvli` instruction.

```
vsetvli rdest, rsrc, sew,lmul,tx,mx     # We discuss tx,mx below
```

`rsrc` is the application vector length (AVL) and will be used when setting
the `vl`. `rdest` is updated with the value of `vl`. The spec allows some
latitude here but a simple functional model of what vsetvli does is the
following

```
vl ← min(rsrc, lmul*VLEN/sew)
vtype ← sew,lmul,...
```

`vsetvli` has a couple of special cases

- When `rsrc` is `x0` and `rdest` is not `x0` then `vl ← lmul*VLEN/sew`. In
other words, sets `vl` to be the maximum vector length for a given `lmul`
and `sew`. This is useful for whole-register operations.

  ```
  vsetvli t0, x0, e32,m2,… # vl ← 2*VLEN/64
                           # vtype ← e32,m2,...
                           # t0 ← vl
  ```

- When both `rsrc` and `rdest` are `x0` (the hard-coded zero of RISC-V)
then `vl` is used as the AVL. This can be used to change the `vtype` when
we know the ratio `sew/lmul` will be preserved.

  ```
  vsetvli x0, x0, e64,m4,… # changing vtype from e32,m2 to e64,m4 is OK (vl
is unchanged)
                           # vtype ← e64,m4,...
  ```

Two simple examples (register `x10` contains the AVL)

- Add two 32-bit element vectors under `lmul=1`

  ```
  vsetvli x0, x10, e32,m1,…
  vadd.vv v1, v2, v3  # v1[0:vl-1] ← v2[0:vl-1] + v3[0:vl-1]
                      # where v[i:j] is all v[x] where i <= x <= j
  ```

- Add two 64-bit element vectors under `lmul=2`

  ```
  vsetvli x0, x10, e64,m2,…
  vadd.vv v2, v4, v6  # Updates v2 and v3. Reads v4, v5 and v6, v7
                      # v2[0:x-1] ← v4[0:x-1] + v6[0:x-1] where x =
min(VLEN/64, vl)
                      # v3[0:y-1] ← v5[0:y-1] + v7[0:y-1] where y = vl - x
  ```

Note: it is not necessary to emit a `vsetvli` instruction before every
vector instruction if the current `vl` and `vtype` are still suitable for
the intended vector operation. Removing redundant `vsetvli` instructions is
not part of this proposal.

### Masks and tails

RISC-V Vector extension supports masks in almost all of its instructions.
There are no distinguished mask registers, instead vector registers can be
used to represent masks. However an instruction whose execution is masked
can only use the `v0` register as the mask operand. Elements of the
destination register that are masked off by the mask are called _inactive
elements_ (i.e. masked-off)

A vector instruction can be executed under a `vl` setting where `vl` <
`lmul*VLEN/sew`. Elements of the destination register past the current `vl`
are called the _tail elements_.

There are two modes for the tail and inactive elements

- undisturbed, in which the element of the destination register is left
unmodified
- agnostic, in which the elements of the destination register is either
left unmodified or all its bits set to 1 (for debugging purposes). In this
mode we cannot assume anything about the bits of those elements

`tx,mx` in vsetvli above correspond to these two policies and can be
combined in 4 ways:

- `tu,mu`. Both tail and inactive are left undisturbed
- `ta,ma`. Both tail and inactive are agnostic
- `tu,ma`. Tail is left undisturbed and inactive are agnostic
- `ta,mu`. Tail is agnostic and inactive are left undisturbed.

It may be reasonable to prefer the mode that is the least defined (such as
`ta,ma`) for the intended IR operation. Some select operations, however,
can be folded using masks in instructions (that would lead us to `ta,mu`).
However, for the sake of simplicity and given the early stage of this
works, we are going to use `tu,mu` mode everywhere. Relaxing to agnostic
where possible is postponed as later work.

## Mapping to LLVM IR Types

It seems reasonable to make `VLEN` an unknown constant to the compiler.
This leads us to use scalable vector types (such as the ones used by Arm
SVE target).

However in contrast to SVE, RISC-V V-extension does not prescribe a minimal
vector size (other than `ELEN<=VLEN`). This minimal vector size impacts the
LLVM types we can use because it impacts `vscale`. We propose that LLVM
targets only `VLEN>=64`. We believe other permissible values under the
V-ext spec (such as `VLEN=8`) are not realistically useful. Also we propose
that LLVM supports only `ELEN=32` and `ELEN=64`.

These constraints allow us to define `vscale` as `VLEN/64`. This makes the
LLVM IR types stable between the two `ELEN`s considered.

|                  | lmul=⅛       | lmul=¼         | lmul=½          |
lmul=1           | lmul=2           | lmul=4           | lmul=8           |
| ---------------- | ------------ | -------------- | --------------- |
---------------- | ---------------- | ---------------- | ---------------- |
| i64 (ELEN=64)    | N/A          | N/A            | N/A             | <v x
1 x i64>    | <v x 2 x i64>    | <v x 4 x i64>    | <v x 8 x i64>    |
| i32              | N/A          | N/A            | <v x 1 x i32>   | <v x
2 x i32>    | <v x 4 x i32>    | <v x 8 x i32>    | <v x 16 x i32>   |
| i16              | N/A          | <v x 1 x i16>  | <v x 2 x i16>   | <v x
4 x i16>    | <v x 8 x i16>    | <v x 16 x i16>   | <v x 32 x i16>   |
| i8               | <v x 1 x i8> | <v x 2 x i8>   | <v x 4 x i8>    | <v x
8 x i8>     | <v x 16 x i8>    | <v x 32 x i8>    | <v x 64 x i8>    |
| double (ELEN=64) | N/A          | N/A            | N/A             | <v x
1 x double> | <v x 2 x double> | <v x 4 x double> | <v x 8 x double> |
| float            | N/A          | N/A            | <v x 1 x float> | <v x
2 x float>  | <v x 4 x float>  | <v x 8 x float>  | <v x 16 x float> |
| half             | N/A          | <v x 1 x half> | <v x 2 x half>  | <v x
4 x half>   | <v x 8 x half>   | <v x 16 x half>  | <v x 32 x half>  |

(Read `<v x k x ty>` as `<vscale x k x ty>`)

One downside of this design is that doesn't allow vectors of `i128` (this
is, `ELEN=128`). In that case `vscale` would have to be 1/2 under `lmul=1`.
This type (and its fp counterpart `float128`) are not that common and in
case of extreme necessity types for `lmul=2` could be used instead.

As for mask vectors, they are physically represented using a layout of
densely packed bits in a vector register. It seems reasonable to map them
to the following LLVM IR types

```
<vscale x 1 x i1>
<vscale x 2 x i1>
<vscale x 4 x i1>
<vscale x 8 x i1>
<vscale x 16 x i1>
<vscale x 32 x i1>
<vscale x 64 x i1>
```

Two types with the same ratio `sew`/`lmul` will have the same related mask
type. For instance, two different comparisons one under `sew=64`, `lmul=2`
and the other under `sew=32`, `lmul=1` will both generate a mask `<vscale x
2 x i1>`.

## Register classes

Given the vector registers and the different sizes of vector groups, we
propose to define 4 register classes.

- `VR` for vector registers (`v0`, `v1`, ..., `v31`). `lmul<=1` and mask
registers
- `VRM2` for vector groups of length 2, this is `lmul=2` (`v0m2`, `v2m2`,
..., `v30m2`)
- `VRM4` for vector groups of length 4 (`v0m4`, `v4m4`, ..., `v28m4`)
- `VRM8` for vector groups of length 8 (`v0m8`, `v8m8`, `v16m8`, `v24m8`)

So far it looks like `lmul<1` types and mask types do not benefit from
having a dedicated class, so `VR` would be used in that case.

## What LLVM IR inputs we care bout

There are three kinds of inputs we believe are important to be selected as
vector instructions of V-extension.

- LLVM IR instructions that operate on vectors (whole-register operations)

  ```
  vadd <vscale x 2 x i32> %a, %b
  ```

- Vector Predication intrinsics (explicit vector length vector operations)

  ```
  call <vscale x 2 x i32> @llvm.vp.add(..., i32 %evl)?
  ```

  See http://www.llvm.org/docs/Proposals/VectorPredication.html

- RISC-V V-extension target-specific intrinsics

  ```
  call <vscale x 2 x i32> @llvm.rvv.vadd.nxv2i32(...)
  ```

  See https://github.com/riscv/rvv-intrinsic-doc

## Challenges for code generation

The actual V-extension instructions are not immediately useful for the code
generation process (= the gap between LLVM IR and the emission of the
instructions)

- Two instances of the same instruction have different semantics depending
on the precise values of `vtype` and `vl`.

  - To the best of our knowledge `MachineInstrs` cannot be overloaded per
register class

- The correct values of `vtype` and `vl` must be set before executing a
vector instruction

  - `MachineInstrs` may need more information than just the encoded
operands, at least during selection

Other targets also have to solve this problem of "having the right
configuration/state before executing an instruction". As far as I
understood from the documentation and comments in llvm-dev, Intel AMX (see
http://lists.llvm.org/pipermail/llvm-dev/2020-September/144848.html) also
has some configuration step before being able to use the intructions that
implement some tensor operations. Looks like our work started earlier than
Intel AMX so their approach here might be interesting to compare in case
either design/approach can be improved.

Another example of a "needs configuration/set state" target (even if
deprecated today and probably of little interest to LLVM) is the Arm VFP
extension found in some legacy Arm cores such as the one in the Raspberry
Pi 1. I'm pretty sure I'm missing other notable examples. If someone is
working on a target with a similar way of working, feel free to chime in!

## The actual proposal

(Thanks for reading up to here!)

1. Mirror all the RISC-V V-extension instructions (which are already on
upstream) that depend either on vl or vtype in pseudo instructions. One
pseudo instruction per `lmul`.

   1. Example: `VADD_VV` → `PseudoVADD_VV_M1`, `PseudoVADD_VV_M2`,
`PseudoVADD_VV_M4`, `PseudoVADD_VV_M8`, `PseudoVADD_VV_MF2`,
`PseudoVADD_VV_MF4`, `PseudoVADD_VV_MF8`,

2. Extend each pseudo instructions with additional operands.

   1. a merge operand that is used for undisturbed semantics (otherwise set
to `IMPLICIT_DEF`). This operand is tied to the destination. If this is an
actual value it entails `tu,mu` (see section above "Masks and tails")
   2. a vector mask operand for masked instructions (set to
`RISCV::NoRegister` if unmasked)
   3. a `sew` immediate operand
   4. a `GPR` containing the vector length of this operation (or
`RISCV::X0` if we want to use the maximum vector length)
   5. implicit uses of `vl` and `vtype` registers

4. Use the pseudo instructions in the SelectionDAG patterns.

   1. Patterns would, at least, choose the right `lmul` version, set the
proper `sew` immediate.
   2. Optionally set a merge operand if merge semantics is desired

5. Each pseudo instruction would also have a custom inserter
   1. The custom inserter emits a `vsetvli` instruction before the
instruction. The `lmul` is derived from the pseudo instruction.
   2. Now the `sew` and the `vl` operands are not useful anymore and can be
set to `-1`, `RISCV::NoRegister` respectively

The final result is that each operation is prepended with a `vsetvli` that
sets both the `sew`, `lmul` and the `vl`.

While correct, this is naive, so later passes can remove the redundant
`vsetvlis`. This is out of the scope of this RFC.

Finally, at the end of CodeGen

5. AsmPrinter emits the Pseudo Instructions as their corresponding real
RISC-V V-extension MCInst instructions
   1. We can’t do that earlier because we want to correctly keep track the
liveness of registers (their register classes). This is important for
`LMUL>1`.

### Example of a pseudo-instruction

This is a slightly simplified tablegen definition of a pseudo instruction

```
let Constraints = “$rd = $merge”, Uses = [VL, VTYPE],
  usesCustomInserter = 1, BaseInstr = VADD_VV in
def PseudoVADD_VV_M2 : Pseudo<(outs VRM2:$rd),
       (ins VRM2:$merge, VRM2:$rs2, VRM2:$rs1, VMaskOp:$vm,
          GPR:$vl, ixlenimm:$sew), []>;
```

Tablegen allows generating systematically (for all `lmul`s) the pseudo
instructions and link them to the real instruction.

`GenericTable` is used to create a table of the pseudo instructions that
contains information about the extra operands.

### Example of the codegen flow

Let’s consider a very simple case using a whole-register op (this example
uses `lmul=2`)

```
%c = add <vscale x 4 x i32> %a, %b
```

Input ISEL DAG:

```
t5: nxv4i32 = add t2, t4
```

Output ISEL DAG:

```
t5: nxv4i32 = PseudoVADD_VV_M2 IMPLICIT_DEF:nxv4i32, t2, t4,
Register:nxv4i1 $noreg, Register:i64 $x0, TargetConstant:i64<32>
```

`MachineInstrs` after `InstrEmitter` (before `CustomInserter`)

- `%3:vrm2 = IMPLICIT_DEF`
- `early-clobber %2:vrm2 = PseudoVADD_VV_M2 %3:vrm2(tied-def 0), %0:vrm2,
%1:vrm2, $noreg, $x0, 32, implicit $vl, implicit $vtype`

(If you wonder about the `early-clobber` it is needed to fulfill some
constraints between sources and destination registers under `lmul>1`)

`MachineInstrs` after `CustomInserter`

- `%3:vrm2 = IMPLICIT_DEF`
- `dead %4:gprnox0 = PseudoVSETVLI $x0, 9, implicit-def $vl, implicit-def
$vtype`
- `early-clobber %2:vrm2 = PseudoVADD_VV_M2 %3:vrm2(tied-def 0), %0:vrm2,
%1:vrm2, $noreg, $noreg, -1, implicit $vl, implicit $vtype`

Now optimization passes could remove unnecesary `PseudoVSETVLI`s (if
neither `vtype` or `vl` change) or use the special form that only sets
`vtype` (if `vl` stays the same). Again, not covered in this RFC.

Post Register Allocation:

- `dead renamable $x10 = PseudoVSETVLI $x0, 9, implicit-def $vl,
implicit-def $vtype`
- `early-clobber renamable $v2m2 = PseudoVADD_VV_M2 undef renamable
$v2m2(tied-def 0), killed renamable $v16m2, killed renamable $v18m2,
$noreg, $noreg, -1, implicit $vl, implicit $vtype`

Finally `AsmPrinter` lowers the PseudoInstructions into real MCInsts,
discarding unneeded operands.

- `<MCInst #3745 <MCOperand Reg:45> <MCOperand Reg:35> <MCOperand Imm:9>>`
  `vsetvli    a0, zero, e32,m2`
- `<MCInst #3468 <MCOperand Reg:5> <MCOperand Reg:19> <MCOperand Reg:21>
<MCOperand Reg:0>>`
  `vadd.vv    v2, v16, v18`

## Discussion

Some cons:

- We need to mirror all the RVV instructions into pseudo instructions (at
least 7 pseudos for each RVV instruction). MachineInstrs do not seem to be
overloadable per register class.
- Usefulness of `vl` and `sew` operands in the pseudo instructions seems
rather limited: only to communicate between `InstrEmitter` and the
`CustomInserter`.

Pros:

- We can express what we need (`vl`, `vtype`) and all what we want (merging
semantics, whole register operations, explicit vl, optimize vsetvl, etc).

## Acknowledgements

We want to acknowledge the SiFive team who has been collaborating with us.
They have provided very valuable feedback during the internal discussions
we have had while shaping this proposal.

This work has been done as part of the European Processor Initiative
project. The European Processor Initiative (EPI) (FPA: 800928) has received
funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement EPI-SGA1: 826647

Kind regards,
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201016/287d40ef/attachment-0001.html>