[llvm] [Docs][RISCV] Document RISC-V vector codegen (PR #96740)

Tue Jul 2 20:58:09 PDT 2024

https://github.com/lukel97 updated https://github.com/llvm/llvm-project/pull/96740

>From 830701601518a8ba2e58a303be01448154cc11a2 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 26 Jun 2024 15:38:22 +0800
Subject: [PATCH 01/10] [Docs][RISCV] Document RISC-V vector codegen

This is a revival of https://reviews.llvm.org/D142348, and attempts to document how RVV semantics can be expressed in LLVM IR as well as how codegen works in the backend.

Parts of this are taken from the original RFC https://lists.llvm.org/pipermail/llvm-dev/2020-October/145850.html, but I've largely rewritten this from the original differential revision to exclude explaining the specification itself and instead just focus on the LLVM specific bits. (I figured that there's better material available elsewhere for learning about RVV itself)

I've also updated it to include as much as I know about fixed vector codegen as well as the recent changes to vsetvli insertion. Let me know if I'm missing anything else that would be useful to document.
---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 285 +++++++++++++++++++++++
 llvm/docs/UserGuides.rst                 |   4 +
 2 files changed, 289 insertions(+)
 create mode 100644 llvm/docs/RISCV/RISCVVectorExtension.rst

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
new file mode 100644
index 0000000000000..41436f79dd44c
--- /dev/null
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -0,0 +1,285 @@
+=========================
+ RISC-V Vector Extension
+=========================
+
+.. contents::
+   :local:
+
+The RISC-V target readily supports the 1.0 version of the `RISC-V Vector Extension (RVV) <https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc>`_, but requires some tricks to handle its unique design.
+This guide gives an overview of how RVV is modelled in LLVM IR and how the backend generates code for it.
+
+Mapping to LLVM IR types
+========================
+
+RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to the compiler. To be able to represent ``VLEN`` sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://lists.llvm.org/pipermail/llvm-dev/2018-July/124396.html>`_.
+
+Scalable vector types are of the form ``<vscale x n x ty>``, which indicate a vector with a multiple of ``n`` elements of type ``ty``. ``n`` and ``ty`` then end up controlling LMUL and SEW respectively.
+
+LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64`` (see ``RISCV::RVVBitsPerBlock``).
+This makes the LLVM IR types stable between the two ``ELEN`` s considered, i.e. every LLVM IR scalable vector type has exactly one corresponding pair of element type and LMUL, and vice-versa.
+
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+|                   | LMUL=⅛        | LMUL=¼         | LMUL=½           | LMUL=1            | LMUL=2            | LMUL=4            | LMUL=8            |
++===================+===============+================+==================+===================+===================+===================+===================+
+| i64 (ELEN=64)     | N/A           | N/A            | N/A              | <v x 1 x i64>     | <v x 2 x i64>     | <v x 4 x i64>     | <v x 8 x i64>     |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| i32               | N/A           | N/A            | <v x 1 x i32>    | <v x 2 x i32>     | <v x 4 x i32>     | <v x 8 x i32>     | <v x 16 x i32>    |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| i16               | N/A           | <v x 1 x i16>  | <v x 2 x i16>    | <v x 4 x i16>     | <v x 8 x i16>     | <v x 16 x i16>    | <v x 32 x i16>    |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| i8                | <v x 1 x i8>  | <v x 2 x i8>   | <v x 4 x i8>     | <v x 8 x i8>      | <v x 16 x i8>     | <v x 32 x i8>     | <v x 64 x i8>     |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| double (ELEN=64)  | N/A           | N/A            | N/A              | <v x 1 x double>  | <v x 2 x double>  | <v x 4 x double>  | <v x 8 x double>  |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| float             | N/A           | N/A            | <v x 1 x float>  | <v x 2 x float>   | <v x 4 x float>   | <v x 8 x float>   | <v x 16 x float>  |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+| half              | N/A           | <v x 1 x half> | <v x 2 x half>   | <v x 4 x half>    | <v x 8 x half>    | <v x 16 x half>   | <v x 32 x half>   |
++-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
+
+(Read ``<v x k x ty>`` as ``<vscale x k x ty>``)
+
+
+Mask vector types
+-----------------
+
+As for mask vectors, they are physically represented using a layout of densely packed bits in a vector register.
+They are mapped to the following LLVM IR types:
+
+- <vscale x 1 x i1>
+- <vscale x 2 x i1>
+- <vscale x 4 x i1>
+- <vscale x 8 x i1>
+- <vscale x 16 x i1>
+- <vscale x 32 x i1>
+- <vscale x 64 x i1>
+
+Two types with the same SEW/LMUL ratio share the same mask type. For instance, a comparison under SEW=64, LMUL=2 and another under SEW=32, LMUL=1 both produce a ``<vscale x 2 x i1>`` mask.
+
+Representation in LLVM IR
+=========================
+
+Vector instructions can be represented in three main ways in LLVM IR:
+
+1. Regular instructions on both fixed and scalable vector types
+
+   .. code-block:: llvm
+
+       %c = add <vscale x 4 x i32> %a, %b
+
+2. RISC-V vector intrinsics, which mirror the `C intrinsics specification <https://github.com/riscv-non-isa/rvv-intrinsic-doc>`_
+
+   These come in unmasked variants:
+
+   .. code-block:: llvm
+
+       %c = call @llvm.riscv.vadd.nxv4i32.nxv4i32(
+              <vscale x 4 x i32> %passthru,
+	      <vscale x 4 x i32> %a,
+	      <vscale x 4 x i32> %b,
+	      i64 %avl
+	    )
+
+   As well as masked variants:
+
+   .. code-block:: llvm
+
+       %c = call @llvm.riscv.vadd.nxv4i32.nxv4i32(
+              <vscale x 4 x i32> %passthru,
+	      <vscale x 4 x i32> %a,
+	      <vscale x 4 x i32> %b,
+	      i64 %avl
+	    )
+
+   Both allow setting the AVL as well as controlling the inactive/tail elements via the passthru operand, but the masked variant also provides operands for the mask and ``vta``/``vma`` policy bits.
+
+   The only valid types are scalable vector types.
+
+3. :doc:`Vector predication (VP) intrinsics </Proposals/VectorPredication>`
+
+   .. code-block:: llvm
+
+       %c = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(
+              <vscale x 4 x i32> %a,
+              <vscale x 4 x i32> %b,
+              <vscale x 4 x i1> %m,
+              i32 %evl
+            )
+
+   Unlike RISC-V intrinsics, VP intrinsics are target agnostic so they can be emitted from other optimisation passes in the middle-end (like the loop vectorizer). They also support fixed length vector types.
+
+SelectionDAG lowering
+=====================
+
+For regular **scalable** vector LLVM IR instructions, their corresponding SelectionDAG nodes are legal on RISC-V and don't require any custom lowering.
+
+.. code-block::
+
+   t5: nxv4i32 = add t2, t4
+
+RISC-V vector intrinsics are also always scalable and so don't need custom lowering:
+
+.. code-block::
+
+   t12: nxv4i32 = llvm.riscv.vadd TargetConstant:i64<10056>, undef:nxv4i32, t2, t4, t6
+
+Fixed length vectors
+--------------------
+
+The only legal vector MVTs on RISC-V are scalable, so fixed length vectors need to be custom lowered performed in a scalable container type.
+
+1. The fixed length vector operands are inserted into scalable containers via ``insert_subvector``. The container size is chosen to have a minimum size big enough to fit the fixed length vector (see ``getContainerForFixedLengthVector``).
+2. The operation is then performed via a scalable **VL (vector length) node**. These are custom nodes that contain an AVL operand which is set to the size of the fixed length vector, and are defined in RISCVInstrInfoVVLPatterns.td.
+3. The result is put back into a fixed length vector via ``extract_subvector``.
+
+.. code-block::
+
+   t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
+     t4: v4i32 = extract_subvector t2, Constant:i64<0>
+       t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
+     t7: v4i32 = extract_subvector t6, Constant:i64<0>
+   t8: v4i32 = add t4, t7
+
+   // custom lowered to:
+
+       t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
+       t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
+       t15: nxv2i1 = RISCVISD::VMSET_VL Constant:i64<4>
+     t16: nxv2i32 = RISCVISD::ADD_VL t2, t6, undef:nxv2i32, t15, Constant:i64<4>
+   t17: v4i32 = extract_subvector t16, Constant:i64<0>
+
+VL nodes often have a passthru or mask operand, which are usually set to undef and all ones for fixed length vectors.
+
+The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed vector types to scalable. Note that the vectors at the interface of a function are always scalable vectors.
+
+.. note::
+
+   The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact ``VLEN`` must be known at compile time (i.e. compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
+
+Vector predication intrinsics
+-----------------------------
+
+VP intrinsics also get custom lowered via VL nodes in order to set the EVL and mask.
+
+.. code-block::
+
+   t12: nxv2i32 = vp_add t2, t4, t6, Constant:i64<8>
+
+   // custom lowered to:
+
+   t18: nxv2i32 = RISCVISD::ADD_VL t2, t4, undef:nxv2i32, t6, Constant:i64<8>
+
+
+Instruction selection
+=====================
+
+VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector MachineInstrs. Instead a layer of pseudo instructions get selected which carry the extra information needed to emit the necessary ``vsetvli`` instructions later.
+
+.. code-block::
+
+   %c:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %a:vrm2, %b:vrm2, %vl:gpr, 5
+
+Each vector instruction has multiple pseudo instructions defined in ``RISCVInstrInfoVPseudos.td``.
+
+The pseudos have operands for the AVL and SEW (encoded as a power of 2), as well as potentially the mask, policy or rounding mode if applicable.
+The passthru operand is tied to the destination register to control the inactive/tail elements.
+
+For each possible LMUL there is a variant of the pseudo instruction, as it affects the register class needed for the operands, and similarly there are ``_MASK`` variants that control whether or not the instruction is masked.
+
+For scalable vectors that should use VLMAX, the AVL is set to a sentinel value of -1.
+
+There are patterns for target agnostic SelectionDAG nodes in ``RISCVInstrInfoVSDPatterns.td``, VL nodes in ``RISCVInstrInfoVVLPatterns.td`` and RVV intrinsics in ``RISCVInstrInfoVPseudos.td``.
+
+Mask patterns
+-------------
+
+For the VL patterns we only match to masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo. The ``RISCVDAGToDAGISel::doPeepholeMaskedRVV`` will detects that the mask is all ones during post-processing and convert it into its unmasked form.
+
+.. code-block::
+
+     t15: nxv4i1 = RISCVISD::VMSET_VL Constant:i32<-1>
+   t16: nxv4i32 = PseudoVADD_VV_M2_MASK t0, t2, t4, t15, -1, 5
+
+   // gets optimized to:
+
+   t16: nxv4i32 = PseudoVADD_VV_M2 t0, t2, t4, -1, 5
+
+.. note::
+
+   Any vmset_vl can be treated as an all ones mask since the tail elements past VL are undef and can be replaced with ones.
+
+For masked pseudos the mask operand is copied to the physical ``$v0`` register with a glued ``CopyToReg`` node:
+
+.. code-block::
+
+     t23: ch,glue = CopyToReg t0, Register:nxv4i1 $v0, t6
+   t25: nxv4i32 = PseudoVADD_VV_M2_MASK Register:nxv4i32 $noreg, t2, t4, Register:nxv4i1 $v0, TargetConstant:i64<8>, TargetConstant:i64<5>, TargetConstant:i64<1>, t23:1
+
+Register allocation
+===================
+
+Register allocation is split between vector and scalar registers, with vector allocation running first:
+
+.. code-block::
+
+  $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, %vl:gpr, 5
+
+.. note::
+
+   We split register allocation between vectors and scalars so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but still before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
+
+   Performing RISCVInsertVSETVLI after vector register allocation imposes fewer constraints on the machine scheduler since it cannot schedule instructions past vsetvlis, and it allows us to emit further vector pseudos during spilling or constant rematerialization.
+
+There are four register classes for vectors:
+
+- ``VR`` for vector registers (``v0``, ``v1``, ..., ``v31``). Used when :math:`\text{LMUL} \leq 1` and for mask registers.
+- ``VRM2`` for vector groups of length 2 i.e. :math:`\text{LMUL}=2` (``v0m2``, ``v2m2``, ..., ``v30m2``)
+- ``VRM4`` for vector groups of length 4 i.e. :math:`\text{LMUL}=4` (``v0m4``, ``v4m4``, ..., ``v28m4``)
+- ``VRM8`` for vector groups of length 8 i.e. :math:`\text{LMUL}=8` (``v0m8``, ``v8m8``, ..., ``v24m8``)
+
+:math:`\text{LMUL} \lt 1` types and mask types do not benefit from having a dedicated class, so ``VR`` is used in their case.
+
+Some instructions have a constraint that a register operand cannot be ``V0`` or overlap with ``V0``, so for these cases we also have ``VRNoV0`` variants.
+
+.. _RISCVInsertVSETVLI:
+
+RISCVInsertVSETVLI
+==================
+
+After vector registers are allocated, the RISCVInsertVSETVLI pass will insert the necessary vsetvlis for the pseudos.
+
+.. code-block::
+
+  dead $x0 = PseudoVSETVLI %vl:gpr, 209, implicit-def $vl, implicit-def $vtype
+  $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
+
+The physical ``$vl`` and ``$vtype`` registers are implicitly defined by the ``PseudoVSETVLI``, and are implicitly used by the ``PseudoVADD``.
+The VTYPE operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
+
+RISCVInsertVSETVLI performs dataflow analysis to emit as few vsetvlis as possible. It will also try to minimize the number of vsetvlis that set VL, i.e. it will emit ``vsetvli x0, x0`` if only VTYPE needs to be changed but VL doesn't.
+
+Pseudo expansion and printing
+=============================
+
+After scalar register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass expands out the ``PseudoVSETVLI``.
+
+.. code-block::
+
+   dead $x0 = VSETVLI $x1, 209, implicit-def $vtype, implicit-def $vl
+   renamable $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
+
+Note that the vector pseudo remains as it's needed to encode the register class for the LMUL, so the VL and SEW operands are unused.
+
+``RISCVAsmPrinter`` will then lower the pseudo instructions into real ``MCInsts``.
+
+.. code-block:: nasm
+
+   vsetvli zero, ra, e32, m2, ta, ma
+   vadd.vv v8, v8, v10
+
+
+See also
+========
+
+- `2023 LLVM Dev Mtg - Vector codegen in the RISC-V backend <https://youtu.be/-ox8iJmbp0c?feature=shared>`_
+- `2023 LLVM Dev Mtg - How to add an C intrinsic and code-gen it, using the RISC-V vector C intrinsics <https://youtu.be/t17O_bU1jks?feature=shared>`_
+- `2021 LLVM Dev Mtg “Optimizing code for scalable vector architectures” <https://youtu.be/daWLCyhwrZ8?feature=shared>`_
diff --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst
index 18d273a51daf6..bf7cdda89a009 100644
--- a/llvm/docs/UserGuides.rst
+++ b/llvm/docs/UserGuides.rst
@@ -64,6 +64,7 @@ intermediate LLVM representation.
    Remarks
    RemoveDIsDebugInfo
    RISCVUsage
+   RISCV/RISCVVectorExtension
    SourceLevelDebugging
    SPIRVUsage
    StackSafetyAnalysis
@@ -284,3 +285,6 @@ Additional Topics
 
 :doc:`RISCVUsage`
    This document describes using the RISCV-V target.
+
+:doc:`RISCV/RISCVVectorExtension`
+   This document describes how the RISC-V Vector extension can be expressed in LLVM IR and how code is generated for it in the backend.
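
A note for readers following the series: the VTYPE immediate 209 used in the RISCVInsertVSETVLI example of this patch can be checked by hand against the vtype layout in the RVV 1.0 specification (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7). Below is a minimal Python sketch of that decoding. It is only an illustration, not LLVM's RISCVVType code, and the name decode_vtype is made up for this note.

```python
# Illustrative only: decode a vtype immediate using the RVV 1.0 field layout.
# decode_vtype is a hypothetical helper, not an LLVM API.

# vlmul encodings from the specification; 0b100 is reserved.
LMUL_NAMES = {0: "m1", 1: "m2", 2: "m4", 3: "m8", 5: "mf8", 6: "mf4", 7: "mf2"}


def decode_vtype(vtype: int) -> str:
    vlmul = vtype & 0b111        # bits 2:0, register group multiplier
    vsew = (vtype >> 3) & 0b111  # bits 5:3, selected element width
    vta = (vtype >> 6) & 1       # bit 6, tail agnostic
    vma = (vtype >> 7) & 1       # bit 7, mask agnostic

    if vsew > 3 or vlmul not in LMUL_NAMES:
        raise ValueError(f"reserved vtype encoding: {vtype}")

    sew = 8 << vsew              # 0 -> e8, 1 -> e16, 2 -> e32, 3 -> e64
    tail = "ta" if vta else "tu"
    mask = "ma" if vma else "mu"
    return f"e{sew}, {LMUL_NAMES[vlmul]}, {tail}, {mask}"


if __name__ == "__main__":
    print(decode_vtype(209))
```

Decoding 209 this way gives the same e32, m2, ta, ma configuration that appears in the final assembly listing of the document.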

>From 51afa69b54b352c1b0ec279a52b9dc1cda91aca5 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 26 Jun 2024 18:08:21 +0800
Subject: [PATCH 02/10] Add a missing word, reference langref for scalable
 vector types

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 41436f79dd44c..7869c6699633d 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -11,7 +11,7 @@ This guide gives an overview of how RVV is modelled in LLVM IR and how the backe
 Mapping to LLVM IR types
 ========================
 
-RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to the compiler. To be able to represent ``VLEN`` sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://lists.llvm.org/pipermail/llvm-dev/2018-July/124396.html>`_.
+RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to the compiler. To be able to represent ``VLEN`` sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://llvm.org/docs/LangRef.html#t-vector>`_.
 
 Scalable vector types are of the form ``<vscale x n x ty>``, which indicate a vector with a multiple of ``n`` elements of type ``ty``. ``n`` and ``ty`` then end up controlling LMUL and SEW respectively.
 
@@ -125,7 +125,7 @@ RISC-V vector intrinsics are also always scalable and so don't need custom lower
 Fixed length vectors
 --------------------
 
-The only legal vector MVTs on RISC-V are scalable, so fixed length vectors need to be custom lowered performed in a scalable container type.
+The only legal vector MVTs on RISC-V are scalable, so fixed length vectors need to be custom lowered and performed in a scalable container type:
 
 1. The fixed length vector operands are inserted into scalable containers via ``insert_subvector``. The container size is chosen to have a minimum size big enough to fit the fixed length vector (see ``getContainerForFixedLengthVector``).
 2. The operation is then performed via a scalable **VL (vector length) node**. These are custom nodes that contain an AVL operand which is set to the size of the fixed length vector, and are defined in RISCVInstrInfoVVLPatterns.td.

>From d1106ac4cc372b079548211aea2316753faa1bb3 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 26 Jun 2024 21:54:39 +0800
Subject: [PATCH 03/10] Address some review comments

- Fix use of which
- Clarify what VP operands are used for in VL nodes
- Clarify why RVV intrinsics are legal
- Be more precise on usage of "legal" fixed vector types. They are legal, there's just no patterns for them
- Clarify how fixed length vector arguments are passed in

Still some comments to address!
---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 26 ++++++++++++++----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 7869c6699633d..8cbc9261bd908 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -16,7 +16,7 @@ RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to t
 Scalable vector types are of the form ``<vscale x n x ty>``, which indicate a vector with a multiple of ``n`` elements of type ``ty``. ``n`` and ``ty`` then end up controlling LMUL and SEW respectively.
 
 LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64`` (see ``RISCV::RVVBitsPerBlock``).
-This makes the LLVM IR types stable between the two ``ELEN`` s considered, i.e. every LLVM IR scalable vector type has exactly one corresponding pair of element type and LMUL, and vice-versa.
+This makes the LLVM IR types stable between the two ``ELEN`` s considered, i.e., every LLVM IR scalable vector type has exactly one corresponding pair of element type and LMUL, and vice-versa.
 
 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
 |                   | LMUL=⅛        | LMUL=¼         | LMUL=½           | LMUL=1            | LMUL=2            | LMUL=4            | LMUL=8            |
@@ -83,7 +83,7 @@ Vector instructions can be represented in three main ways in LLVM IR:
 
    .. code-block:: llvm
 
-       %c = call @llvm.riscv.vadd.nxv4i32.nxv4i32(
+       %c = call @llvm.riscv.vadd.mask.nxv4i32.nxv4i32(
               <vscale x 4 x i32> %passthru,
 	      <vscale x 4 x i32> %a,
 	      <vscale x 4 x i32> %b,
@@ -110,13 +110,15 @@ Vector instructions can be represented in three main ways in LLVM IR:
 SelectionDAG lowering
 =====================
 
-For regular **scalable** vector LLVM IR instructions, their corresponding SelectionDAG nodes are legal on RISC-V and don't require any custom lowering.
+For most regular **scalable** vector LLVM IR instructions, their corresponding SelectionDAG nodes are legal on RISC-V and don't require any custom lowering.
 
 .. code-block::
 
    t5: nxv4i32 = add t2, t4
 
-RISC-V vector intrinsics are also always scalable and so don't need custom lowering:
+This is because the TableGen patterns for RVV are only defined for scalable vector types.
+
+RISC-V vector intrinsics only support scalable vector types, so they are also legal.
 
 .. code-block::
 
@@ -125,10 +127,11 @@ RISC-V vector intrinsics are also always scalable and so don't need custom lower
 Fixed length vectors
 --------------------
 
-The only legal vector MVTs on RISC-V are scalable, so fixed length vectors need to be custom lowered and performed in a scalable container type:
+Because there are no fixed length vector patterns, fixed length vectors need to be custom lowered and performed in a scalable "container" type:
 
-1. The fixed length vector operands are inserted into scalable containers via ``insert_subvector``. The container size is chosen to have a minimum size big enough to fit the fixed length vector (see ``getContainerForFixedLengthVector``).
-2. The operation is then performed via a scalable **VL (vector length) node**. These are custom nodes that contain an AVL operand which is set to the size of the fixed length vector, and are defined in RISCVInstrInfoVVLPatterns.td.
+1. The fixed length vector operands are inserted into scalable containers with ``insert_subvector`` nodes. The container type is chosen such that its minimum size will fit the fixed length vector (see ``getContainerForFixedLengthVector``).
+2. The operation is then performed on the container type via a **VL (vector length) node**. These are custom nodes defined in ``RISCVInstrInfoVVLPatterns.td`` that mirror target agnostic SelectionDAG nodes, as well as some RVV instructions. They contain an AVL operand, which is set to the number of elements in the fixed length vector.
+   Some nodes also have a passthru or mask operand, which will usually be set to undef and all ones when lowering fixed length vectors.
 3. The result is put back into a fixed length vector via ``extract_subvector``.
 
 .. code-block::
@@ -149,7 +152,7 @@ The only legal vector MVTs on RISC-V are scalable, so fixed length vectors need
 
 VL nodes often have a passthru or mask operand, which are usually set to undef and all ones for fixed length vectors.
 
-The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed vector types to scalable. Note that the vectors at the interface of a function are always scalable vectors.
+The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed vector types to scalable. Note that fixed length vectors at the interface of a function are passed in a scalable vector container.
 
 .. note::
 
@@ -158,7 +161,7 @@ The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrappin
 Vector predication intrinsics
 -----------------------------
 
-VP intrinsics also get custom lowered via VL nodes in order to set the EVL and mask.
+VP intrinsics also get custom lowered via VL nodes.
 
 .. code-block::
 
@@ -168,11 +171,12 @@ VP intrinsics also get custom lowered via VL nodes in order to set the EVL and m
 
    t18: nxv2i32 = RISCVISD::ADD_VL t2, t4, undef:nxv2i32, t6, Constant:i64<8>
 
+The VP EVL and mask are used for the VL node's AVL and mask respectively, whilst the passthru is set to undef. A passthru can be emulated to get tail/mask undisturbed behaviour by using ``@llvm.vp.merge``. It will get lowered as a ``vmerge``, but will likely be merged back into the underlying instruction's mask via ``RISCVDAGToDAGISel::performCombineVMergeAndVOps``.
 
 Instruction selection
 =====================
 
-VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector MachineInstrs. Instead a layer of pseudo instructions get selected which carry the extra information needed to emit the necessary ``vsetvli`` instructions later.
+VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector MachineInstrs. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary vsetvlis later.
 
 .. code-block::
 
@@ -192,7 +196,7 @@ There are patterns for target agnostic SelectionDAG nodes in ``RISCVInstrInfoVSD
 Mask patterns
 -------------
 
-For the VL patterns we only match to masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo. The ``RISCVDAGToDAGISel::doPeepholeMaskedRVV`` will detects that the mask is all ones during post-processing and convert it into its unmasked form.
+For the VL patterns we only match to masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo. ``RISCVFoldMasks::convertToUnmasked`` will detect if the mask is all ones and convert it into its unmasked form.
 
 .. code-block::
 

>From 704954bd1db82e7f6a0c9d4a243d4fbffb60bb40 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 11:37:54 +0800
Subject: [PATCH 04/10] Fix masked intrinsic example

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 8cbc9261bd908..85a25bab74814 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -87,7 +87,9 @@ Vector instructions can be represented in three main ways in LLVM IR:
               <vscale x 4 x i32> %passthru,
 	      <vscale x 4 x i32> %a,
 	      <vscale x 4 x i32> %b,
-	      i64 %avl
+	      <vscale x 4 x i1> %mask,
+	      i64 %avl,
+	      i64 0 ; policy (must be an immediate)
 	    )
 
    Both allow setting the AVL as well as controlling the inactive/tail elements via the passthru operand, but the masked variant also provides operands for the mask and ``vta``/``vma`` policy bits.

>From ebbc9277c68703000b8a614b6951435625d26178 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 16:46:57 +0800
Subject: [PATCH 05/10] Add original RFC to see also

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 85a25bab74814..137e5faebb011 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -283,9 +283,11 @@ Note that the vector pseudo remains as it's needed to encode the register class
    vadd.vv v8, v8, v10
 
 
+
 See also
 ========
 
+- `[llvm-dev] [RFC] Code generation for RISC-V V-extension <https://lists.llvm.org/pipermail/llvm-dev/2020-October/145850.html>`_
 - `2023 LLVM Dev Mtg - Vector codegen in the RISC-V backend <https://youtu.be/-ox8iJmbp0c?feature=shared>`_
 - `2023 LLVM Dev Mtg - How to add an C intrinsic and code-gen it, using the RISC-V vector C intrinsics <https://youtu.be/t17O_bU1jks?feature=shared>`_
 - `2021 LLVM Dev Mtg “Optimizing code for scalable vector architectures” <https://youtu.be/daWLCyhwrZ8?feature=shared>`_

>From ed6f34c2cb4ed26e8c7919b47b64c5343e1bfdda Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 16:47:33 +0800
Subject: [PATCH 06/10] Clarify that defining vscale = VLEN / 64 prevents VLEN
 = 32 from being supported

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 137e5faebb011..74e148cea60c0 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -16,7 +16,7 @@ RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to t
 Scalable vector types are of the form ``<vscale x n x ty>``, which indicate a vector with a multiple of ``n`` elements of type ``ty``. ``n`` and ``ty`` then end up controlling LMUL and SEW respectively.
 
 LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64`` (see ``RISCV::RVVBitsPerBlock``).
-This makes the LLVM IR types stable between the two ``ELEN`` s considered, i.e., every LLVM IR scalable vector type has exactly one corresponding pair of element type and LMUL, and vice-versa.
+Note this means that ``VLEN>=64``, so ``VLEN=32`` isn't currently supported.
 
 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
 |                   | LMUL=⅛        | LMUL=¼         | LMUL=½           | LMUL=1            | LMUL=2            | LMUL=4            | LMUL=8            |

>From 2e877eba529651ffa0f9fb6972342ddd2f93b3a9 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 16:48:33 +0800
Subject: [PATCH 07/10] indicate->indicates, reword how n and ty control LMUL
 and SEW

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 74e148cea60c0..90217a2622f5b 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -13,7 +13,8 @@ Mapping to LLVM IR types
 
 RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to the compiler. To be able to represent ``VLEN`` sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://llvm.org/docs/LangRef.html#t-vector>`_.
 
-Scalable vector types are of the form ``<vscale x n x ty>``, which indicate a vector with a multiple of ``n`` elements of type ``ty``. ``n`` and ``ty`` then end up controlling LMUL and SEW respectively.
+Scalable vector types are of the form ``<vscale x n x ty>``, which indicates a vector with a multiple of ``n`` elements of type ``ty``.
+On RISC-V ``n`` and ``ty`` control LMUL and SEW respectively.
 
 LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64`` (see ``RISCV::RVVBitsPerBlock``).
 Note this means that ``VLEN>=64``, so ``VLEN=32`` isn't currently supported.

>From 9943aa7a2795c935f33dfb503e8fa3d829247df1 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 16:49:46 +0800
Subject: [PATCH 08/10] Address review comments

- Put more code terms inside backticks
- Update VP link
- fixed length -> fixed-length
- Add table comparing operands and properties of different LLVM IR constructs
- Add missing mask intrinsic operands
- Add table listing different pseudo permutations
- Add missing policy operand to PseudoVADD examples
- Use MIR in masked pattern examples
- i.e. -> i.e.,
- Move vp vmerge passthru emulation from SelectionDAG section to LLVM IR section
- Fix typos
- Update SelectionDAG indentation
---
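Two of the numeric conventions touched by this patch can be checked with a small sketch: the mask type shared by types with the same SEW/LMUL ratio, and the SEW operand on pseudos being stored as a log2. ``maskElts`` and ``log2SEW`` are hypothetical helpers, and the 64-bit block size is the one implied by ``vscale`` = VLEN/64.

```cpp
// Hypothetical helper: N in the mask type <vscale x N x i1> for a given SEW
// and LMUL (passed as a fraction LMulNum/LMulDen), i.e. 64 * LMUL / SEW.
constexpr unsigned maskElts(unsigned SEW, unsigned LMulNum, unsigned LMulDen) {
  return 64 * LMulNum / (SEW * LMulDen);
}

// Hypothetical helper: the power-of-two encoding used for the SEW operand.
constexpr unsigned log2SEW(unsigned SEW) {
  return SEW == 1 ? 0 : 1 + log2SEW(SEW / 2);
}

// SEW=64, LMUL=2 and SEW=32, LMUL=1 share the ratio 32, so both use
// <vscale x 2 x i1>.
static_assert(maskElts(64, 2, 1) == 2, "SEW=64, LMUL=2");
static_assert(maskElts(32, 1, 1) == 2, "SEW=32, LMUL=1");
// The extremes of the list: <vscale x 1 x i1> and <vscale x 64 x i1>.
static_assert(maskElts(64, 1, 1) == 1, "SEW=64, LMUL=1");
static_assert(maskElts(8, 8, 1) == 64, "SEW=8, LMUL=8");
// The "5 /*sew*/" operand in the PseudoVADD_VV_M2 example means SEW=32.
static_assert(log2SEW(32) == 5, "SEW=32 is encoded as 5");
```
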
 llvm/docs/RISCV/RISCVVectorExtension.rst | 152 ++++++++++++++---------
 1 file changed, 96 insertions(+), 56 deletions(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 90217a2622f5b..55265b1990676 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -43,29 +43,31 @@ Note this means that ``VLEN>=64``, so ``VLEN=32`` isn't currently supported.
 Mask vector types
 -----------------
 
-As for mask vectors, they are physically represented using a layout of densely packed bits in a vector register.
+Mask vectors are physically represented using a layout of densely packed bits in a vector register.
 They are mapped to the following LLVM IR types:
 
-- <vscale x 1 x i1>
-- <vscale x 2 x i1>
-- <vscale x 4 x i1>
-- <vscale x 8 x i1>
-- <vscale x 16 x i1>
-- <vscale x 32 x i1>
-- <vscale x 64 x i1>
+- ``<vscale x 1 x i1>``
+- ``<vscale x 2 x i1>``
+- ``<vscale x 4 x i1>``
+- ``<vscale x 8 x i1>``
+- ``<vscale x 16 x i1>``
+- ``<vscale x 32 x i1>``
+- ``<vscale x 64 x i1>``
 
-Two types with the same ratio SEW/LMUL will have the same related mask type. For instance, two different comparisons one under SEW=64, LMUL=2 and the other under SEW=32, LMUL=1 will both generate a mask <vscale x 2 x i1>.
+Two types with the same SEW/LMUL ratio will have the same related mask type.
+For instance, two different comparisons, one under SEW=64, LMUL=2 and the other under SEW=32, LMUL=1, will both generate a ``<vscale x 2 x i1>`` mask.
 
 Representation in LLVM IR
 =========================
 
 Vector instructions can be represented in three main ways in LLVM IR:
 
-1. Regular instructions on both fixed and scalable vector types
+1. Regular instructions on both scalable and fixed-length vector types
 
    .. code-block:: llvm
 
        %c = add <vscale x 4 x i32> %a, %b
+       %f = add <4 x i32> %d, %e
 
 2. RISC-V vector intrinsics, which mirror the `C intrinsics specification <https://github.com/riscv-non-isa/rvv-intrinsic-doc>`_
 
@@ -97,7 +99,7 @@ Vector instructions can be represented in three main ways in LLVM IR:
 
    The only valid types are scalable vector types.
 
-3. :doc:`Vector predication (VP) intrinsics </Proposals/VectorPredication>`
+3. :ref:`Vector predication (VP) intrinsics <int_vp>`
 
    .. code-block:: llvm
 
@@ -108,7 +110,23 @@ Vector instructions can be represented in three main ways in LLVM IR:
 	      i32 %evl
 	    )
 
-   Unlike RISC-V intrinsics, VP intrinsics are target agnostic so they can be emitted from other optimisation passes in the middle-end (like the loop vectorizer). They also support fixed length vector types.
+   Unlike RISC-V intrinsics, VP intrinsics are target agnostic so they can be emitted from other optimisation passes in the middle-end (like the loop vectorizer). They also support fixed-length vector types.
+
+   VP intrinsics also don't have passthru operands, but tail/mask undisturbed behaviour can be emulated by using the output in a ``@llvm.vp.merge``.
+   It will get lowered as a ``vmerge``, but will be merged back into the underlying instruction's mask via ``RISCVDAGToDAGISel::performCombineVMergeAndVOps``.
+
+
+The different properties of the above representations are summarized below:
+
++----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
+|                      | AVL          | Masking         | Passthru | Scalable vectors | Fixed-length vectors | Target agnostic |
++======================+==============+=================+==========+==================+======================+=================+
+| LLVM IR instructions | Always VLMAX | No              | None     | Yes              | Yes                  | Yes             |
++----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
+| RVV intrinsics       | Yes          | Yes             | Yes      | Yes              | No                   | No              |
++----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
+| VP intrinsics        | Yes (EVL)    | Yes             | No       | Yes              | Yes                  | Yes             |
++----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
 
 SelectionDAG lowering
 =====================
@@ -119,33 +137,31 @@ For most regular **scalable** vector LLVM IR instructions, their corresponding S
 
    t5: nxv4i32 = add t2, t4
 
-This is because the TableGen patterns for RVV are only defined for scalable vector types.
-
-RISC-V vector intrinsics only support scalable vector types, so they are also legal.
+RISC-V vector intrinsics also don't require any custom lowering.
 
 .. code-block::
 
    t12: nxv4i32 = llvm.riscv.vadd TargetConstant:i64<10056>, undef:nxv4i32, t2, t4, t6
 
-Fixed length vectors
+Fixed-length vectors
 --------------------
 
-Because there are no fixed length vector patterns, fixed length vectors need to be custom lowered and performed in a scalable "container" type:
+Because there are no fixed-length vector patterns, fixed-length vectors need to be custom lowered and performed in a scalable "container" type:
 
-1. The fixed length vector operands are inserted into scalable containers with ``insert_subvector`` nodes. The container type is chosen such that its minimum size will fit the fixed length vector (see ``getContainerForFixedLengthVector``).
-2. The operation is then performed on the container type via a **VL (vector length) node**. These are custom nodes defined in ``RISCVInstrInfoVVLPatterns.td`` that mirror target agnostic SelectionDAG nodes, as well as some RVV instructions. They contain an AVL operand, which is set to the number of elements in the fixed length vector.
-   Some nodes also have a passthru or mask operand, which will usually be set to undef and all ones when lowering fixed length vectors.
-3. The result is put back into a fixed length vector via ``extract_subvector``.
+1. The fixed-length vector operands are inserted into scalable containers with ``insert_subvector`` nodes. The container type is chosen such that its minimum size will fit the fixed-length vector (see ``getContainerForFixedLengthVector``).
+2. The operation is then performed on the container type via a **VL (vector length) node**. These are custom nodes defined in ``RISCVInstrInfoVVLPatterns.td`` that mirror target agnostic SelectionDAG nodes, as well as some RVV instructions. They contain an AVL operand, which is set to the number of elements in the fixed-length vector.
+   Some nodes also have a passthru or mask operand, which will usually be set to ``undef`` and all ones when lowering fixed-length vectors.
+3. The result is put back into a fixed-length vector via ``extract_subvector``.
 
 .. code-block::
 
-   t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
-     t4: v4i32 = extract_subvector t2, Constant:i64<0>
+       t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
        t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
+     t4: v4i32 = extract_subvector t2, Constant:i64<0>
      t7: v4i32 = extract_subvector t6, Constant:i64<0>
    t8: v4i32 = add t4, t7
 
-   // custom lowered to:
+   // is custom lowered to:
 
        t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
        t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
@@ -153,13 +169,13 @@ Because there are no fixed length vector patterns, fixed length vectors need to
      t16: nxv2i32 = RISCVISD::ADD_VL t2, t6, undef:nxv2i32, t15, Constant:i64<4>
    t17: v4i32 = extract_subvector t16, Constant:i64<0>
 
-VL nodes often have a passthru or mask operand, which are usually set to undef and all ones for fixed length vectors.
+VL nodes often have a passthru or mask operand, which are usually set to ``undef`` and all ones for fixed-length vectors.
 
-The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed vector types to scalable. Note that fixed length vectors at the interface of a function are passed in a scalable vector container.
+The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed-length vector types to scalable. Note that fixed-length vectors at the interface of a function are passed in a scalable vector container.
 
 .. note::
 
-   The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact ``VLEN`` must be known at compile time (i.e. compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
+   The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed-length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact ``VLEN`` must be known at compile time (i.e., compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
 
 Vector predication intrinsics
 -----------------------------
@@ -170,56 +186,80 @@ VP intrinsics also get custom lowered via VL nodes.
 
    t12: nxv2i32 = vp_add t2, t4, t6, Constant:i64<8>
 
-   // custom lowered to:
+   // is custom lowered to:
 
    t18: nxv2i32 = RISCVISD::ADD_VL t2, t4, undef:nxv2i32, t6, Constant:i64<8>
 
-The VP EVL and mask are used for the VL node's AVL and mask respectively, whilst the passthru is set to undef. A passthru can be emulated to get tail/mask undisturbed behaviour by using ``@llvm.vp.merge``. It will get lowered as a ``vmerge``, but will likely be merged back into the underlying instruction's mask via ``RISCVDAGToDAGISel::performCombineVMergeAndVOps``.
+The VP EVL and mask are used for the VL node's AVL and mask respectively, whilst the passthru is set to ``undef``.
 
 Instruction selection
 =====================
 
-VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector MachineInstrs. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary vsetvlis later.
+VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector ``MachineInstr``. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary ``vsetvli``\s later.
 
 .. code-block::
 
-   %c:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %a:vrm2, %b:vrm2, %vl:gpr, 5
+   %c:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %a:vrm2, %b:vrm2, %vl:gpr, 5 /*sew*/, 3 /*policy*/
 
 Each vector instruction has multiple pseudo instructions defined in ``RISCVInstrInfoVPseudos.td``.
+There is a variant of each pseudo for each possible LMUL, as well as a masked variant. So a typical instruction like ``vadd.vv`` would have the following pseudos:
+
+.. code-block::
+
+   %rd:vr = PseudoVADD_VV_MF8 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_MF4 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_MF2 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_M1 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm4 = PseudoVADD_VV_M4 %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm8 = PseudoVADD_VV_M8 %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_MF8_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_MF4_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_MF2_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vr = PseudoVADD_VV_M1_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm4 = PseudoVADD_VV_M4_MASK %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, mask:$v0, %avl:gpr, sew:imm, policy:imm
+   %rd:vrm8 = PseudoVADD_VV_M8_MASK %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, mask:$v0, %avl:gpr, sew:imm, policy:imm
+
+.. note::
 
-The pseudos have operands for the AVL and SEW (encoded as a power of 2), as well as potentially the mask, policy or rounding mode if applicable.
-The passhthru operand is tied to the destination register to control the inactive/tail elements.
+   Whilst the SEW can be encoded in an operand, we need to use separate pseudos for each LMUL since different register groups will require different register classes: see :ref:`rvv_register_allocation`.
 
-For each possible LMUL there is a variant of the pseudo instruction, as it affects the register class needed for the operands, and similarly there are ``_MASK`` variants that control whether or not the instruction is masked.
 
-For scalable vectors that should use VLMAX, the AVL is set to a sentinel value of -1.
+Pseudos have operands for the AVL and SEW (encoded as a power of 2), as well as potentially the mask, policy or rounding mode if applicable.
+The passthru operand is tied to the destination register which will determine the inactive/tail elements.
+
+For scalable vectors that should use VLMAX, the AVL is set to a sentinel value of ``-1``.
 
 There are patterns for target agnostic SelectionDAG nodes in ``RISCVInstrInfoVSDPatterns.td``, VL nodes in ``RISCVInstrInfoVVLPatterns.td`` and RVV intrinsics in ``RISCVInstrInfoVPseudos.td``.
 
 Mask patterns
 -------------
 
-For the VL patterns we only match to masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo. ``RISCVFoldMasks::convertToUnmasked`` will detect if the mask is all ones and convert it into its unmasked form.
+For masked pseudos the mask operand is copied to the physical ``$v0`` register during instruction selection with a glued ``CopyToReg`` node:
 
 .. code-block::
 
-     t15: nxv4i1 = RISCVISD::VMSET_VL Constant:i32<-1>
-   t16: nxv4i32 = PseudoVADD_MASK_VV_M2 t0, t2, t4, t15, -1, 5
+     t23: ch,glue = CopyToReg t0, Register:nxv4i1 $v0, t6
+   t25: nxv4i32 = PseudoVADD_VV_M2_MASK Register:nxv4i32 $noreg, t2, t4, Register:nxv4i1 $v0, TargetConstant:i64<8>, TargetConstant:i64<5>, TargetConstant:i64<1>, t23:1
 
-   // gets optimized to:
+The patterns in ``RISCVInstrInfoVVLPatterns.td`` only match masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo.
+``RISCVFoldMasks::convertToUnmasked`` will detect if the mask is all ones and convert it into its unmasked form.
 
-   t16: nxv4i32 = PseudoVADD_VV_M2 t0, t2, t4, 4, 5
+.. code-block::
 
-.. note::
+   $v0 = PseudoVMSET_M_B16 -1, 32
+   %rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, $v0, %avl:gpr, sew:imm, policy:imm
 
-   Any vmset_vl can be treated as an all ones mask since the tail elements past VL are undef and can be replaced with ones.
+   // gets optimized to:
 
-For masked pseudos the mask operand is copied to the physical ``$v0`` register with a glued ``CopyToReg`` node:
+   %rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm
 
-.. code-block::
+.. note::
 
-     t23: ch,glue = CopyToReg t0, Register:nxv4i1 $v0, t6
-   t25: nxv4i32 = PseudoVADD_VV_M2_MASK Register:nxv4i32 $noreg, t2, t4, Register:nxv4i1 $v0, TargetConstant:i64<8>, TargetConstant:i64<5>, TargetConstant:i64<1>, t23:1
+   Any ``vmset.m`` can be treated as an all ones mask since the tail elements past AVL are ``undef`` and can be replaced with ones.
+
+.. _rvv_register_allocation:
 
 Register allocation
 ===================
@@ -228,20 +268,20 @@ Register allocation is split between vector and scalar registers, with vector al
 
 .. code-block::
 
-  $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, %vl:gpr, 5
+  $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, %vl:gpr, 5, 3
 
 .. note::
 
-   We split register allocation between vectors and scalars so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but still before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
+   Register allocation is split between vectors and scalars so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but before scalar register allocation. It needs to be run before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
 
-   Performing RISCVInsertVSETVLI after vector register allocation imposes fewer constraints on the machine scheduler since it cannot schedule instructions past vsetvlis, and it allows us to emit further vector pseudos during spilling or constant rematerialization.
+   Performing ``RISCVInsertVSETVLI`` after vector register allocation imposes fewer constraints on the machine scheduler since it cannot schedule instructions past ``vsetvli``\s, and it allows us to emit further vector pseudos during spilling or constant rematerialization.
 
 There are four register classes for vectors:
 
 - ``VR`` for vector registers (``v0``, ``v1``, ..., ``v31``). Used when :math:`\text{LMUL} \leq 1` and for mask registers.
-- ``VRM2`` for vector groups of length 2 i.e. :math:`\text{LMUL}=2` (``v0m2``, ``v2m2``, ..., ``v30m2``)
-- ``VRM4`` for vector groups of length 4 i.e. :math:`\text{LMUL}=4` (``v0m4``, ``v4m4``, ..., ``v28m4``)
-- ``VRM8`` for vector groups of length 8 i.e. :math:`\text{LMUL}=8` (``v0m8``, ``v8m8``, ..., ``v24m8``)
+- ``VRM2`` for vector groups of length 2 i.e., :math:`\text{LMUL}=2` (``v0m2``, ``v2m2``, ..., ``v30m2``)
+- ``VRM4`` for vector groups of length 4 i.e., :math:`\text{LMUL}=4` (``v0m4``, ``v4m4``, ..., ``v28m4``)
+- ``VRM8`` for vector groups of length 8 i.e., :math:`\text{LMUL}=8` (``v0m8``, ``v8m8``, ..., ``v24m8``)
 
 :math:`\text{LMUL} \lt 1` types and mask types do not benefit from having a dedicated class, so ``VR`` is used in their case.
 
@@ -252,7 +292,7 @@ Some instructions have a constraint that a register operand cannot be ``V0`` or
 RISCVInsertVSETVLI
 ==================
 
-After vector registers are allocated, the RISCVInsertVSETVLI pass will insert the necessary vsetvlis for the pseudos.
+After vector registers are allocated, the ``RISCVInsertVSETVLI`` pass will insert the necessary ``vsetvli``\s for the pseudos.
 
 .. code-block::
 
@@ -262,19 +302,19 @@ After vector registers are allocated, the RISCVInsertVSETVLI pass will insert th
 The physical ``$vl`` and ``$vtype`` registers are implicitly defined by the ``PseudoVSETVLI``, and are implicitly used by the ``PseudoVADD``.
 The VTYPE operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
 
-RISCVInsertVSETVLI performs dataflow analysis to emit as few vsetvlis as possible. It will also try to minimize the number of vsetvlis that set VL, i.e. it will emit ``vsetvli x0, x0`` if only VTYPE needs changed but VL doesn't.
+``RISCVInsertVSETVLI`` performs dataflow analysis to emit as few ``vsetvli``\s as possible. It will also try to minimize the number of ``vsetvli``\s that set VL, i.e., it will emit ``vsetvli x0, x0`` if only VTYPE needs changed but VL doesn't.
 
 Pseudo expansion and printing
 =============================
 
-After scalar register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass expands out the ``PseudoVSETVLI``.
+After scalar register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass expands the ``PseudoVSETVLI`` instructions.
 
 .. code-block::
 
    dead $x0 = VSETVLI $x1, 209, implicit-def $vtype, implicit-def $vl
    renamable $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
 
-Note that the vector pseudo remains as it's needed to encode the register class for the LMUL, so the VL and SEW operands are unused.
+Note that the vector pseudo remains as it's needed to encode the register class for the LMUL. Its VL and SEW operands are no longer used.
 
 ``RISCVAsmPrinter`` will then lower the pseudo instructions into real ``MCInsts``.
 

>From 0023d000f0fe18a16a3be526586dc14bf9b71361 Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Thu, 27 Jun 2024 17:13:11 +0800
Subject: [PATCH 09/10] Don't backtick design time constants, backtick
 registers. Correct VL -> AVL

---
 llvm/docs/RISCV/RISCVVectorExtension.rst | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 55265b1990676..8e0b694f39d21 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -11,13 +11,13 @@ This guide gives an overview of how RVV is modelled in LLVM IR and how the backe
 Mapping to LLVM IR types
 ========================
 
-RVV adds 32 ``VLEN`` sized registers, where ``VLEN`` is an unknown constant to the compiler. To be able to represent ``VLEN`` sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://llvm.org/docs/LangRef.html#t-vector>`_.
+RVV adds 32 VLEN sized registers, where VLEN is an unknown constant to the compiler. To be able to represent VLEN sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://llvm.org/docs/LangRef.html#t-vector>`_.
 
 Scalable vector types are of the form ``<vscale x n x ty>``, which indicates a vector with a multiple of ``n`` elements of type ``ty``.
 On RISC-V ``n`` and ``ty`` control LMUL and SEW respectively.
 
-LLVM supports only ``ELEN=32`` or ``ELEN=64``, so ``vscale`` is defined as ``VLEN/64`` (see ``RISCV::RVVBitsPerBlock``).
-Note this means that ``VLEN>=64``, so ``VLEN=32`` isn't currently supported.
+LLVM only supports ELEN=32 or ELEN=64, so ``vscale`` is defined as VLEN/64 (see ``RISCV::RVVBitsPerBlock``).
+Note this means that VLEN must be at least 64, so VLEN=32 isn't currently supported.
 
 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
 |                   | LMUL=⅛        | LMUL=¼         | LMUL=½           | LMUL=1            | LMUL=2            | LMUL=4            | LMUL=8            |
@@ -175,7 +175,7 @@ The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrappin
 
 .. note::
 
-   The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed-length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact ``VLEN`` must be known at compile time (i.e., compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
+   The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed-length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact VLEN must be known at compile time (i.e., compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
 
 Vector predication intrinsics
 -----------------------------
@@ -195,7 +195,7 @@ The VP EVL and mask are used for the VL node's AVL and mask respectively, whilst
 Instruction selection
 =====================
 
-VL and VTYPE need to be configured correctly, so we can't just directly select the underlying vector ``MachineInstr``. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary ``vsetvli``\s later.
+``vl`` and ``vtype`` need to be configured correctly, so we can't just directly select the underlying vector ``MachineInstr``. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary ``vsetvli``\s later.
 
 .. code-block::
 
@@ -302,7 +302,7 @@ After vector registers are allocated, the ``RISCVInsertVSETVLI`` pass will inser
 The physical ``$vl`` and ``$vtype`` registers are implicitly defined by the ``PseudoVSETVLI``, and are implicitly used by the ``PseudoVADD``.
 The VTYPE operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
 
-``RISCVInsertVSETVLI`` performs dataflow analysis to emit as few ``vsetvli``\s as possible. It will also try to minimize the number of ``vsetvli``\s that set VL, i.e., it will emit ``vsetvli x0, x0`` if only VTYPE needs changed but VL doesn't.
+``RISCVInsertVSETVLI`` performs dataflow analysis to emit as few ``vsetvli``\s as possible. It will also try to minimize the number of ``vsetvli``\s that set VL, i.e., it will emit ``vsetvli x0, x0`` if only ``vtype`` needs changed but ``vl`` doesn't.
 
 Pseudo expansion and printing
 =============================
@@ -314,9 +314,9 @@ After scalar register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass expand
    dead $x0 = VSETVLI $x1, 209, implicit-def $vtype, implicit-def $vl
    renamable $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
 
-Note that the vector pseudo remains as it's needed to encode the register class for the LMUL. Its VL and SEW operands are no longer used.
+Note that the vector pseudo remains as it's needed to encode the register class for the LMUL. Its AVL and SEW operands are no longer used.
 
-``RISCVAsmPrinter`` will then lower the pseudo instructions into real ``MCInsts``.
+``RISCVAsmPrinter`` will then lower the pseudo instructions into real ``MCInst``\s.
 
 .. code-block:: nasm
 

>From 55e4a0cab20f5c3ee6ad1c11942becb2c6bed98f Mon Sep 17 00:00:00 2001
From: Luke Lau <luke at igalia.com>
Date: Wed, 3 Jul 2024 11:57:09 +0800
Subject: [PATCH 10/10] Wording tweaks, VTYPE -> ``vtype``, remove redundant
 phrasing

---
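For readers wondering what the ``209`` in the ``vtype`` example stands for, it can be unpacked by hand using the field layout from the RVV 1.0 specification (``vlmul`` in bits 2:0, ``vsew`` in bits 5:3, ``vta`` in bit 6, ``vma`` in bit 7). The helpers below are hypothetical and are not the ``RISCVVType`` API; they only sketch the decoding.

```cpp
// Field extraction for the vtype layout in the RVV 1.0 specification.
constexpr unsigned vlmulField(unsigned VType) { return VType & 0x7; }
constexpr unsigned vsewField(unsigned VType) { return (VType >> 3) & 0x7; }
constexpr bool isTailAgnostic(unsigned VType) { return ((VType >> 6) & 1) != 0; }
constexpr bool isMaskAgnostic(unsigned VType) { return ((VType >> 7) & 1) != 0; }

// SEW in bits: vsew=0 means 8, vsew=1 means 16, and so on.
constexpr unsigned sewBits(unsigned VType) { return 8u << vsewField(VType); }

// 209 == 0b11010001: vma=1, vta=1, vsew=0b010, vlmul=0b001.
static_assert(vlmulField(209) == 1, "vlmul=0b001 encodes LMUL=2");
static_assert(sewBits(209) == 32, "vsew=0b010 encodes SEW=32");
static_assert(isTailAgnostic(209), "tail agnostic");
static_assert(isMaskAgnostic(209), "mask agnostic");
```

This agrees with the surrounding example: the ``vsetvli`` configures e32, m2, ta, ma for a ``PseudoVADD_VV_M2`` whose SEW operand is 5.
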
 llvm/docs/RISCV/RISCVVectorExtension.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/llvm/docs/RISCV/RISCVVectorExtension.rst b/llvm/docs/RISCV/RISCVVectorExtension.rst
index 8e0b694f39d21..39836a4b1ab9c 100644
--- a/llvm/docs/RISCV/RISCVVectorExtension.rst
+++ b/llvm/docs/RISCV/RISCVVectorExtension.rst
@@ -5,8 +5,8 @@
 .. contents::
    :local:
 
-The RISC-V target readily supports the 1.0 version of the `RISC-V Vector Extension (RVV) <https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc>`_, but requires some tricks to handle its unique design.
-This guide gives an overview of how RVV is modelled in LLVM IR and how the backend generates code for it.
+The RISC-V target supports the 1.0 version of the `RISC-V Vector Extension (RVV) <https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc>`_.
+This guide gives an overview of how it's modelled in LLVM IR and how the backend generates code for it.
 
 Mapping to LLVM IR types
 ========================
@@ -272,7 +272,7 @@ Register allocation is split between vector and scalar registers, with vector al
 
 .. note::
 
-   Register allocation is split between vectors and scalars so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but before scalar register allocation. It needs to be run before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
+   Register allocation is split so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but before scalar register allocation. It needs to be run before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
 
    Performing ``RISCVInsertVSETVLI`` after vector register allocation imposes fewer constraints on the machine scheduler since it cannot schedule instructions past ``vsetvli``\s, and it allows us to emit further vector pseudos during spilling or constant rematerialization.
 
@@ -300,7 +300,7 @@ After vector registers are allocated, the ``RISCVInsertVSETVLI`` pass will inser
   $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
 
 The physical ``$vl`` and ``$vtype`` registers are implicitly defined by the ``PseudoVSETVLI``, and are implicitly used by the ``PseudoVADD``.
-The VTYPE operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
+The ``vtype`` operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
 
 ``RISCVInsertVSETVLI`` performs dataflow analysis to emit as few ``vsetvli``\s as possible. It will also try to minimize the number of ``vsetvli``\s that set VL, i.e., it will emit ``vsetvli x0, x0`` if only ``vtype`` needs changed but ``vl`` doesn't.