[llvm-dev] [RFC] Supporting ARM's SVE in LLVM

Fri Nov 4 07:37:26 PDT 2016

+CC rest of the SVE LLVM development team.

Thanks,
Amara

On 11/4/16, 7:13 AM, "llvm-dev on behalf of Graham Hunter via llvm-dev" <llvm-dev-bounces at lists.llvm.org on behalf of llvm-dev at lists.llvm.org> wrote:

    Hi,

    We've been working for the last two years on support for ARM's Scalable Vector Extension in LLVM, and we'd like to upstream our work. We've had to make several design decisions without community input, and would like to discuss the major changes we've made. To help with the discussions, I've attached a technical document (also in plain text below) to describe the semantics of our changes. We've made the entire patch available as a github repository at https://github.com/ARM-software/LLVM-SVE and our presentation slides and video from the DevMeeting should be available on the llvm website soon.

    A few caveats:
    * This is not intended for code review, just to serve as context for discussion when we post patches for individual pieces (the first of which should be posted within the next two weeks)
    * It's based on a version of upstream llvm which is a few months old at this point
    * This is a warts-and-all release of our development tree, with plenty of TODOs and unfinished experiments
    * We haven't posted our clang changes yet

    -Graham

    Support for Scalable Vector Architectures in LLVM IR

    # Background

    *ARMv8-A Scalable Vector Extensions* is a vector ISA extension for AArch64,
    featuring lane predication, scatter-gather, and speculative loads. It is
    intended to scale with hardware such that the same binary running on a processor
    with wider vector registers can take advantage of the increased compute power
    without recompilation. This document describes changes made to LLVM that allow
    its vectorizer to better target SVE.

    As the vector length is no longer a statically known value, the way in which the
    LLVM vectorizer generates code required modifications such that certain values
    are now runtime evaluated expressions. The following section documents the
    additional IR instructions needed to achieve this. During the design of these
    extensions, we endeavoured to strike a balance between elegant and succinct
    design and the pragmatic realities of generating efficient code for the SVE
    architecture. We believe that the current design has a relatively small impact
    on the IR and type system given the unusual characteristics of the target.

    # Types

    ## Overview:

    ## IR Class Changes:

    To represent a vector of unknown length a scaling property is added to the
    `VectorType` class whose element count becomes an unknown multiple of a known
    minimum element count. To be specific `unsigned NumElements` is replaced by
    `class ElementCount` with its two members:

    * `unsigned Min`: the minimum number of elements.
    * `bool Scalable`: is the element count an unknown multiple of `Min`?

    For non-scalable vectors (`Scalable=false`) the scale is assumed to be one and
    thus `Min` becomes identical to the original `NumElements` property.

    This mixture of static and runtime quantities allow us to reason about the
    relationship between different scalable vector types without knowing their
    exact length.

    ## IR Interface:

    ```cpp
      static VectorType *get(Type *ELType, ElementCount EC);
      // Legacy interface whilst code is migrated.
      static VectorType *get(Type *ElType, unsigned NumEls, bool Scalable=false);

      ElementCount getElementCount();
      bool isScalable();
    ```

    ## IR Textual Form:

    The textual IR for vectors becomes:

    `<[n x] <m> x <type>>`

    where `type` is the scalar type of each element, `m` corresponds to `Min` and
    the optional string literal `n x ` signifies `Scalable` is *true* when present,
    false otherwise. The textual IR for non-scalable vectors is thus unchanged.

    Scalable vectors with the same `Min` value have the same number of elements,
    and the same number of bytes if `Min * sizeof(type)` is the same:

    `<n x 4 x i32>` and `<n x 4 x i8>` have the same number of elements.

    `<n x 4 x i32>` and `<n x 8 x i16>` have the same number of bytes.

    ## SelectionDAG

    New scalable vector MVTs are added, one for each existing vector type. Scalable
    vector MVTs are modelled in the same way as the IR. Hence, `<n x 4 x i32>`
    becomes `nxv4i32`.

    ## MVT Interface:

    ```cpp
      static MVT getVectorVT(MVT VT, ElementCount EC);
      bool isScalableVector() const;
      static mvt_range integer_scalable_valuetypes();
      static mvt_range fp_scalable_valuetypes();
    ```

    To minimise the effort required for common code to preserve the scalable flag
    we extend the helper functions within MVT/EVT classes to cover more cases. For
    example:

    ```cpp
      /// Return a VT for a vector type whose attributes match ourselves
      /// but with an element type chosen by the caller.
      EVT changeVectorElementType(EVT EltVT)`
    ```

    # Instructions

    The majority of instructions work seamlessly when applied to scalable vector
    types. However, on occasion assumptions are made that allow vectorization logic
    to be reduced directly to constants completely bypassing the IR (e.g. when the
    element count is known). These assumption are generally unsafe for scalable
    vectors which forces a requirement to express more logic in IR. To help with
    this the following instruction are added to the IR.

    ## *elementcount*

    ### Syntax:

    `<result> = elementcount <n x m x ty> <v1> as <ty2>`

    ### Overview:

    This instruction returns the actual number of elements in the input vector of
    type `<n x m x ty>` as the scalar type `<ty2>`. This is primarily used to
    increment induction variables and replaces many uses of `VF` within the loop
    vectorizer.

    ### IRBuilder Interface:

    ```cpp
      Value *CreateElementCount(Type *Ty, Value *V, const Twine &Name = "");
    ```

    ### Fixed-Width Behaviour:

    A constant with the value of `Min` is created.

    ### SelectionDAG:

    See [*ISD::ELEMENT_COUNT*](#isdelementcount).

    ## *seriesvector*

    ### Syntax:

    `<result> = seriesvector <ty> <v1>, <v2> as <n x m x ty>`

    ### Overview:

    This instruction returns a vector of type `<n x m x ty>` with elements forming
    the arithmetic series:

    `elt[z] = v1 + z * v2`

    where `z` is the element index and `<ty>` a scalar type. This is primarily used
    to represent a vector of induction values leading to:

    * Predicate creation using vector compares for fully predicated loops (see also:
      [*propff*](#propff), [*test*](#test)).
    * Creating offset vectors for gather/scatter via `getelementptr`.
    * Creating masks for `shufflevector`.

    For the following loop, a seriesvector instruction based on `i` is used to
    create the data vector to store:

    ```cpp
      unsigned a[LIMIT];

      for (unsigned i = 0; i < LIMIT; i++) {
        a[i] = i;
      }
    ```

    *seriesvector* instructions can optionally have *NUW* and *NSW* flags attached
    to them. The semantics of these flags are intended to match those of the usual
    scalar instruction wrap flags, but apply element-wise to the result vector. If
    any element addition of the current value and step results in a signed or
    unsigned overflow with the corresponding flag set, then the result is a poison
    value for the entire vector. However, this feature is not currently used.

    ### IRBuilder Interface:

    ```cpp
      Value *CreateSeriesVector(VectorType::ElementCount EC, Value *Start,
                                Value* Step, const Twine &Name = "",
                                bool HasNUW = false, bool HasNSW = false);
    ```

    ### Fixed-Width Behaviour:

    A constant vector is created with the same arithmetic series.

    ### SelectionDAG:

    See [*ISD::SERIES_VECTOR*](#isdseriesvector).

    ## *test*

    ### Syntax:

    `<result> = test <cond> <ty> <v1>`

    ### Overview:

    This instruction returns a scalar boolean value based on the comparison of a
    boolean or vector boolean operand. It allows us to reason about a value as a
    whole rather than the per scalar comparison obtained from say *icmp*. The most
    common use is testing the result of *icmp*/*fcmp*.

    We use this for the controlling predicate of fully predicated loops. Loops with
    a termination condition as follows:

    ```cpp
      for (i = 0; i < LIMIT; ++i)
    ```

    are handled by testing their control predicate for `first true` to signify
    another iteration is required.

    Support for scalar booleans is simply to provide symmetry so that all variants
    of *icmp*/*fcmp* can be passed as input to *test*.

    #### Supported Conditions:

    * all false
    * all true
    * any false
    * any true
    * first false
    * first true
    * last false
    * last true

    ### IRBuilder Interface:

    ```cpp
      Value *CreateTest(TestInst::Predicate P, Value *V, const Twine &Name = "");
    ```

    ### Fixed-Width Behaviour

    Same as scalable.

    ### SelectionDAG:

    See [*ISD::TEST_VECTOR*](#isdtestvector).

    ## *propff*

    ### Syntax:

    `<result> = propff <ty> <v1>, <v2>`

    ### Overview:

    This instruction creates a partitioned boolean vector based on the position of
    the first boolean false value in the concatenation of input boolean vectors `v1`
    and `v2`. Given vectors of length `k`, an element `x[i]` in the result vector is
    true if, and only if, each element `y[0]...y[i+k]` in the concatenated input
    vector `y=v1:v2` is true.

    The following examples show the results for a full `v1` and a `v2` which has
    reached a termination condition, and a `v1` which previously reached a
    termination condition with any `v2`.

    \ ![Propff Examples](Propff_Behaviour.png)

    A core use of this instruction is to protect against integer overflow that can
    occur when maintaining the induction vector of a fully predicated loop. For SVE
    the issue is further compounded with its support of non-power-of-2 vector
    lengths.

    We believe *propff* to be the least complex partitioning instruction to provide
    sufficient abstraction yet achieve the expected SVE code quality. Although
    other partitioning schemes can be modelled using *propff* we feel a more
    [capable partitioning instruction](#partition) is worth highlighting because
    it can simplify the vectorization of more loops.

    ### IRBuilder Interface:

    ```cpp
      Value *CreatePropFF(Value* P1, Value *P2, const Twine &Name = "");
    ```

    ### Fixed-Width Behaviour:

    Same as scalable.

    ### SelectionDAG:

    See [*ISD::PROPAGATE\_FIRST\_ZERO*](#isdpropagatefirstzero).

    ## *shufflevector*

    ### Syntax:

    `<result> = shufflevector <ty1> <v1>, <ty1> <v2>, <ty2> <mask>`

    ### Overview:

    Not a new instruction but *shufflevector* is extended to accept non-constant
    masks. For code that expects the original restriction
    `ShuffleVectorInst::getMaskValue` has changed to guard against unsafe uses.

    ```cpp
      // [old] Returns the mask value.
      static int getMaskValue(Constant *Mask, unsigned i);
      // [new] Returns true when Result contains the mask value, false otherwise.
      static bool getMaskValue(Value *Mask, unsigned i, int &Result);
    ```

    Similar changes are made to related interfaces and their users.

    ### IRBuilder Interface:

    ```cpp
      Value *CreateShuffleVector(Value *V1, Value *V2, Value *Mask,
                                 const Twine &Name = "");
    ```

    ### Fixed-Width Behaviour:

    No change.

    ### SelectionDAG:

    See [*ISD::VECTOR\_SHUFFLE\_VAR*](#isdvectorshufflevar).

    # Constants

    Scalable vectors of known constants cannot be represented within LLVM due to
    their unknown element count. Optimizations performed on scalable vectors become
    much more reliant on `ConstantExpr`s. Most of the instructions talked about in
    this document also have a matching `ConstantExpr` with the most common constant
    folds duplicated to work with them.

    This is most prevalent when considering vector splats. These are simulated via
    a sequence of `insertelement`and `shufflevector` that are usually resolved to a
    `Constant`. For scalable vectors this cannot be done and the original sequence
    is maintained. To mitigate this, `PatternMatch` is extended to better support
    `ConstantExpr`s along with new helpers like:

    ```cpp
      #define m_SplatVector(X) \
        m_ShuffleVector( \
          m_InsertElement(m_Undef(), X, m_Zero()), \
          m_Value(), \
          m_Zero()) \
    ```

    Zero is the exception with `zeroinitializer` applying equally well to scalable
    and non-scalable vectors.

    # Predicated floating point arithmetic {#predfparith}

    When a loop is fully predicated it becomes necessary to have masked versions of
    faulting instructions. Today we use the existing masked memory intrinsics to
    ensure we do not access memory that would not have been accessed by the
    original scalar loop.

    The same principle applies to floating point instructions whereby we should not
    trigger an exception that would not have been produced by the original scalar
    loop.

    We don't handle this today but are seeking advice as to the preferred method.
    Options include:

    1. Add masked versions of the affected instructions.
    2. Extend existing instructions to take a predicate.
    3. Construct IR that contains "safe" values for undefined locations.

    Given the choice, Option 1 is our preferred route as it's in line with the
    introduction of the masked memory intrinsics. Although, given the greater
    potential for optimization, instruction forms may be preferable to intrinsics.

    See also [*Predicated floating point intrinsics*](#predfpint)

    # Intrinsics

    ## *llvm.masked.spec.load*

    ### Syntax:

    `<result> = call {<data>, <pred>} @llvm.masked.spec.load.<type>(<ptr>, <align>, <mask>, <merge>)`

    ### Overview:

    This intrinsic represents a load whereby only the first active lane is expected
    to succeed with remaining lanes loaded speculatively. Along with the data this
    intrinsic returns a predicate indicating which lanes of data are valid.

    ### Interface:

    ```cpp
      CallInst *CreateMaskedSpecLoad(Value *Ptr, unsigned Align, Value *Mask,
                                     Value *PassThru = 0, const Twine &Name = "");
    ```

    ### Fixed-Width Behaviour:

    Same as scalable.

    ## *llvm.ctvpop*

    ### Syntax:

    ### Overview:

    This intrinsic returns a count of the set bits across a complete vector. It's
    most useful when operating on predicates as it allows a portable way to create
    predicate vectors that are partitioned differently (e.g. including the lane
    with the first change, rather than *propff*'s exclusive behaviour).

    ### Interface:
    ```cpp
      CallInst *CreateCntVPop(Value *Vec, const Twine &Name);
    ```

    ### Fixed-Width Behaviour

    Same as scalable. However, it can also be modelled using `*llvm.ctpop*`
    followed by a scalarized horizontal reduction. This is not possible with
    scalable vectors because scalarization is not an option.

    ## *horizontal reductions*

    Loops containing reductions are first vectorized as if no such reduction
    exists, thus building a vector of accumulated results. When exiting the vector
    loop a final in-vector reduction is done by effectively scalarizing the
    operation across each of the result vector's elements.

    For scalable vectors, whose element count is unknown, such scalarization is not
    possible. For this reason we introduce target specific intrinsics to support
    common reductions (e.g. horizontal add), along with TargetTransformInfo
    extensions to query their availability. If a scalable vector loop requires a
    reduction that's not provided by the target its vectorization is prevented.

    When fully predicated vectorization is required, additional work is done within
    the vector loop to ensure inactive lanes don't affect the accumulated result.
    See also [Predicated floating point arithmetic](#predfparith).

    ## *Predicated floating point intrinsics* {#predfpint}

    Libcalls not directly supported by the code generator must be serialized (See
    [here](#pass_libcall)). Also, for fully predicated loops the scalar calls must
    only occur for active lanes so we introduce predicated versions of the most
    common routines (e.g. `*llvm.pow`). This is mostly transparent to the loop
    vectorizer beyond the requirement to pass an extra parameter.

    ### TTI Interface:

    ```cpp
      bool canReduceInVector(const RecurrenceDescriptor &Desc, bool NoNaN) const;
      Value* getReductionIntrinsic(IRBuilder<> &Builder,
                                   const RecurrenceDescriptor& Desc, bool NoNaN,
                                   Value* Src) const;
    ```

    # SelectionDAG Nodes

    ## *ISD::BUILD_VECTOR* {#isdbuildvector}

    `BUILD_VECTOR(ELT0, ELT1, ELT2, ELT3,...)`

    This node takes as input a set of scalars to concatenate into a single vector.
    By its very nature this node is incompatible with scalable vectors because
    we don't know how many scalars it will takes to fill one.  All common code
    using this node has either been rewritten or bypassed.

    ## *ISD::ELEMENT_COUNT* {#isdelementcount}

    `ELEMENT_COUNT(TYPE)`

    See [*elementcount*](#elementcount).

    ## *ISD::EXTRACT_SUBVECTOR* {#isdextractsubvector}

    `EXTRACT_SUBVECTOR(VECTOR, IDX)`

    Usage of this node is typically linked to legalization where it's used to split
    vectors. The index parameter is often absolute and proportional to the input
    vector's element count.

    For scalable vectors an absolute index makes little sense. We have changed this
    parameter's meaning to become implicitly multiplied by `n` to match its main
    usage.

    The change has no effect when applied to non-scalable vectors, because `n ==
    1`. No target specific code is affected and in many cases common code becomes
    compatible with scalable vectors. For example:

    `nxv2i64 extract_subvector(nxv4i64, 2)`

    The real first lane becomes `n * 2`, resulting in the extraction of the top
    half of the input vector. This maintains the intension of the original code for
    both scalable and non-scalable vectors.

    For common code that truly requires an absolute index we recommend a new
    distinct ISD node to better differentiate such patterns.

    ## *ISD::INSERT_SUBVECTOR* {#isdinsertsubvector}

    `INSERT_SUBVECTOR(VECTOR1, VECTOR2, IDX)`

    We have made the equivalent change to this node's index parameter to match the
    behaviour of [ISD::EXTRACT_SUBVECTOR](#ISD::EXTRACT_SUBVECTOR).

    ## *ISD::PROPAGATE\_FIRST\_ZERO* {#isdpropagatefirstzero}

    `PROPAGATE_FIRST_ZERO(VEC1, VEC2)`

    See [*propff*](#propff).

    ## *ISD::SERIES_VECTOR* {#isdseriesvector}

    `SERIES_VECTOR(INITIAL, STEP)`

    See [*seriesvector*](#seriesvector).

    ## *ISD::SPLAT_VECTOR* {#isdsplatvector}

    `SPLAT_VECTOR(VAL)`

    Within the code generator a splat is represented by ISD::BUILD_VECTOR and is
    thus incompatible with scalable vectors.

    We initially made use of [ISD::SERIES_VECTOR](#seriesvector) with a zero stride
    but that brings with it a requirement for [ISD::SERIES_VECTOR](#seriesvector)
    to support floating-point types.  For this reason we created an explicit node
    that also allowed less complex looking legalization/selection code.

    ## *ISD::TEST_VECTOR* {#isdtestvector}

    `TEST_VECTOR(VEC, PRED)`

    `VEC` is the boolean vector being tested, and `PRED` is a `TestCode` enum under
    the `llvm::ISD` namespace which contains all the supported conditions.

    See [*test*](#test).

    ## *ISD::VECTOR\_SHUFFLE\_VAR* {#isdvectorshufflevar}

    `VECTOR_SHUFFLE_VAR(VEC1, VEC2, VEC3)`

    See [*shufflevector*](#shufflevector).

    # AArch64 Specific Changes

    ## Instruction Selection

    In order to allow proper instruction selection there must be a direct mapping
    from MVTs to SVE registers. SVE data registers have a length in multiples of
    128bits (with 128bits being the minimum) and predicate registers have a bit for
    every byte of a data register.

    Given the 128bit minimum we map scalable vector MVTs whose static component is
    also 128bits (e.g. MVT::nxv4i32) directly to SVE data registers. Scalable
    vector MVTs with an `i1` element type and a static element count of 16
    (128/8 = 16) or fewer (e.g. MVT::nxv4i1) are mapped to SVE predicate registers.

    All other integer MVTs are considered illegal and are either split or promoted
    accordingly. A similar rule applies to vector floating point MVTs but those
    types whose static component is less that 128bits (MVT::nx2f32) are also mapped
    directly to SVE data registers but in a form whereby elements are effectively
    interleaved with enough undefined elements to fulfil the 128bit requirement.

    We do this so that predicate bits correctly align to their data counterpart.
    For example, for all vector MVTs with two elements, a predicate of nxv2i1 is
    used, regardless of the data vector's element type.

    ## Stack Regions for SVE Objects

    As SVE registers have an unknown size their associated fill/spill instructions
    require an offset that will be implicitly scaled by the vector length instead
    of bytes or element size. To accommodate this we introduce the concept of
    `Stack Regions` that are areas on the stack associated with specific data types
    or register classes.

    Each `Stack Region` controls its own allocation, deallocation and internal
    offset calculations. For SVE we create separate `Stack Region`s for its data
    and predicate registers. Objects not belonging to a `Stack Region` use a
    default so that existing targets are not affected.

    ## SVE-Specific Optimizations

    The following SVE specific IR transformation passes are added to better guide
    code generation. In some instances this has affected how loops are vectorized
    because it's assumed they will occur. They also point to work other target's
    may wish to consider.

    ### SVE Post-Vectorization Optimization Pass

    This pass is responsible for converting generically vectorized loops into a
    form that more closely resembles SVE's style of vectorization. For example,
    rewriting a vectorized loop's control flow to utilize SVE's `while` instruction.

    NOTE: This pass is very sensitive to the wrap flags (i.e. NSW/NUW). Much effort
    has gone into ensuring they are preserved during vectorization to the extent of
    introducing `llvm.assume` calls when beneficial.

    ### SVE Expand Libcall Pass{#pass_libcall}

    In order to keep the loop vectorizer generic we maintain the ability to
    vectorize libcalls we know the code generator cannot handle because scalable
    vectors cannot be scalarized. This pass serializes such calls by introducing
    loops using SVE's predicate iteration instructions.

    Note that although such serialization can be achieved generically using
    `extractelement`/`insertelement`, our experiments showed no route to efficient
    code generation.

    ### SVE Addressing Modes Pass

    This pass alleviates the negative effects of `Loop Strength Reduction` on
    address computations for SVE targeted loops, leading to better code quality.

    # Full Example

    The following section shows how a C loop with a sum reduction is represented in
    scalable IR (assuming -Ofast) and the final generated code.

    ```cpp
      int SimpleReduction(int *a, int count) {
        int res = 0;
        for (int i = 0; i < count; ++i)
          res += a[i];

        return res;
      }
    ```

    The IR representation shows each of the new instructions being used to control
    the loop, along with one of the horizontal reduction intrinsics. The main
    sequence (noted by the *;; Control* comments below) for controlling loop
    iterations is:

    1. Using *elementcount* to increment the induction variable
    2. Using *seriesvector* starting from the current induction variable value to
        compare against a splat of the loop's trip count
    3. Using *propff* to ensure the resulting predicate strictly partitions the
        predicate and does not wrap
    4. Using *test* to determine whether the first lane is active (and thus at
        least one more iteration is required)

    ```llvm
      define i32 @SimpleReduction(i32* nocapture readonly %a, i32 %count) #0 {
      entry:
        %cmp6 = icmp sgt i32 %count, 0
        br i1 %cmp6, label %min.iters.checked, label %for.cond.cleanup

      min.iters.checked:
        %0 = add i32 %count, -1
        %1 = zext i32 %0 to i64
        %2 = add nuw nsw i64 %1, 1
        %wide.end.idx.splatinsert = insertelement <n x 4 x i64> undef, i64 %2, i32 0
        %wide.end.idx.splat = shufflevector <n x 4 x i64> %wide.end.idx.splatinsert,
                                  <n x 4 x i64> undef, <n x 4 x i32> zeroinitializer
        %3 = icmp ugt <n x 4 x i64> %wide.end.idx.splat, seriesvector (i64 0, i64 1)
        %predicate.entry = propff <n x 4 x i1> shufflevector (<n x 4 x i1> insertelement
                                  (<n x 4 x i1> undef, i1 true, i32 0), <n x 4 x i1> undef,
                                  <n x 4 x i32> zeroinitializer), %3
        br label %vector.body

      vector.body:
                                                            %min.iters.checked
        %index = phi i64 [ 0, %min.iters.checked ], [ %index.next, %vector.body ]
        %predicate = phi <n x 4 x i1> [ %predicate.entry, %min.iters.checked ],
                                             [ %predicate.next, %vector.body ]
        %vec.phi = phi <n x 4 x i32> [ zeroinitializer, %min.iters.checked ],
                                                        [ %8, %vector.body ]
        %4 = icmp ult i64 %index, 4294967296
        call void @llvm.assume(i1 %4)
        %5 = getelementptr inbounds i32, i32* %a, i64 %index
        %6 = bitcast i32* %5 to <n x 4 x i32>*
        %wide.masked.load = call <n x 4 x i32> @llvm.masked.load.nxv4i32(<n x 4 x i32>* %6,
                            i32 4, <n x 4 x i1> %predicate, <n x 4 x i32> undef), !tbaa !1
        %7 = select <n x 4 x i1> %predicate, <n x 4 x i32> %wide.masked.load,
                                               <n x 4 x i32> zeroinitializer
        %8 = add nsw <n x 4 x i32> %vec.phi, %7
        %index.next = add nuw nsw i64 %index, elementcount (<n x 4 x i64> undef) ;; Control 1
        %9 = add nuw nsw i64 %index, elementcount (<n x 4 x i64> undef)
        %10 = seriesvector i64 %9, i64 1 as <n x 4 x i64>                        ;; Control 2
        %11 = icmp ult <n x 4 x i64> %10, %wide.end.idx.splat
        %predicate.next = propff <n x 4 x i1> %predicate, %11                    ;; Control 3
        %12 = test first true <n x 4 x i1> %predicate.next                       ;; Control 4
        br i1 %12, label %vector.body, label %middle.block, !llvm.loop !5

      middle.block:
        %.lcssa = phi <n x 4 x i32> [ %8, %vector.body ]
        %13 = call i64 @llvm.aarch64.sve.uaddv.nxv4i32(<n x 4 x i1> shufflevector
             (<n x 4 x i1> insertelement (<n x 4 x i1> undef, i1 true, i32 0),
             <n x 4 x i1> undef, <n x 4 x i32> zeroinitializer), <n x 4 x i32> %.lcssa)
        %14 = trunc i64 %13 to i32
        br label %for.cond.cleanup

      for.cond.cleanup:
        %res.0.lcssa = phi i32 [ 0, %entry ], [ %14, %middle.block ]
        ret i32 %res.0.lcssa
      }
    ```

    The main vector body of the resulting code is one instruction longer than it
    would be for NEON, but no scalar tail is required and performance will scale
    with register length. The *seriesvector*, *shufflevector*(splat), *icmp*,
    *propff*, *test* sequence has been recognized and transformed into the `whilelo`
    instruction.

    \newpage

    ```nasm
      SimpleReduction:
      // BB#0:
                subs    w9, w1, #1
                b.lt    .LBB0_4
      // BB#1:
                add     x9, x9, #1
                mov     x8, xzr
                whilelo p0.s, xzr, x9
                mov     z0.s, #0
      .LBB0_2:
                ld1w    {z1.s}, p0/z, [x0, x8, lsl #2]
                incw    x8
                add     z0.s, p0/m, z0.s, z1.s
                whilelo p0.s, x8, x9
                b.mi    .LBB0_2
      // BB#3:
                ptrue   p0.s
                uaddv   d0, p0, z0.s
                fmov    w0, s0
                ret
      .LBB0_4:
                mov     w0, wzr
                ret
    ```

    <!--
    Commented out until after DevMeeting talk
    # Clang Changes

    ## ACLE Types

    (TODO)
    -->

    # Appendix

    The following is an alternative to the [*propff*](#propff) instruction described
    above. We haven't implemented this (or worked out all the specific details)
    since we can synthesize the initial partitions required with extra
    operations, although matching those and transforming to appropriate AArch64 SVE
    instructions is fragile given other optimization passes. This instruction is
    more complex, but would allow us to explicitly specify the exact partition
    desired.

    ## *partition*

    ### Syntax:

    `<result> = partition first <part_on> [inclusive] <ty> <p>  [, propagating <pprop>]`

    `<result> = partition next <part_on> [inclusive] <ty> <p> from <pcont>`

    ### Overview:

    This instruction creates a partitioned boolean vector based on an input vector
    `p`. There are two main modes to consider. The *first* mode is an extended
    version of [*propff*](#propff) where `part_on` is a boolean which determines
    whether the partition is made based on the first false value or the first true
    value. Propagating from a previous vector is optional. The `inclusive` option
    produces a partition which includes the first true or false value; the default
    is an exclusive partition which only covers the elements before the first true
    or false value.

    In cases where there are multiple subsets of data in a vector which must be
    processed independently, we need a way to take the original boolean vector and
    an existing partition to create the next partition to work on. This is provided
    by the *next* form of *partition*. The `pcont` argument is the existing
    partitioned vector, generated either by a *partition first* or another
    *partition next*. If there are no more partitions after the current one, a
    boolean vector of all false will be returned.

    ### Fixed-Width Behaviour:

    Same as scalable.