[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Mon Jun 11 08:48:50 PDT 2018

Hi Robin,

Thanks for the comments; replies inline (except for the stack regions question;
Sander is handling that side of things).

-Graham

On 8 Jun 2018, at 16:24, Robin Kruppe <robin.kruppe at gmail.com<mailto:robin.kruppe at gmail.com>> wrote:

Hi Graham,

First of all, thanks a lot for updating the RFC and also for putting up the
patches, they are quite interesting and illuminate some details I was curious
about. I have some minor questions and comments inline below but overall I
believe this is both a reasonably small extension of LLVM IR and powerful
enough to support SVE, RVV, and hopefully future ISAs with variable length
vectors. Details may change as we gather more experience, but it's a very
good starting point.

One thing I am missing is a discussion of how stack frame layout will be
handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)
gave little details and it was a long time ago anyway. What are your current
plans here?

I'll reply separately to the sub-thread about RISC-V codegen.

Cheers,
Robin

On 5 June 2018 at 15:15, Graham Hunter <Graham.Hunter at arm.com<mailto:Graham.Hunter at arm.com>> wrote:
Hi,

Now that Sander has committed enough MC support for SVE, here's an updated
RFC for variable length vector support with a set of 14 patches (listed at the end)
to demonstrate code generation for SVE using the extensions proposed in the RFC.

I have some ideas about how to support RISC-V's upcoming extension alongside
SVE; I'll send an email with some additional comments on Robin's RFC later.

Feedback and questions welcome.

-Graham

=============================================================
Supporting SIMD instruction sets with variable vector lengths
=============================================================

In this RFC we propose extending LLVM IR to support code-generation for variable
length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
approach is backwards compatible and should be as non-intrusive as possible; the
only change needed in other backends is how size is queried on vector types, and
it only requires a change in which function is called. We have created a set of
proof-of-concept patches to represent a simple vectorized loop in IR and
generate SVE instructions from that IR. These patches (listed in section 7 of
this rfc) can be found on Phabricator and are intended to illustrate the scope
of changes required by the general approach described in this RFC.

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
AArch64 which is intended to scale with hardware such that the same binary
running on a processor with longer vector registers can take advantage of the
increased compute power without recompilation.

As the vector length is no longer a compile-time known value, the way in which
the LLVM vectorizer generates code requires modifications such that certain
values are now runtime evaluated expressions instead of compile-time constants.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

========
Contents
========

The rest of this RFC covers the following topics:

1. Types -- a proposal to extend VectorType to be able to represent vectors that
   have a length which is a runtime-determined multiple of a known base length.

2. Size Queries - how to reason about the size of types for which the size isn't
   fully known at compile time.

3. Representing the runtime multiple of vector length in IR for use in address
   calculations and induction variable comparisons.

4. Generating 'constant' values in IR for vectors with a runtime-determined
   number of elements.

5. A brief note on code generation of these new operations for AArch64.

6. An example of C code and matching IR using the proposed extensions.

7. A list of patches demonstrating the changes required to emit SVE instructions
   for a loop that has already been vectorized using the extensions described
   in this RFC.

========
1. Types
========

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class, which indicates that the number of elements in
the vector is a runtime-determined integer multiple of the `NumElements` field.
Most code that deals with vectors doesn't need to know the exact length, but
does need to know relative lengths -- e.g. get a vector with the same number of
elements but a different element type, or with half or double the number of
elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of compile-time and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

The runtime multiple is not expected to change during program execution for SVE,
but it is possible. The model of scalable vectors presented in this RFC assumes
that the multiple will be constant within a function but not necessarily across
functions. As suggested in the recent RISC-V rfc, a new function attribute to
inherit the multiple across function calls will allow for function calls with
vector arguments/return values and inlining/outlining optimizations.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<scalable <n> x <type>>``

where `type` is the scalar type of each element, `n` is the minimum number of
elements, and the string literal `scalable` indicates that the total number of
elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
for indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `scalable` is omitted, so there is no change in
the format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
used within the same function):

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
  elements.

``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
  bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We did consider one main alternative -- a dedicated target type, like the
x86_mmx type.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation and autovectorization.

This hasn't been done for the x86_mmx type, and so it is only capable of
providing support for C-level intrinsics instead of being used and recognized by
passes inside llvm.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

===============
2. Size Queries
===============

This is a proposal for how to deal with querying the size of scalable types.
While it has not been implemented in full, the general approach works well
for calculating offsets into structures with scalable types in a modified
version of ComputeValueVTs in our downstream compiler.

Current IR types that have a known size all return a single integer constant.
For scalable types a second integer is needed to indicate the number of bytes
which need to be scaled by the runtime multiple to obtain the actual length.

For primitive types, getPrimitiveSizeInBits will function as it does today,
except that it will no longer return a size for vector types (it will return 0,
as it does for other derived types). The majority of calls to this function are
already for scalar rather than vector types.

For derived types, a function (getSizeExpressionInBits) to return a pair of
integers (one to indicate unscaled bits, the other for bits that need to be
scaled by the runtime multiple) will be added. For backends that do not need to
deal with scalable types, another function (getFixedSizeExpressionInBits) that
only returns unscaled bits will be provided, with a debug assert that the type
isn't scalable.

Similar functionality will be added to DataLayout.

Comparing two of these sizes together is straightforward if only unscaled sizes
are used. Comparisons between scaled sizes is also simple when comparing sizes
within a function (or across functions with the inherit flag mentioned in the
changes to the type), but cannot be compared otherwise. If a mix is present,
then any number of unscaled bits will not be considered to have a greater size
than a smaller number of scaled bits, but a smaller number of unscaled bits
will be considered to have a smaller size than a greater number of scaled bits
(since the runtime multiple is at least one).

You mention it's not fully implemented yet, but perhaps you have some thoughts
on what the APIs for this should look like?

For size comparisons it's concerning that it could silently give misleading
results when operating across function boundaries. One solution could be a
function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but
that extra parameter may turn out to be very viral. OTOH, maybe that's
unavoidable complexity from having type "sizes" vary between functions.

Alternatively, general size queries could return "incomparable" and "only if
in the same function" results in addition to smaller/larger/equal. This might
nudge code to handling all these possibilities as well as it can.

I agree that would be nice to catch invalid comparisons; I've considered a
function that would take two 'Value*'s instead of types so that the parent
could be determined, but I don't think that works in all cases.

Using an optional 'Function*' argument in a size comparison function will
work; I think many of the existing size queries are on scalar values in code
that has already explicitly checked that it's not working on aggregate types.

It may be a better idea to catch misuses when trying to clone instructions
into a different function.

Future Work
-----------

Since we cannot determine the exact size of a scalable vector, the
existing logic for alias detection won't work when multiple accesses
share a common base pointer with different offsets.

However, SVE's predication will mean that a dynamic 'safe' vector length
can be determined at runtime, so after initial support has been added we
can work on vectorizing loops using runtime predication to avoid aliasing
problems.

Alternatives Considered
-----------------------

Marking scalable vectors as unsized doesn't work well, as many parts of
llvm dealing with loads and stores assert that 'isSized()' returns true
and make use of the size when calculating offsets.

Seconded, I encountered those as well.

We have considered introducing multiple helper functions instead of
using direct size queries, but that doesn't cover all cases. It may
still be a good idea to introduce them to make the purpose in a given
case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.

+1 for clear helpers, but this sentiment is somewhat independent of the
changes to type sizes. For example, `isBitCastableTo` seems appealing to me
not just because it clarifies the intent of the size comparison but also
because it can encapsulate all the cases where types have the same size but
bitcasts still aren't allowed (ptr<->int, aggregates). With a good API for the
{scaled, unscaled} pairs, the size comparison should not be much more complex
than today.

Agreed, it's orthogonal to the scalable vector part.

Aside: I was curious, so I grepped and found that this specific predicate
already exists under the name CastInst::isBitCastable.

So it does; there's a few more cases around the codebase that don't use that function
though. I may create a cleanup patch for them.

Other possibilities might be 'requires[Sign|Zero]Extension', 'requiresTrunc', 'getLargestType',
etc.

========================================
3. Representing Vector Length at Runtime
========================================

With a scalable vector type defined, we now need a way to represent the runtime
length in IR in order to generate addresses for consecutive vectors in memory
and determine how many elements have been processed in an iteration of a loop.

We have added an experimental `vscale` intrinsic to represent the runtime
multiple. Multiplying the result of this intrinsic by the minimum number of
elements in a vector gives the total number of elements in a scalable vector.

Fixed-Length Code
-----------------

Assuming a vector type of <4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %index.next = add i64 %index, 4
  ;; <check and branch>
``
Scalable Equivalent
-------------------

Assuming a vector type of <scalable 4 x <ty>>
``
vector.body:
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  ;; <loop body>
  ;; Increment induction var
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()

I didn't see anything about this in the text (apologies if I missed
something), but it appears this intrinsic is overloaded to be able to return
any integer width. It's not a big deal either way, but is there a particular
reason for doing that, rather than picking one sufficiently large integer type
and combining it with trunc/zext as appropriate?

It's a leftover from me converting from the constant, which could be of any
integer type.

  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)

Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a
pseudo-IR shorthand or an artifact from when vscale was proposed as a constant
expression or something? I would have expected:

```
%vscale64 = call i64 @llvm.experimental.vector.vscale.64()
%vscale64.x4 = mul i64 %vscale64, 4
%index.next = add i64 %index, %vscale64.x4
```

The latter, though in this case I updated the example IR by hand so didn't catch
that case ;)

The IR in the example patch was generated by the compiler, and does split out the
multiply into a separate instruction (which is then strength-reduced to a shift).

  ;; <check and branch>
``
===========================
4. Generating Vector Values
===========================
For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used in the same manner as fixed-length vectors for a constant zero
splat. This can then be combined with `insertelement` and `shufflevector`
to create arbitrary value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, an experimental `stepvector`
intrinsic has been added to represent a simple constant of the form
`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
start can be added, and changing the step requires multiplying by a splat.

+1 for making this intrinsic minimal and having canonical IR instruction
sequences for things like stride and starting offset.

Fixed-Length Code
-----------------
``
  ;; Splat a value
  %insert = insertelement <4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
  ;; Add a constant sequence
  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
``
Scalable Equivalent
-------------------
``
  ;; Splat a value
  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Splat offset + stride (the same in this case)
  %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  ;; Create sequence for scalable vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
  ;; Add the runtime-generated sequence
  %add = add <scalable 4 x i32> %splat, %addoffset
``
Future Work
-----------

Intrinsics cannot currently be used for constant folding. Our downstream
compiler (using Constants instead of intrinsics) relies quite heavily on this
for good code generation, so we will need to find new ways to recognize and
fold these values.

==================
5. Code Generation
==================

IR splats will be converted to an experimental splatvector intrinsic in
SelectionDAGBuilder.

All three intrinsics are custom lowered and legalized in the AArch64 backend.

Two new AArch64ISD nodes have been added to represent the same concepts
at the SelectionDAG level, while splatvector maps onto the existing
AArch64ISD::DUP.

GlobalISel
----------

Since GlobalISel was enabled by default on AArch64, it was necessary to add
scalable vector support to the LowLevelType implementation. A single bit was
added to the raw_data representation for vectors and vectors of pointers.

In addition, types that only exist in destination patterns are planted in
the enumeration of available types for generated code. While this may not be
necessary in future, generating an all-true 'ptrue' value was necessary to
convert a predicated instruction into an unpredicated one.

==========
6. Example
==========

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body.preheader:
  ;; Other setup
  ;; Stepvector used to create initial identity vector
  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
  br vector.body

vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
                                     [ %stepvector, %vector.body.preheader ]
  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
  %vscale32 = trunc i64 %vscale64 to i32
  %1 = add i64 %0, mul (i64 %vscale64, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
  %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <scalable 4 x i32>*
  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

==========
7. Patches
==========

List of patches:

1. Extend VectorType: https://reviews.llvm.org/D32530
2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
5. SVE Calling Convention: https://reviews.llvm.org/D47771
6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
7. Add VScale intrinsic: https://reviews.llvm.org/D47773
8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
10. Initial store patterns: https://reviews.llvm.org/D47776
11. Initial addition patterns: https://reviews.llvm.org/D47777
12. Initial left-shift patterns: https://reviews.llvm.org/D47778
13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
14. Prevectorized loop unit test: https://reviews.llvm.org/D47780

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180611/d9011d45/attachment.html>