[llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Thu Jun 1 07:22:24 PDT 2017

Hi,

Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.

-Graham

===================================================
Supporting Scalable Vector Architectures in LLVM IR
===================================================

==========
Background
==========

*ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
AArch64, featuring lane predication, scatter-gather, and speculative loads. It
is intended to scale with hardware such that the same binary running on a
processor with longer vector registers can take advantage of the increased
compute power without recompilation.

As the vector length is no longer a statically known value, the way in which the
LLVM vectorizer generates code required modifications such that certain values
are now runtime evaluated expressions. This document describes changes proposed
to LLVM that allow its vectorizer to better target SVE.

Documentation for SVE can be found at
https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a

=====
Types
=====

To represent a vector of unknown length a boolean `Scalable` property has been
added to the `VectorType` class. Most code that deals with vectors doesn't need
to know the exact length, but does need to know relative lengths -- e.g. get
a vector with the same number of elements but a different element type, or with
half or double the number of elements.

In order to allow code to transparently support scalable vectors, we introduce
an `ElementCount` class with two members:

- `unsigned Min`: the minimum number of elements.
- `bool Scalable`: is the element count an unknown multiple of `Min`?

For non-scalable vectors (``Scalable=false``) the scale is considered to be
equal to one and thus `Min` represents the exact number of elements in the
vector.

The intent for code working with vectors is to use convenience methods and avoid
directly dealing with the number of elements. If needed, calling
`getElementCount` on a vector type instead of `getVectorNumElements` can be used
to obtain the (potentially scalable) number of elements. Overloaded division and
multiplication operators allow an ElementCount instance to be used in much the
same manner as an integer for most cases.

This mixture of static and runtime quantities allow us to reason about the
relationship between different scalable vector types without knowing their
exact length.

IR Textual Form
---------------

The textual form for a scalable vector is:

``<[n x ]<m> x <type>>``

where `type` is the scalar type of each element, `m` is the minimum number of
elements, and the string literal `n x` indicates that the total number of
elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
indicating that the vector is scalable, and could be substituted by another.
For fixed-length vectors, the `n x` is omitted, so there is no change in the
format for existing vectors.

Scalable vectors with the same `Min` value have the same number of elements, and
the same number of bytes if `Min * sizeof(type)` is the same:

``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.

``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.

IR Bitcode Form
---------------

To serialize scalable vectors to bitcode, a new boolean field is added to the
type record. If the field is not present the type will default to a fixed-length
vector type, preserving backwards compatibility.

Alternatives Considered
-----------------------

We had two alternatives in mind -- a dedicated target specific type, and a
subclass inheriting from VectorType.

A dedicated target type would either need to extend all existing passes that
work with vectors to recognize the new type, or to duplicate all that code
in order to get reasonable code generation.

A subclass would take care of part of this, but would need still checks for
whether a VectorType was scalable in each place a new VectorType was required.

Although our current solution will need to change some of the code that creates
new VectorTypes, much of that code doesn't need to care about whether the types
are scalable or not -- they can use preexisting methods like
`getHalfElementsVectorType`. If the code is a little more complex,
`ElementCount` structs can be used instead of an `unsigned` value to represent
the number of elements.

=================
Runtime Constants
=================

With a scalable vector type defined, we now need a way to generate addresses for
consecutive vector values in memory and to be able to create basic constant
vector values.

For address generation, the `vscale` constant is added to represent the runtime
value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
bytes in `type` gives the total length of a scalable vector, and the backend
can pattern match to the various load and store instructions in SVE that
automatically scale with vector length.

For constant vector values, we cannot specify all the elements as we can for
fixed-length vectors; fortunately only a small number of easily synthesized
patterns are required for autovectorization. The `zeroinitializer` constant
can be used as with fixed-length vectors for a constant zero splat. This can
then be combined with `insertelement` and `shufflevector` to create arbitrary
value splats in the same manner as fixed-length vectors.

For constants consisting of a sequence of values, the `stepvector` constant is
added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
change the starting value a splat of the new start can be added, and changing
the step requires multiplying by a splat.

Alternatives Considered
-----------------------

We have chosen to represent these as constants to make pattern matching and
constant folding easy, particularly for the mask argument of the
`shufflevector` instruction.

Another possibility is to use intrinsics similar to the old instructions, but
currently these cannot be treated as constants. We wouldn't mind using
intrinsics here, but would want to discuss how best to support constant folding
and pattern matching without needing to add a lot more code.

=======
Example
=======

The following example shows a simple C loop which assigns the array index to
the array elements matching that index. The IR shows how vscale and stepvector
are used to create the needed values and to advance the index variable in the
loop.

C Code
------

``
void IdentityArrayInit(int *a, int count) {
  for (int i = 0; i < count; ++i)
    a[i] = i;
}
``

Scalable IR Vector Body
-----------------------

``
vector.body:                                      ; preds = %vector.body.preheader, %vector.body
  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]

           ;; stepvector used for index identity on entry to loop body ;;
  %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
                                [ stepvector, %vector.body.preheader ]
  %1 = add i64 %0, mul (i64 vscale, i64 4)

           ;; vscale splat used to increment identity vector ;;
  %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
                                                                                        i32 mul (i32 vscale, i32 4), i32 0),
                                                                                        <n x 4 x i32> undef,
                                                                                        <n x 4 x i32> zeroinitializer)
  %2 = getelementptr inbounds i32, i32* %a, i64 %0
  %3 = bitcast i32* %2 to <n x 4 x i32>*
  store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1

           ;; vscale used to increment loop index
  %index.next = add i64 %index, mul (i64 vscale, i64 4)
  %4 = icmp eq i64 %index.next, %n.vec
  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
``

=====================
Questions and Answers
=====================

Can the vector length change at runtime?
----------------------------------------

It is possible to change vector length at runtime, but there is no model defined
for how to resolve all the issues that arise when doing so. From the compiler's
point of view, it is expected that vector length can be treated as constant but
unknown.

Since there's currently no way to indicate to the compiler where a change might
occur, a programmer deliberately requesting changes to the vector length in the
middle of a function containing autovectorized code would be considered a bug.

If changing the VL at runtime becomes desireable at some point in the future,
the types and constants presented in this design would not need to change; we
would need some way of creating a barrier between sections of IR which might
have a different vscale, but that would be an addition to the current design
instead of a change.

How do we spill/fill scalable registers on the stack?
-----------------------------------------------------

SVE registers have a (partially) unknown size at build time and their associated
fill/spill instructions require an offset that is implicitly scaled by the
vector length instead of bytes or element size. To accommodate this we
created the concept of Stack Regions that are areas on the stack associated
with specific data types or register classes.

MachineFrameInfo has been extended with a list of Stack Regions that each
contain a list of associated objects. Each Stack Region controls its own
allocation, deallocation and internal offset calculations. For SVE we create
separate regions for data and predicate registers, so the exact layout does
not need to be known ahead of time, just relative offsets within regions.

Objects not belonging to a Stack Region use the default handling, so other
existing targets are not affected.

A more complete design for this will be detailed in a dedicated RFC later.