[cfe-dev] RFC: C and C++ extension to support variable-length register-sized vector types

Sun May 13 21:21:12 PDT 2018

Richard Sandiford wrote:
> TL;DR: This is an RFC about adding variable-length register-sized vector
> types to C and C++.  Near the end of the message there are some
Phabricator
> links to the clang implementation (which we're only posting to back up
> the RFC; it's not intended for commit).
>
>
> Summary
> =======
>
> This is an RFC about some C and C++ language changes related to Arm's
> Scalable Vector Extension (SVE).  A detailed description of SVE is
> available here:
>
>     https://static.docs.arm.com/ddi0584/a/DDI0584A_a_SVE_supp_armv8A.pdf

It's been almost two weeks and no one else has replied to this, so I
thought I'd at least make sure that everyone is aware that this work is
relevant not only to ARM's SVE but also to the proposed RISC-V Vector
Extension.

I don't have detailed comments on the proposed changes to C/C++ or the
implementation, but I do want to comment on some of the differences between
SVE and RVV.

Hopefully it is possible to come up with a single proposal which will be
suitable for both ARM and RISC-V.

The following reflects my personal opinion and understanding of RVV (which
is not yet finalised) and is not an official position of SiFive or the
RISC-V Foundation or its Vector working group.

> but the only feature that really matters for this RFC is that SVE has
> no fixed or preferred vector length.  Implementations can instead choose
> from a range of possible vector lengths, with 128 bits being the minimum
> and 2048 bits being the maximum.  The actual length is variable and only
> known at runtime.

My understanding is that in SVE the vector length in bits is fixed for any
given CPU core. I don't know what happens in a big.LITTLE system.

In RVV the vector length is potentially different in every loop nest, and
even from iteration to iteration of the same loop body. The minimum vector
length is one element. I don't think there is a defined maximum.

For example, the vector length will be shorter on the last iteration of a
loop than on the others if the length of the high-level vector is not an
exact multiple of the length of the vector registers. (RISC-V doesn't need
any "last elements clean up" code after the main vector loop)

It's also possible (depending on the micro-architecture) that if a vector
load or store instruction crosses a page boundary and there is a page fault
(or even TLB miss) or protection fault, that loop iteration will complete
using a shorter vector length (possibly without taking the fault at all),
and the next iteration of the loop will start from the beginning of the
next memory page.

Note that while in SVE all vector registers in a given CPU core have the
same number of bits (and thus different numbers of elements, depending on
the element size), in RVV all vector registers in a given loop body
iteration have the same number of elements (and thus different sizes in
bits, depending on the element size).

The prologue of a RISC-V vector processing loop contains an instruction
that declares how many registers of each element size are needed. It is
expected there will be implementations that have a single pool of vector
register storage (e.g. SRAM) that is dynamically divided into registers for
each loop. If, for example, you have 1024 bytes of vector register storage
and a particular loop asks for 1 vector with byte elements and 3 registers
with single-precision float elements then you might get a vector length for
*that* loop of 78 elements, with the byte vector starting at offset 0 in
the SRAM and the float vectors starting at offsets 80, 392, and 704.

> However, even though the length is variable, the concept of a
> "register-sized" C and C++ vector type makes just as much sense for SVE
> as it does for other vector architectures.  Vector library functions
> take such register-sized vectors as input and return them as results.
> Intrinsic functions are also just as useful as they are for other vector
> architectures, and they too take register-sized vectors as input and
> return them as results.

Intrinsic functions are absolutely required, and are I think the main
reason for such a low-level register-sized vector type to exist.

I'm not sure whether user-written functions operating on register-sized
vectors are useful enough to support. User-written functions would normally
take and return a higher-level vector type, and would implement the desired
functionality in terms of calls to other user-written functions (operating
on the high level vector as a whole) and/or explicit loops iterating
through the high level vector type using intrinsic functions on the
register-sized vector type proposed here.

> All these types are opaque builtin types and are only intended to be
> used with the associated ACLE intrinsics.  There are intrinsics for
> creating vectors from scalars, loading from scalars, storing to scalars,
> reinterpreting one type as another, etc.
>
> The idea is that the vector types would only be used for short-term
> register-sized working data.  Longer-term data would typically be stored
> out to arrays.

I agree with this.

> For example, the vector function underlying:
> > #pragma omp declare simd > double sin(double); > > would be: > >
svfloat64_t mangled_sin(svfloat64_t, svbool_t); > > (The svbool_t is
because SVE functions should be predicated by default, > to avoid the need
for a scalar tail.)

Passing a predicate vector would work in RVV, but it's not necessary as any
function that takes a low-level vector-register argument will automatically
operate on the correct amount of it because it will share the same current
value in the Vector Length register.

Such a function might also *decrease* the value in the Vector Length
register if, for example, it encounters a fault in a vector load or store
within the function.

> The approach we took was to treat all the SVE types as permanently
> incomplete.

This seems reasonable.

> Specific things we wanted to remain invalid -- by inheriting the rules from
> incomplete types -- were:
>
>   * creating or accessing arrays that have sizeless types
>   * doing pointer arithmetic on pointers to sizeless types

But when writing a strip-mining loop you need to be able to increment the
pointer to the last vector-register worth of data to point to the address
of the next vector-register worth of data to be processed.

I think in this regard a sv<base>_t* should act like a base*. The compiler
doesn't know the vector length, but it knows the element size. The code at
runtime *has* to know the vector length.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180514/b2bd0493/attachment.html>