[cfe-dev] RFC: C and C++ extension to support variable-length register-sized vector types

Tue May 15 09:02:35 PDT 2018

Hi Bruce,

Thanks for the reply.

Bruce Hoult via cfe-dev <cfe-dev at lists.llvm.org> writes:
> Richard Sandiford wrote:
>> TL;DR: This is an RFC about adding variable-length register-sized vector
>> types to C and C++.  Near the end of the message there are some Phabricator
>> links to the clang implementation (which we're only posting to back up
>> the RFC; it's not intended for commit).
>> 
>> 
>> Summary
>> =======
>> 
>> This is an RFC about some C and C++ language changes related to Arm's
>> Scalable Vector Extension (SVE).  A detailed description of SVE is
>> available here:
>> 
>>     https://static.docs.arm.com/ddi0584/a/DDI0584A_a_SVE_supp_armv8A.pdf
>
> It's been almost two weeks and no one else has replied to this, so I thought
> I'd at least make sure that everyone is aware that this work is relevant not
> only to ARM's SVE but also to the proposed RISC-V Vector Extension.
>
> I don't have detailed comments on the proposed changes to C/C++ or the
> implementation, but I do want to comment on some of the differences
> between SVE and RVV.
>
> Hopefully it is possible to come up with a single proposal which will be
> suitable for both ARM and RISC-V.
>
> The following reflects my personal opinion and understanding of RVV (which is
> not yet finalised) and is not an official position of SiFive or the RISC-V
> Foundation or its Vector working group.
>
>> but the only feature that really matters for this RFC is that SVE has
>> no fixed or preferred vector length.  Implementations can instead choose
>> from a range of possible vector lengths, with 128 bits being the minimum
>> and 2048 bits being the maximum.  The actual length is variable and only
>> known at runtime.
>
> My understanding is that in SVE the vector length in bits is fixed for any
> given CPU core. I don't know what happens in a big.LITTLE system.

SVE hardware has a maximum vector length, but it's possible for software
to choose a vector length that is less than that if necessary.

> In RVV the vector length is potentially different in every loop nest, and even
> from iteration to iteration of the same loop body. The minimum vector length is
> one element. I don't think there is a defined maximum.
>
> For example, the vector length will be shorter on the last iteration of a loop
> than on the others if the length of the high-level vector is not an exact
> multiple of the length of the vector registers. (RISC-V doesn't need any "last
> elements clean up" code after the main vector loop)

My understanding from the LLVM RFC about RVV was that there were two
vector lengths of interest, the "maximum vector length" (MVL) and the
active vector length.  Is that right?  Is it likely that the MVL would
change from one iteration to another, or would only the active vector
length change?

If only the active vector length changes during the loop then I would
imagine the sizeless type proposal might map well to the MVL-dependent
register types.  The active vector length would then be an on-the-side
global property that says how many bits of those register types are
currently significant.  This would be similar to the role played by the
predicate registers in SVE.

> It's also possible (depending on the micro-architecture) that if a vector load
> or store instruction crosses a page boundary and there is a page fault (or even
> TLB miss) or protection fault, that loop iteration will complete using a
> shorter vector length (possibly without taking the fault at all), and the next
> iteration of the loop will start from the beginning of the next memory page.

Same question here I suppose: is the active length the one that would
change?  (Assuming I've understood the distinction.)

> Note that while in SVE all vector registers in a given CPU core have the same
> number of bits (and thus different numbers of elements, depending on the
> element size), in RVV all vector registers in a given loop body iteration have
> the same number of elements (and thus different sizes in bits, depending on the
> element size).

OK.

> The prologue of a RISC-V vector processing loop contains an instruction that
> declares how many registers of each element size are needed. It is expected
> there will be implementations that have a single pool of vector register
> storage (e.g. SRAM) that is dynamically divided into registers for each loop.
> If, for example, you have 1024 bytes of vector register storage and a
> particular loop asks for 1 vector with byte elements and 3 registers with
> single-precision float elements then you might get a vector length for *that*
> loop of 78 elements, with the byte vector starting at offset 0 in the SRAM and
> the float vectors starting at offsets 80, 392, and 704.
>
>
>> However, even though the length is variable, the concept of a
>> "register-sized" C and C++ vector type makes just as much sense for SVE
>> as it does for other vector architectures.  Vector library functions
>> take such register-sized vectors as input and return them as results.
>> Intrinsic functions are also just as useful as they are for other vector
>> architectures, and they too take register-sized vectors as input and
>> return them as results.
>
> Intrinsic functions are absolutely required, and are I think the main reason
> for such a low-level register-sized vector type to exist.
>
> I'm not sure whether user-written functions operating on
> register-sized vectors are useful enough to support. User-written
> functions would normally take and return a higher-level vector type,
> and would implement the desired functionality in terms of calls to
> other user-written functions (operating on the high level vector as a
> whole) and/or explicit loops iterating through the high level vector
> type using intrinsic functions on the register-sized vector type
> proposed here.

The idea here was more to support people who wanted to write custom
implementations of "#pragma omp declare simd" routines in C rather
than asm.  These routines would normally take register-sized inputs
and outputs by value, so C would need to provide a way of doing the
same.  This is how SLEEF is written, for example.  I agree it isn't
likely to be important for higher-level functions.

>> All these types are opaque builtin types and are only intended to be
>> used with the associated ACLE intrinsics.  There are intrinsics for
>> creating vectors from scalars, loading from scalars, storing to scalars,
>> reinterpreting one type as another, etc.
>>
>> The idea is that the vector types would only be used for short-term
>> register-sized working data.  Longer-term data would typically be stored
>> out to arrays.
>
> I agree with this.
>
>> For example, the vector function underlying:
>>
>> #pragma omp declare simd
>> double sin(double);
>>
>> would be:
>>
>> svfloat64_t mangled_sin(svfloat64_t, svbool_t); > > (The svbool_t is because
>
> SVE functions should be predicated by default, > to avoid the need for a scalar
> tail.)
>
> Passing a predicate vector would work in RVV, but it's not necessary as any
> function that takes a low-level vector-register argument will automatically
> operate on the correct amount of it because it will share the same current
> value in the Vector Length register.
>
> Such a function might also *decrease* the value in the Vector Length register
> if, for example, it encounters a fault in a vector load or store within the
> function.
>
>> The approach we took was to treat all the SVE types as permanently
>> incomplete.
>
> This seems reasonable.
>
>> Specific things we wanted to remain invalid -- by inheriting the rules from
>> incomplete types -- were:
>>
>>   * creating or accessing arrays that have sizeless types
>>   * doing pointer arithmetic on pointers to sizeless types
>
> But when writing a strip-mining loop you need to be able to increment the
> pointer to the last vector-register worth of data to point to the address of
> the next vector-register worth of data to be processed.
>
> I think in this regard a sv<base>_t* should act like a base*. The compiler
> doesn't know the vector length, but it knows the element size. The code at
> runtime *has* to know the vector length.

The idea with the SVE intrinsics was that normal loads and stores would
operate on <base>_t* rather than sv<base>_t*, and any pointer increment
would be to bump the <base>_t* by the number of elements.  E.g. the SVE
intrinsic code to do a normal contiguous load followed by a pointer
increment would be:

  svbool_t pg;
  uint32_t *ptr;

  svint32_t vec = svld1(pg, ptr);
  ptr += svcntw();

Using and dereferencing svint32_t* wouldn't be correct when pg is only
partial, such as the last iteration of the loop, since there aren't
guaranteed to be a full vector's worth of elements at *ptr.  However,
it would be valid to do:

  svint32_t v1, v2;
  ...
  std::swap (v1, v2);

where std::swap operates on svint32_t&s.  The same could be done in C
using:

  void swap(svint32_t *a, svint32_t *b) {
    svint32_t tmp = *a;
    *a = *b;
    *b = *a;
  }
  ...
    svint32_t v1, v2;
    ...
    swap (&v1, &v2);

We included pointers and references to sizeless types for cases like these.

It sounds like the same arrangement could also work for RVV, if the
types did represent MVL-dependent register types and if svlenw() were
replaced by the RVV intrinsic to read the active vector length.

Thanks,
Richard