[cfe-dev] RFC: C and C++ extension to support variable-length register-sized vector types

Tue May 1 05:16:34 PDT 2018

TL;DR: This is an RFC about adding variable-length register-sized vector
types to C and C++.  Near the end of the message there are some Phabricator
links to the clang implementation (which we're only posting to back up
the RFC; it's not intended for commit).

Summary
=======

This is an RFC about some C and C++ language changes related to Arm's
Scalable Vector Extension (SVE).  A detailed description of SVE is
available here:

    https://static.docs.arm.com/ddi0584/a/DDI0584A_a_SVE_supp_armv8A.pdf

but the only feature that really matters for this RFC is that SVE has
no fixed or preferred vector length.  Implementations can instead choose
from a range of possible vector lengths, with 128 bits being the minimum
and 2048 bits being the maximum.  The actual length is variable and only
known at runtime.

However, even though the length is variable, the concept of a
"register-sized" C and C++ vector type makes just as much sense for SVE
as it does for other vector architectures.  Vector library functions
take such register-sized vectors as input and return them as results.
Intrinsic functions are also just as useful as they are for other vector
architectures, and they too take register-sized vectors as input and
return them as results.

The RFC is about a way of adding these variable-length register-sized
vector types to C and C++.  Specifically, the idea is to treat them
as a new form of incomplete type, with rules that are more relaxed
than for normal incomplete types.

The main question is: does the approach below seem reasonable?

SVE ACLE
========

For reference, Arm has published an SVE "ACLE" that specifies the SVE
types and intrinsic functions in detail:

    https://static.docs.arm.com/100987/0000/acle_sve_100987_0000_00_en.pdf

but I'll try to keep the RFC self-contained.

Scope
=====

The RFC is purely about low-level register-sized vector types that can
be used with intrinsic functions to write hand-optimised vector library
routines (such as libmvec) or to optimise an algorithm for SVE.
The RFC isn't about defining a new high-level and portable vector
programming extension such as P0241.  (That kind of extension is
useful too of course.)

The RFC only discusses the C semantics in detail.  The ACLE has a
similar set of changes for C++, but the fundamental approach is very
similar, so it seemed better to concentrate on C to start with.

Contents
========

1. The types in more detail
2. Requirements
3. Possible approaches
4. Outline of the type system changes
5. Rationale for choosing this approach
6. Edits to the C standard
7. User-defined sizeless types
8. clang implementation
9. Examples

1. The types in more detail
===========================

The ACLE defines a vector type sv<base>_t for each supported element type
<base>_t, so that the complete set is:

    svint8_t      svint16_t     svint32_t     svint64_t
    svuint8_t     svuint16_t    svuint32_t    svuint64_t
                  svfloat16_t   svfloat32_t   svfloat64_t

The types in each column have the same number of lanes and have twice
as many lanes as those in the column to the right.  Every vector has
the same number of bytes in total.

The ACLE also defines a single predicate type:

    svbool_t

that has the same number of lanes as svint8_t and svuint8_t.

All these types are opaque builtin types and are only intended to be
used with the associated ACLE intrinsics.  There are intrinsics for
creating vectors from scalars, loading from scalars, storing to scalars,
reinterpreting one type as another, etc.

The idea is that the vector types would only be used for short-term
register-sized working data.  Longer-term data would typically be stored
out to arrays.

For example, the vector function underlying:

    #pragma omp declare simd
    double sin(double);

would be:

    svfloat64_t mangled_sin(svfloat64_t, svbool_t);

(The svbool_t is because SVE functions should be predicated by default,
to avoid the need for a scalar tail.)

2. Requirements
===============

One of the main questions that we needed to answer for the ACLE was:
how do we add the variable-length types above to the type system?
The key requirements were:

  * The approach must work in both C and C++.

  * It must be possible to define automatic variables with these types.

  * It must be possible to pass and return objects of these types
    (since that's what intrinsics and vector library routines need to do).

  * It must be possible to use the types in _Generic associations
    (since the ACLE uses _Generic to provide tgmath.h-style overloads).

  * It must be possible to create pointers or references to the types
    (for passing or returning by pointer or reference, and because not
    allowing references would be semantically difficult in C++).

3. Possible approaches
======================

It seems that any approach to defining the ACLE types would fall into
one of three categories:

  (1) Limit the types in such a way that there is no concept of size.

  (2) Define the size of the types to be variable.

  (3) Define the size of the types to be constant, either with the
      constant being large enough for all possible vector lengths or
      with the types pointing to separate memory (as for C++ classes
      like std::string).

The approach we chose comes under (1).  The next sections describe this
approach informally in more detail, explain the rationale for choosing it,
and then give a more formal definition, as an edit to the standard.

4. Outline of the type system changes
=====================================

C classifies types as "complete" (the size of objects can be calculated)
or "incomplete" (the size of objects can't be calculated).  There's very
little you can do with a type until it becomes complete.

The approach we took was to treat all the SVE types as permanently
incomplete.  On its own, this would put them in a similar situation to
"void" (although they wouldn't be exactly the same, since there are some
specific rules for "void" that don't apply to incomplete types in general).
We then relaxed specific rules until the types were actually useful.

To do this, we classified types as:

  * "indefinite" (lacking sufficient information to create an object of
    that type) or "definite" (having sufficient information)

  * "sized" (will have a known size when definite) or "sizeless" (will
    never have a known size)

  * "incomplete" (lacking sufficient information to determine the size of
    objects of that type) or "complete" (having sufficient information)

where the wording for the final bullet is taken verbatim from the
C standard.  "Complete" is now equivalent to "sized and definite".
All standard types are "sized" (even "void", although it's always
indefinite).

We then needed to make some rules use the distinction between "indefinite"
and "definite" rather than "incomplete" and "complete".  Referring back
to the requirements above, the specific things we wanted to allow were:

  * automatic variables with sizeless type
  * function parameters and return values with sizeless type
  * use of sizeless types with _Generic
  * pointers to sizeless types

Specific things we wanted to remain invalid -- by inheriting the rules from
incomplete types -- were:

  * creating or accessing arrays that have sizeless types
  * doing pointer arithmetic on pointers to sizeless types
  * using sizeof and _Alignof with a sizeless type (or object of sizeless type)
  * (sized) unions or structures with sizeless members
  * applying _Atomic to a sizeless type

It also seemed worth adding an extra restriction:

  * variables with sizeless type must not have static or thread-local
    storage duration

In practice it's impossible to define such variables with incomplete type,
but having an explicit rule means that things like:

    extern svint8_t foo;

are outright invalid rather than simply useless (because no other
translation unit could ever define foo).  Similarly, without an
explicit rule:

    svint8_t foo;

would be a valid tentative definition at the point it occurs and only
become invalid at the end of the translation unit, because svint8_t is
never completed.

This restriction isn't critical but it should allow better diagnostics.

5. Rationale for choosing this approach
=======================================

To recap the classification above, any approach would fall into
one of three categories:

  (1) Limit the types in such a way that there is no concept of size.

  (2) Define the size of the types to be variable.

  (3) Define the size of the types to be constant, either with the
      constant being large enough for all possible vector lengths or
      with the types pointing to separate memory (as for C++ classes
      like std::string).

(2) seemed initially appealing since C already has the concept of
variable-length arrays.  However, variable-length built-in types
would work in a significantly different way.  Arrays often decay to
pointers (which of course are fixed-length types), whereas vector
types never would.  Unlike arrays, it should be possible to pass
variable-length vectors to functions, return them from functions,
and assign them by value.

One particular difficulty is that the semantics of variable-length arrays
rely on having a point at which the array size is evaluated.  It would
be difficult to extend this approach to declarations of functions that
pass or return variable-length types.

As well as the extension itself being relatively complex (especially
for C++), it might be difficult to define it in a way that interacts
naturally with other (unseen) extensions, even those that are aware of
variable-length arrays.  Also, AIUI, variable-length arrays were added
to an early draft of C++14, but were later removed as too controversial
and didn't make it into the final standard.  C++17 still requires sizeof
to be constant and C11 makes variable-length arrays optional.

(2) therefore felt like a complicated dead-end.

(3) can be divided into two:

(3a) The vector types have a constant size and are large enough for all
     possible vector lengths.

The main problem with this is that the maximum size of 2048 bits is much
larger than the minimum of 128 bits.  Using a fixed size of 2048 bits
would be extremely inefficient for smaller vector lengths, and of course
the whole point of the ACLE is to make things *more* efficient.

Also, we would need to define the types such that only the bytes
associated with the actual vector length are significant.  This would
make it possible to pass or return the types in registers and treat
them as register values when copying.  This perhaps has some similarity
with overaligned structures such as:

    struct s { _Alignas(16) int i; };

except that the amount of padding is only known at runtime.

There's also a significant conceptual problem: encoding a fixed size
goes against the guiding principle of SVE, in which there is no preferred
vector length.  There's nothing particularly magical about the current
limit of 2048 bits and it would be better to avoid an ABI break if the
maximum ever did increase in future.

(3b) The vector types have a constant size and refer to separate storage
     (as for std::string etc.)

This would be difficult to do without C++-style constructor, destructor,
copy and move semantics, so wouldn't work well in C.  And in C++ it would
be less efficient than the proposed approach, since presumably an Allocator
would be needed to allocate the separate storage.

A more positive justification of the ACLE approach is that it seems
to meet the requirements in the most efficient way possible.  The
vectors can use their natural (native) representation, and the type
system prevents uses that would make that representation problematic.

Also, the approach of starting with very restricted types and then
specifically allowing certain things should be more future-proof
and interact better with other (unseen) language extensions.  By default,
any language extension would treat the new types like other incomplete
types and choose conservatively-correct behaviour.  It would then be
possible to relax the language extension if this default behaviour
turns out to be too restrictive.

(That said, treating the types as permanently incomplete still won't
avoid all clashes with other extensions.  For example, we need to
allow objects of automatic storage duration to have certain forms of
incomplete type, whereas an extension might implicitly assume that all
such objects must already have complete type.  The approach should still
avoid the worst effects though.)

6. Edits to the C standard
==========================

This section specifies the behaviour for sizeless types as an edit to N1570.

6.2.5 Types
-----------

In 6.2.5/1, replace:

    At various points within a translation unit an object type may be
    /incomplete/ …

onwards with:

    Object types are further partitioned into /sized/ and /sizeless/; all
    basic and derived types defined in this standard are sized, but an
    implementation may provide additional sizeless types.

and add two additional clauses:

  * At various points within a translation unit an object type may be
    /indefinite/ (lacking sufficient information to construct an object
    of that type) or /definite/ (having sufficient information).
    An object type is said to be /complete/ if it is both sized and
    definite; all other object types are said to be /incomplete/.
    Complete types have sufficient information to determine the size
    of an object of that type while incomplete types do not.

  * Arrays, structures, unions and enumerated types are always sized,
    so for them the term /incomplete/ is equivalent to (and used
    interchangeably with) the term /indefinite/.

Change 6.2.5/19 to:

    The void type comprises an empty set of values; it is a sized
    indefinite object type that cannot be completed (made definite).

Replace "incomplete" with "indefinite" and "complete" with "definite" in
6.2.5/37, which describes how a type's state can change throughout a
translation unit.

6.3.2.1 Lvalues, arrays, and function designators
-------------------------------------------------

Replace "incomplete" with "indefinite" in 6.3.2.1/1, so that sizeless
definite types are modifiable lvalues.

Make the same replacement in 6.3.2.1/2, to prevent undefined behaviour
when lvalues have sizeless definite type.

6.5.1.1 Generic selection
-------------------------

Replace "complete object type" with "definite object type" in 6.5.1.1/2,
so that the type name in a generic association can be a sizeless definite
type.

6.5.2.2 Function calls
----------------------

Replace "complete object type" with "definite object type" in 6.5.2.2/1,
so that functions can return sizeless definite types.

Make the same change in 6.5.2.2/4, so that arguments can also have
sizeless definite type.

6.5.2.5 Compound literals
-------------------------

Replace "complete object type" with "definite object type" in 6.5.2.5/1,
so that compound literals can have sizeless definite type.

6.7 Declarations
----------------

Insert the following new clause after 6.7/4:

  * If an identifier for an object does not have automatic storage
    duration, its type must be sized rather than sizeless.

Replace "complete" with "definite" in 6.7/7, which describes when the
type of an object becomes definite.

6.7.6.3 Function declarators (including prototypes)
---------------------------------------------------

Replace "incomplete type" with "indefinite type" in 6.7.6.3/4, so that
parameters can also have sizeless definite type.

Make the same change in 6.7.6.3/12, which allows even indefinite types
to be function parameters if no function definition is present.

6.7.9 Initialization
--------------------

Replace "complete object type" with "definite object type" in 6.7.9/3,
to allow initialization of identifiers with sizeless definite type.

6.9.1 Function definitions
--------------------------

Replace "complete object type" with "definite object type" in 6.9.1/3,
so that functions can return sizeless definite types.

Make the same change in 6.9.1/7, so that adjusted parameter types can be
sizeless definite types.

J.2 Undefined behavior
----------------------

Update the entries that refer to the clauses above.

7. User-defined sizeless types
==============================

We have a follow-on proposal for allowing sizeless aggregates to be
defined directly in C, but this would of course only be useful if the
basic concept of sizeless types seems reasonable.  Since the message is
quite long already, I thought it would be better to leave them out for now.

8. clang implementation
=======================

I've uploaded a clang implementation of the RFC to Phabricator.
There are three parts:

https://reviews.llvm.org/D46307

  A hack to add two new sizeless builtin types SVInt8_t and SVInt16_t.
  This is purely so that the other two patches have something to work with.

https://reviews.llvm.org/D46308

  The clang support itself, including testcases.

https://reviews.llvm.org/D46309

  A patch that adds comments to other uses of IsCompleteType,
  RequireCompleteType, etc., explaining why they didn't change.
  No functional change.

These are purely to back up the RFC.  We're not asking for the
extension to be accepted into clang at this stage.

9. Examples
===========

By way of example, here's a naive implementation of unit-stride daxpy
using the ACLE:

    void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
    {
      int64_t i = 0;
      svbool_t pg = svwhilelt_b64(i, n);
      do
        {
          svfloat64_t dx_vec = svld1(pg, &dx[i]);
          svfloat64_t dy_vec = svld1(pg, &dy[i]);
          svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));
          i += svcntd();
          pg = svwhilelt_b64(i, n);
        }
      while (svptest_any(svptrue_b64(), pg));
    }

Of course, this isn't an interesting case in itself, since (a) the
compiler should be able to do the same thing given the scalar code and
(b) this could easily be written in a higher-level vector form.
Real daxpy optimisations would be more complicated.

A less obvious example is the following, which replaces all non-printable
ASCII characters in a nul-terminated string with '.':

    void f(uint8_t *a)
    {
      svbool_t trueb = svptrue_b8();
      svuint8_t dots = svdup_u8('.');
      svbool_t terminators;
      do
        {
          svwrffr(trueb);
          svuint8_t data = svldff1(trueb, a);
          svbool_t ld_mask = svrdffr();
          svbool_t nonascii = svcmplt(ld_mask, data, ' '-1);
          terminators = svcmpeq(ld_mask, data, 0);
          svbool_t st_mask = svbrkb_z(nonascii, terminators);
          svst1(st_mask, a, dots);
          a += svcntp_b8(trueb, ld_mask);
        }
      while (!svptest_any(trueb, terminators));
    }

This implementation is still naive but is more SVE-specific.