[llvm-dev] Complex proposal v3 + roundtable agenda

Mon Oct 21 22:34:25 PDT 2019

Ahead of Wednesday’s roundtable at the developers’ conference, here is version three of
the proposal for first-class complex types in LLVM.  I was not able to add Krzysztof Parzyszek’s
suggestion of a “cunzip” intrinsic returning two vectors as I could not find examples of intrinsics
that return two values at the IR level.  The Hexagon intrinsics declared to return two values do
not actually have both of their values used at the IR level as far as I can determine.  We can
discuss this more at the roundtable.

Following is a general outline for Wednesday’s roundtable.  Please have a look and make
any suggestions you’d like about topics we should cover.  Feel free to add to the list of
questions to discuss as well.

LLVM Complex Types Roundtable
-----------------------------

Introductions (name/affiliation if any/interest)

Reasons for a first-class type

  - Reasoning about algebraic optimization

  - Preserve semantics through vectorization and into target-specific lowering;
    different targets support different algorithms

  - Take advantage of faster & less precise algorithms with options/pragmas

  - Better diagnostics for users

  - Other motivations?

Open questions

  - A cunzip intrinsic would need to work at the IR level, returning two
    separate SSA values.  How can this be done?  The example given was a
    Hexagon-specific intrinsic that doesn't appear to make use of the two
    destinations at the IR level.

  - Are separate extractreal/extractimag intrinsics sufficient for targets that
    support such operations (e.g. NEON's VUZP)?

  - The proposal allows bitcasts of vector of complex, even though bitcasts of
    aggregates in general are disallowed.  Is this special case reasonable?

  - If we allow such bitcasts, is czip necessary, or is shufflevector + bitcast
    to vector of complex sufficient?

  - Some frontends will likely want to communicate specific algorithms for
    computing complex values (e.g. C Annex G).  What is the best way to do this?
    User compiler options?  Pragmas?  Function attributes?  Something else?

  - What TTI interfaces would be useful to inform the optimizer how best to
    lower complex operations?

  - When should lowering be done?

  - Other questions?

I am looking forward to our discussion!

                                     -David

Proposal to Support Complex Operations in LLVM
==============================================

Revision History
----------------

v1 - Initial proposal [1]

v2 - 2nd draft [2]
   - Added complex of all existing floating point types
   - Made complex a special aggregate
   - Specified literal syntax
   - Added special index values "real" and "imag" for insertvalue/extractvalue
         of complex
   - Added czip intrinsic to create vector of complex from vectors of
         real/imaginary
   - Added extractreal and extractimag intrinsics for vectors of complex
   - Added masked vector intrinsics

v3 - This proposal
   - Added vector-of-complex types
   - Added scalable vector support
   - Added bitcasts of vector of complex

Abstract
--------

Several vendors and individuals have proposed first-class complex support in
LLVM.  Goals of this proposal include better optimization, diagnostics and
general user experience.

Introduction and Motivation
---------------------------

Recently the topic of complex numbers arose on llvm-dev with several developers
expressing a desire for first-class IR support for complex [3] [4].  Interest in
complex numbers in LLVM goes back much further [5].

Currently clang chooses to represent standard types like "double complex" and
"std::complex<float>" as structure types containing two scalar fields, for
example {double, double}.  Consequently, arrays of complex type are represented
as, for example, [8 x {double, double}].  This has consequences for how clang
converts complex operations to LLVM IR.  In general, clang emits loads of the
individual real and imaginary parts and feeds them into arithmetic operations.
Vectorization results in many shufflevector operations to massage the data into
sequences suitable for vector arithmetic.
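
As a rough illustration of the paragraph above, the following C++ sketch mimics
what the optimizer sees today: a complex multiply expressed only through
separate real and imaginary fields.  The names ComplexPair, mulParts and
mulArrays are hypothetical, and the arithmetic is the textbook formula without
the NaN/infinity recovery described in C Annex G.  It is not meant to be what
clang literally emits.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the {double, double} struct used today.
struct ComplexPair {
  double re;
  double im;
};

// Product written in terms of the individual fields.  Nothing here says
// "complex multiply"; the algebra has to be rediscovered from the scalars.
inline ComplexPair mulParts(ComplexPair a, ComplexPair b) {
  return {a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

// Elementwise product over an array such as [8 x {double, double}].
template <std::size_t N>
std::array<ComplexPair, N> mulArrays(const std::array<ComplexPair, N> &a,
                                     const std::array<ComplexPair, N> &b) {
  std::array<ComplexPair, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = mulParts(a[i], b[i]);
  return out;
}
```

Vectorizing mulArrays requires separating the interleaved re/im fields into
lanes, which is where the shufflevector traffic comes from.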

All of the real/imaginary data manipulation obscures the underlying arithmetic.
It makes it difficult to reason about the algebraic properties of expressions.
For expressiveness and optimization ability, it would be useful to have a
higher-level representation for complex in LLVM IR.  In general, it is desirable
to defer lowering of complex until the optimizer has had a reasonable chance to
exploit its properties.

First-class support for complex can also improve the user experience.
Diagnostics could express concepts in the complex domain instead of referring to
expressions containing shuffles and other low-level data manipulation.  Users
that wish to examine IR directly will see much less gobbledygook and can more
easily reason about the IR.

Types
-----

This proposal introduces new aggregate types to represent complex numbers.

c16      - Complex of 16-bit float
c32      - like float complex or std::complex<float>
c64      - like double complex or std::complex<double>
x86_c80  - Complex of x86_fp80
c128     - like long double complex or std::complex<long double>
ppc_c128 - Complex of ppc_fp128

Note that the references to C and C++ types above are simply explanatory.
Nothing in this proposal assumes any particular high-level language type will
map to the above LLVM types.

The "underlying type" of a complex type is the type of its real and imaginary
components.

The sizes of the complex types are twice that of their underlying types.  The
real part of the complex shall appear first in the layout of the types.  The
format of the real and imaginary parts is the same as for the complex type's
underlying type.  This should map to most common data representations of complex
in various languages.
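
The layout rule above matches what C++ guarantees for std::complex, which makes
for a quick sanity check.  The sketch below uses std::complex only as a
stand-in for c32/c64; Parts and memoryOrder are hypothetical names introduced
for illustration.

```cpp
#include <complex>
#include <cstring>

// Size is twice the underlying type.
static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "std::complex<float> models c32");
static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
              "std::complex<double> models c64");

// The two components in the order they appear in memory.
struct Parts {
  double first;
  double second;
};

// Copies the raw bytes of a complex value into two doubles, preserving
// memory order, so the caller can see which component is stored first.
inline Parts memoryOrder(const std::complex<double> &z) {
  Parts p;
  std::memcpy(&p, &z, sizeof(p));
  return p;
}
```

For a value 5.67 + 1.56i, memoryOrder yields first == 5.67 (the real part) and
second == 1.56 (the imaginary part).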

These types are *not* considered floating point types for the purposes of
Type::isFloatTy and friends, llvm_anyfloat_ty, etc. in order to limit surprises
when introducing these types.  New APIs will allow querying and creation of
complex types:

bool Type::isComplexTy()          const;
bool Type::isComplex16Ty()        const;
bool Type::isComplex32Ty()        const;
bool Type::isComplex64Ty()        const;
bool Type::isComplexX86_FP80Ty()  const;
bool Type::isComplex128Ty()       const;
bool Type::isComplexPPC_FP128Ty() const;

The types are a special kind of aggregate, giving them access to the insertvalue
and extractvalue operations with special notation (see below).

We can define vectors of complex:

<8 x c16> - Vector of eight complex of 16-bit float (256 bits total)
<4 x c32> - Vector of four complex of 32-bit float (256 bits total)
<4 x c64> - Vector of four complex of 64-bit float (512 bits total)
...

Such vectors may be scalable:

<vscale x 8 x c16>
<vscale x 4 x c32>
<vscale x 1 x c64>
<vscale x 2 x c64>
...

Analogous ValueTypes will be used by intrinsics.

def c16       : ValueType<32,  uuu>
def c32       : ValueType<64,  vvv>
def c64       : ValueType<128, www>
def x86c80    : ValueType<160, xxx>
def c128      : ValueType<256, yyy>
def ppcc128   : ValueType<256, zzz>

def v8c16     : ValueType<256, aaa>
def v4c32     : ValueType<256, bbb>
def v4c64     : ValueType<512, ccc>
...

def nxv8c16   : ValueType<256, ddd>
def nxv4c32   : ValueType<256, eee>
def nxv1c64   : ValueType<128, fff>
def nxv2c64   : ValueType<256, ggg>
...

def llvm_anycomplex_ty : LLVMType<Any>;
def llvm_c16_ty        : LLVMType<c16>;
def llvm_c32_ty        : LLVMType<c32>;
def llvm_c64_ty        : LLVMType<c64>;
def llvm_x86c80_ty     : LLVMType<x86c80>;
def llvm_c128_ty       : LLVMType<c128>;
def llvm_ppcc128_ty    : LLVMType<ppcc128>;

def llvm_v8c16_ty      : LLVMType<v8c16>;
def llvm_v4c32_ty      : LLVMType<v4c32>;
def llvm_v4c64_ty      : LLVMType<v4c64>;
...

The numbering of the ValueTypes will be determined after discussion.  It may be
desirable to insert the scalar types before the existing vector types, grouping
them with the other scalar types, or we may want to put them somewhere else.
Similarly, the vector types may be grouped with the other vector types or
somewhere else.

Literals
--------

Literal complex values have the special spelling '(' <fp constant> '+'|'-'
<fp constant> 'i' ')':

%v1 = c64 ( 5.67 + 1.56i )
%v2 = c64 ( 55.87 - 4.23i )
%v3 = c64 ( 55.87 + -4.23i )
%v4 = c32 ( 8.24e+2 + 0.0i )
%v5 = c16 ( 0.0 + 0.0i )

Note that the literal representation requires an explicit specification of the
imaginary part, even if zero.  A "redundant" <+ negative imaginary> is allowed
to facilitate reuse of floating point constants.

Operations
----------

This proposal overloads existing floating point instructions for complex types
in order to leverage existing expression optimizations:

c64 %res   = fadd c64 %a, c64 %b
v8c64 %res = fsub v8c64 %a, v8c64 %b
c128 %res  = fmul c128 %a, c128 %b
v4c64 %res = fdiv v4c64 %a, v4c64 %b

The only valid comparisons of complex values shall be equality:

i1 %res = eq c32 %a, c32 %b
i8 %res = eq v8c32 %a, v8c32 %b
i1 %res = ne c64 %a, c64 %b
i8 %res = ne v8c64 %a, v8c64 %b

select is defined for complex:

c32 %res = select i1 %cmp, c32 %a, c32 %b
v4c64 %res = select i4 %cmp, v4c64 %a, v4c64 %b

Complex values may be cast to other complex types:

c32 %res = fptrunc c64 %a to c32
c64 %res = fpext c32 %a to c64

As a special case, vectors of complex may be bitcast to vectors of their
underlying type:

v8f32 %res = bitcast <4 x c32> %a to <8 x float>

Complex types were defined as aggregates above, but special ones.  One aspect of
their specialness is allowing bitcasts of vector of complex to equal-width
vectors of their underlying type.
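
To make the intended effect of such a bitcast concrete, here is a minimal C++
sketch.  It assumes the lanes of a vector of complex are stored element by
element with the real part first, so the result interleaves real and imaginary
values; the proposal does not spell this out, and target endianness questions
are ignored.  bitcastToFloats is a hypothetical name and std::complex<float>
stands in for c32.

```cpp
#include <array>
#include <complex>
#include <cstring>

static_assert(sizeof(std::complex<float>) == 2 * sizeof(float),
              "std::complex<float> models c32");

// Hypothetical model of `bitcast <4 x c32> %a to <8 x float>`: the bits are
// reinterpreted, not converted, so no arithmetic happens.
inline std::array<float, 8>
bitcastToFloats(const std::array<std::complex<float>, 4> &v) {
  std::array<float, 8> out{};
  std::memcpy(out.data(), v.data(), sizeof(float) * out.size());
  return out;
}
```

Under that assumption, lanes 0, 2, 4, 6 of the result hold the real parts and
lanes 1, 3, 5, 7 hold the imaginary parts, which is why shufflevector plus
bitcast is a candidate replacement for czip.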

insertvalue and extractvalue may be used with the special index values "real"
and "imag":

%real = f32 extractvalue c32 %a, real
%real = c64 insertvalue c64 undef, f64 %r, real
%cplx = c64 insertvalue c64 %real, f64 %i, imag

The pseudo-value "real" shall evaluate to the integer constant zero and the
pseudo-value "imag" shall evaluate to the integer constant one, as if
extractvalue/insertvalue were written with 0/1.  The use of any other index with
a complex value is undefined.

We also overload existing intrinsics:

declare c16      @llvm.sqrt.c16(c16 %val)
declare c32      @llvm.sqrt.c32(c32 %val)
declare c64      @llvm.sqrt.c64(c64 %val)
declare x86_c80  @llvm.sqrt.x86_c80(x86_c80 %val)
declare c128     @llvm.sqrt.c128(c128 %val)
declare ppc_c128 @llvm.sqrt.ppc_c128(ppc_c128 %val)

declare c16      @llvm.pow.c16(c16 %val, c16 %power)
declare c32      @llvm.pow.c32(c32 %val, c32 %power)
declare c64      @llvm.pow.c64(c64 %val, c64 %power)
declare x86_c80  @llvm.pow.x86_c80(x86_c80 %val, x86_c80 %power)
declare c128     @llvm.pow.c128(c128 %val, c128 %power)
declare ppc_c128 @llvm.pow.ppc_c128(ppc_c128 %val, ppc_c128 %power)

declare c16      @llvm.sin.c16(c16 %val)
declare c32      @llvm.sin.c32(c32 %val)
declare c64      @llvm.sin.c64(c64 %val)
declare x86_c80  @llvm.sin.x86_c80(x86_c80 %val)
declare c128     @llvm.sin.c128(c128 %val)
declare ppc_c128 @llvm.sin.ppc_c128(ppc_c128 %val)

declare c16      @llvm.cos.c16(c16 %val)
declare c32      @llvm.cos.c32(c32 %val)
declare c64      @llvm.cos.c64(c64 %val)
declare x86_c80  @llvm.cos.x86_c80(x86_c80 %val)
declare c128     @llvm.cos.c128(c128 %val)
declare ppc_c128 @llvm.cos.ppc_c128(ppc_c128 %val)

declare c16      @llvm.log.c16(c16 %val)
declare c32      @llvm.log.c32(c32 %val)
declare c64      @llvm.log.c64(c64 %val)
declare x86_c80  @llvm.log.x86_c80(x86_c80 %val)
declare c128     @llvm.log.c128(c128 %val)
declare ppc_c128 @llvm.log.ppc_c128(ppc_c128 %val)

declare half      @llvm.fabs.c16(c16 %val)
declare float     @llvm.fabs.c32(c32 %val)
declare double    @llvm.fabs.c64(c64 %val)
declare x86_fp80  @llvm.fabs.x86_c80(x86_c80 %val)
declare fp128     @llvm.fabs.c128(c128 %val)
declare ppc_fp128 @llvm.fabs.ppc_c128(ppc_c128 %val)
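
Note that the fabs overloads return the underlying real type rather than a
complex value, presumably the magnitude.  The C++ library draws the same
distinction, which the following sketch checks at compile time.  C64 and
magnitude are hypothetical names, and treating llvm.fabs on complex as the
modulus is an assumption, not something the declarations above state.

```cpp
#include <cmath>
#include <complex>
#include <type_traits>

using C64 = std::complex<double>; // hypothetical stand-in for c64

// sqrt of a complex stays complex, like the llvm.sqrt.c64 overload.
static_assert(std::is_same<decltype(std::sqrt(C64{})), C64>::value,
              "sqrt of complex is complex");

// abs of a complex is real, like `declare double @llvm.fabs.c64(c64 %val)`.
static_assert(std::is_same<decltype(std::abs(C64{})), double>::value,
              "abs of complex is real");

// Assumed meaning of fabs on complex: the modulus sqrt(re^2 + im^2).
inline double magnitude(C64 z) { return std::abs(z); }
```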

Conversion to/from half-precision overloads the existing intrinsics.

llvm.convert.to.c16.* - Overloaded intrinsic to convert to c16.

declare c16 @llvm.convert.to.c16.c32(c32 %val)
declare c16 @llvm.convert.to.c16.c64(c64 %val)

llvm.convert.from.c16.* - Overloaded intrinsic to convert from c16.

declare c32 @llvm.convert.from.c16.c32(c16 %val)
declare c64 @llvm.convert.from.c16.c64(c16 %val)

In addition, new intrinsics will be used for complex-specific operations:

llvm.cconj.* - Overloaded intrinsic to compute the conjugate of a
               complex value

declare c16      @llvm.cconj.c16(c16 %val)
declare c32      @llvm.cconj.c32(c32 %val)
declare c64      @llvm.cconj.c64(c64 %val)
declare x86_c80  @llvm.cconj.x86_c80(x86_c80 %val)
declare c128     @llvm.cconj.c128(c128 %val)
declare ppc_c128 @llvm.cconj.ppc_c128(ppc_c128 %val)

llvm.czip.* - Overloaded intrinsic to create a vector of complex from two
              vectors of floating-point type (not all variants shown)

declare v4c32 @llvm.czip.v4c32(v4f32 %real, v4f32 %imag)
declare v4c64 @llvm.czip.v4c64(v4f64 %real, v4f64 %imag)

llvm.extractreal.* - Overloaded intrinsic to create a vector of floating-point
                     type from the real portions of a vector of complex (not all
                     variants shown)

declare v4f32 @llvm.extractreal.v4c32(v4c32 %val)
declare v4f64 @llvm.extractreal.v4c64(v4c64 %val)

llvm.extractimag.* - Overloaded intrinsic to create a vector of floating-point
                     type from the imaginary portions of a vector of complex
                     (not all variants shown)

declare v4f32 @llvm.extractimag.v4c32(v4c32 %val)
declare v4f64 @llvm.extractimag.v4c64(v4c64 %val)
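
The intended relationship between czip, extractreal and extractimag can be
sketched in C++ as follows.  This is only a lane-by-lane model of the
descriptions above, with std::array standing in for IR vectors and std::complex
for the complex types; the function names are hypothetical.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Model of llvm.czip: lane i of the result is real[i] + imag[i]*i.
template <typename T, std::size_t N>
std::array<std::complex<T>, N> czip(const std::array<T, N> &real,
                                    const std::array<T, N> &imag) {
  std::array<std::complex<T>, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = std::complex<T>(real[i], imag[i]);
  return out;
}

// Model of llvm.extractreal: the real part of every lane.
template <typename T, std::size_t N>
std::array<T, N> extractReal(const std::array<std::complex<T>, N> &v) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = v[i].real();
  return out;
}

// Model of llvm.extractimag: the imaginary part of every lane.
template <typename T, std::size_t N>
std::array<T, N> extractImag(const std::array<std::complex<T>, N> &v) {
  std::array<T, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = v[i].imag();
  return out;
}
```

In this model, extractReal(czip(r, i)) gives back r and extractImag(czip(r, i))
gives back i, so a "cunzip" returning both vectors would be equivalent to the
pair of extract intrinsics.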

Masked intrinsics are also overloaded.  The complex types are considered a
single logical entity and thus the mask bits correspond to the complex value as
a whole, not the individual real and imaginary parts:

llvm.masked.load.* - Overloaded intrinsic to load complex under mask
(not all variants shown)

declare v4c32 @llvm.masked.load.v4c32.p0v4c32(<4 x c32>* %ptr,
                                              i32 %alignment,
                                              <4 x i1> %mask,
                                              <4 x c32> %passthrough)

declare v8c64 @llvm.masked.load.v8c64.p0v8c64(<8 x c64>* %ptr,
                                              i32 %alignment,
                                              <8 x i1> %mask,
                                              <8 x c64> %passthrough)
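
A minimal C++ model of the masking rule, assuming the semantics of the existing
llvm.masked.load carry over unchanged: one mask bit selects a whole complex
lane, so the real and imaginary parts are always loaded or skipped together.
maskedLoad is a hypothetical name and the alignment operand is omitted.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// One bool per complex lane, not one per scalar component.
template <typename T, std::size_t N>
std::array<std::complex<T>, N>
maskedLoad(const std::complex<T> *ptr, const std::array<bool, N> &mask,
           const std::array<std::complex<T>, N> &passthrough) {
  std::array<std::complex<T>, N> out = passthrough;
  for (std::size_t i = 0; i < N; ++i)
    if (mask[i])
      out[i] = ptr[i]; // memory is only touched for active lanes
  return out;
}
```

A <4 x c32> load therefore takes a <4 x i1> mask, even though eight floats may
be read.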

llvm.masked.store.* - Overloaded intrinsic to store complex under mask (not all
                      variants shown)

declare void @llvm.masked.store.v4c32.p0v4c32(<4 x c32> %val,
                                              <4 x c32>* %ptr,
                                              i32 %alignment,
                                              <4 x i1> %mask)

declare void @llvm.masked.store.v8c64.p0v8c64(<8 x c64> %val,
                                              <8 x c64>* %ptr,
                                              i32 %alignment,
                                              <8 x i1> %mask)

llvm.masked.gather.* - Overloaded intrinsic to gather complex under mask (not
                       all variants shown)

declare v4c32 @llvm.masked.gather.v4c32.p0v4c32(<4 x c32*> %ptrs,
                                                i32 %alignment,
                                                <4 x i1> %mask,
                                                <4 x c32> %passthrough)

declare v8c64 @llvm.masked.gather.v8c64.p0v8c64(<8 x c64*> %ptrs,
                                                i32 %alignment,
                                                <8 x i1> %mask,
                                                <8 x c64> %passthrough)

llvm.masked.scatter.* - Overloaded intrinsic to scatter complex under mask (not
                        all variants shown)

declare void @llvm.masked.scatter.v4c32.p0v4c32(<4 x c32> %val,
                                                <4 x c32*> %ptrs,
                                                i32 %alignment,
                                                <4 x i1> %mask)

declare void @llvm.masked.scatter.v8c64.p0v8c64(<8 x c64> %val,
                                                <8 x c64*> %ptrs,
                                                i32 %alignment,
                                                <8 x i1> %mask)

llvm.masked.expandload.* - Overloaded intrinsic to expandload complex under mask
                           (not all variants shown)

declare v4c32 @llvm.masked.expandload.v4c32.p0v4c32(c32* %ptr,
                                                    <4 x i1> %mask,
                                                    <4 x c32> %passthrough)

declare v8c64 @llvm.masked.expandload.v8c64.p0v8c64(c64* %ptr,
                                                    <8 x i1> %mask,
                                                    <8 x c64> %passthrough)

llvm.masked.compressstore.* - Overloaded intrinsic to compressstore complex
                              under mask (not all variants shown)

declare void @llvm.masked.compressstore.v4c32.p0v4c32(<4 x c32> %val,
                                                      c32* %ptr,
                                                      <4 x i1> %mask)

declare void @llvm.masked.compressstore.v8c64.p0v8c64(<8 x c64> %val,
                                                      c64* %ptr,
                                                      <8 x i1> %mask)
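
For the expandload/compressstore pair, treating each complex value as one
logical entity means memory is packed in units of whole complex values.  The
sketch below assumes the semantics of the existing llvm.masked.expandload and
llvm.masked.compressstore intrinsics apply unchanged; compressStore and
expandLoad are hypothetical names.

```cpp
#include <array>
#include <complex>
#include <cstddef>

// Active lanes are written to consecutive memory slots, one whole complex
// value per slot.  Returns the number of complex values stored.
template <typename T, std::size_t N>
std::size_t compressStore(const std::array<std::complex<T>, N> &val,
                          std::complex<T> *ptr,
                          const std::array<bool, N> &mask) {
  std::size_t next = 0;
  for (std::size_t i = 0; i < N; ++i)
    if (mask[i])
      ptr[next++] = val[i];
  return next;
}

// Consecutive complex values from memory fill the active lanes; inactive
// lanes keep the passthrough value.
template <typename T, std::size_t N>
std::array<std::complex<T>, N>
expandLoad(const std::complex<T> *ptr, const std::array<bool, N> &mask,
           const std::array<std::complex<T>, N> &passthrough) {
  std::array<std::complex<T>, N> out = passthrough;
  std::size_t next = 0;
  for (std::size_t i = 0; i < N; ++i)
    if (mask[i])
      out[i] = ptr[next++];
  return out;
}
```

With the same mask, expandLoad applied to the output of compressStore restores
the active lanes of the original vector.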

Conclusion
----------

This proposal introduces new complex types, overloads existing floating point
instructions and intrinsics for common complex operations, and adds new
intrinsics for complex-specific operations.

Goals of this work include better reasoning about complex operations within
LLVM, leading to better optimization, reporting and overall user experience.

This is a draft and subject to change.

[1] http://lists.llvm.org/pipermail/llvm-dev/2019-July/133558.html
[2] http://lists.llvm.org/pipermail/llvm-dev/2019-August/134815.html
[3] http://lists.llvm.org/pipermail/llvm-dev/2019-April/131516.html
[4] http://lists.llvm.org/pipermail/llvm-dev/2019-April/131523.html
[5] http://lists.llvm.org/pipermail/llvm-dev/2010-December/037072.html