[llvm-dev] Complex proposal v3 + roundtable agenda

Florian Hahn via llvm-dev llvm-dev at lists.llvm.org
Thu Nov 12 09:03:22 PST 2020


Hi,

There’s growing interest among our users in making better use of dedicated hardware instructions for complex math, and I would like to restart the discussion on the topic. Given that the original thread was started a while ago, apologies if I missed anything already discussed earlier on the list or at the round-table. The original mail is quoted below.

In particular, I’m interested in the AArch64 side of things, like using FCMLA [1] for complex multiplications to start with.  

To get the discussion going, I’d like to share an alternative pitch: instead of starting with adding complex types, we could begin by adding a set of intrinsics that operate on complex values packed into vectors.

Starting with intrinsics would allow us to bring up the lowering of those intrinsics to target-specific nodes incrementally, without having to make the substantial changes across the codebase that adding new types would require. Initially, we could try to match IR patterns that correspond to complex operations late in the pipeline. We can then work on incrementally moving the point where the intrinsics are introduced earlier in the pipeline, as we adapt more passes to deal with them. This way, we won’t have to teach all passes about complex types at once or risk losing all the existing combines on the corresponding floating-point operations.

I think if we introduce a small set of intrinsics for complex math (like @llvm.complex.multiply), we could use them to improve code generation in key passes like the vectorizers and deliver large improvements to our users fairly quickly. There might be some scenarios which require a dedicated IR type, but I think we can get a long way with a set of specialized intrinsics at a much lower cost. If we later decide that dedicated IR types are needed, replacing the intrinsics should be easy, and we will benefit from having already updated various passes to deal with the intrinsics.
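
To make this more concrete, here is one possible shape for such an intrinsic; this is just a sketch, the exact signature is still open and may differ from what the proof-of-concept patches below use. The idea is that operands and results are plain vectors with the real and imaginary parts of each complex value interleaved in adjacent lanes:

  ; Hypothetical signature: each <4 x float> packs two c32 values as
  ; (re0, im0, re1, im1); the result is the lane-wise complex product
  ; in the same interleaved layout.
  declare <4 x float> @llvm.complex.multiply.v4f32(<4 x float>, <4 x float>)

  %prod = call <4 x float> @llvm.complex.multiply.v4f32(<4 x float> %x, <4 x float> %y)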

We took a similar approach when adding matrix support to LLVM and I think that worked out very well in the end. The implementation upstream generates equivalent or better code than our earlier implementation using dedicated IR matrix types, while being simpler and impacting a much smaller area of the codebase.

An independent issue to discuss is how to generate complex math intrinsics.
As part of the initial bring-up, I’d propose matching the code Clang generates for operations on std::complex<> & co to introduce the complex math intrinsics. This won’t be perfect and will miss cases, but it allows us to deliver initial improvements without requiring extensive updates to existing libraries or frontends. I don’t think either the intrinsic-only or the complex-type variant is inherently more convenient for frontends to emit.
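
For example, once fast-math folds away the Annex G NaN handling, a std::complex<float> multiplication reaches the middle end as the familiar four-multiply pattern, which is roughly what the matching would look for:

  %ac = fmul fast float %a.re, %b.re
  %bd = fmul fast float %a.im, %b.im
  %ad = fmul fast float %a.re, %b.im
  %bc = fmul fast float %a.im, %b.re
  %re = fsub fast float %ac, %bd    ; real part: a.re*b.re - a.im*b.im
  %im = fadd fast float %ad, %bc    ; imaginary part: a.re*b.im + a.im*b.re

A matching pass would rewrite such a pattern into a single call to the complex multiply intrinsic, which the backend can then lower to FCMLA where available.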

To better illustrate what this approach could look like, I put up a set of rough patches that introduce a @llvm.complex.multiply intrinsic (https://reviews.llvm.org/D91347), replace a set of fadd/fsub/fmul instructions with @llvm.complex.multiply (https://reviews.llvm.org/D91353), and lower the intrinsic to FCMLA on AArch64 (https://reviews.llvm.org/D91354). Note that those are just rough proof-of-concept patches.

Cheers,
Florian

[1] https://developer.arm.com/docs/ddi0596/h/simd-and-floating-point-instructions-alphabetic-order/fcmla-floating-point-complex-multiply-accumulate
> On Oct 22, 2019, at 06:34, David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Ahead of Wednesday’s roundtable at the developers’ conference, here is version three of
> the proposal for first-class complex types in LLVM.  I was not able to add Krzysztof Parzyszek’s
> suggestion of a “cunzip” intrinsic returning two vectors, as I could not find examples of intrinsics
> that return two values at the IR level.  The Hexagon intrinsics declared to return two values do
> not actually have both of their values used at the IR level as far as I can determine.  We can
> discuss this more at the roundtable.
>  
> Following is a general outline for Wednesday’s roundtable.  Please have a look and make
> any suggestions you’d like about topics we should cover.  Feel free to add to the list of
> questions to discuss as well.
>  
> LLVM Complex Types Roundtable
> -----------------------------
>  
> Introductions (name/affiliation if any/interest)
>  
> Reasons for a first-class type
>  
>   - Reasoning about algebraic optimization
>  
>   - Preserve semantics through vectorization and into target-specific lowering
>     Different targets have support for different algorithms
>  
>   - Take advantage of faster & less precise algorithms with options/pragmas
>  
>   - Better diagnostics for users
>  
>   - Other motivations?
>  
> Open questions
>  
>   - A cunzip intrinsic would need to work at the IR level, returning two
>     separate SSA values.  How can this be done?  The example given was a
>     Hexagon-specific intrinsic that doesn't appear to make use of the two
>     destinations at the IR level.
>  
>   - Are separate extractreal/extractimag intrinsics sufficient for targets that
>     support such operations (e.g. NEON's VUZP)?
>  
>   - The proposal allows bitcasts of vector of complex, even though bitcasts of
>     aggregates in general are disallowed.  Is this special case reasonable?
>  
>   - If we allow such bitcasts, is czip necessary, or is shufflevector + bitcast
>     to vector of complex sufficient?
>  
>   - Some frontends will likely want to communicate specific algorithms for
>     computing complex values (e.g. C Annex G).  What is the best way to do this?
>     User compiler options?  Pragmas?  Function attributes?  Something else?
>  
>   - What TTI interfaces would be useful to inform the optimizer how best to
>     lower complex operations?
>  
>   - When should lowering be done?
>  
>   - Other questions?
>  
> I am looking forward to our discussion!
>  
>                                      -David
>  
> Proposal to Support Complex Operations in LLVM
> ==============================================
>  
> Revision History
> ----------------
>  
> v1 - Initial proposal [1]
>  
> v2 - 2nd draft [2]
>    - Added complex of all existing floating point types
>    - Made complex a special aggregate
>    - Specified literal syntax
>    - Added special index values "real" and "imag" for insertvalue/extractvalue
>          of complex
>    - Added czip intrinsic to create vector of complex from vectors of
>          real/imaginary
>    - Added extractreal and extractimag intrinsics for vectors of complex
>    - Added masked vector intrinsics
>  
> v3 - This proposal
>    - Added vector-of-complex types
>    - Added scalable vector support
>    - Added bitcasts of vector of complex
>  
> Abstract
> --------
>  
> Several vendors and individuals have proposed first-class complex support in
> LLVM.  Goals of this proposal include better optimization, diagnostics and
> general user experience.
>  
> Introduction and Motivation
> ---------------------------
>  
> Recently the topic of complex numbers arose on llvm-dev with several developers
> expressing a desire for first-class IR support for complex [3] [4].  Interest in
> complex numbers in LLVM goes back much further [5].
>  
> Currently clang chooses to represent standard types like "double complex" and
> "std::complex<float>" as structure types containing two scalar fields, for
> example {double, double}.  Consequently, arrays of complex type are represented
> as, for example, [8 x {double, double}].  This has consequences for how clang
> converts complex operations to LLVM IR.  In general, clang emits loads of the
> individual real and imaginary parts and feeds them into arithmetic operations.
> Vectorization results in many shufflevector operations to massage the data into
> sequences suitable for vector arithmetic.
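>  
> For illustration, after vectorizing a loop over [8 x {double, double}], the
> interleaved storage shows up as explicit shuffles that separate the real and
> imaginary lanes, roughly:
>  
> %wide = load <4 x double>, <4 x double>* %ptr    ; re0, im0, re1, im1
> %re   = shufflevector <4 x double> %wide, <4 x double> undef, <2 x i32> <i32 0, i32 2>
> %im   = shufflevector <4 x double> %wide, <4 x double> undef, <2 x i32> <i32 1, i32 3>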
>  
> All of the real/imaginary data manipulation obscures the underlying arithmetic.
> It makes it difficult to reason about the algebraic properties of expressions.
> For expressiveness and ease of optimization, it would be nice to have a
> higher-level representation for complex in LLVM IR.  In general, it is desirable
> to defer lowering of complex until the optimizer has had a reasonable chance to
> exploit its properties.
>  
> First-class support for complex can also improve the user experience.
> Diagnostics could express concepts in the complex domain instead of referring to
> expressions containing shuffles and other low-level data manipulation.  Users
> who wish to examine IR directly will see much less gobbledygook and can more
> easily reason about the IR.
>  
> Types
> -----
>  
> This proposal introduces new aggregate types to represent complex numbers.
>  
> c16      - Complex of half (16-bit float)
> c32      - Complex of float, like float complex or std::complex<float>
> c64      - Complex of double, like double complex or std::complex<double>
> x86_c80  - Complex of x86_fp80
> c128     - Complex of fp128, like long double complex or std::complex<long double>
> ppc_c128 - Complex of ppc_fp128
>  
> Note that the references to C and C++ types above are simply explanatory.
> Nothing in this proposal assumes any particular high-level language type will
> map to the above LLVM types.
>  
> The "underlying type" of a complex type is the type of its real and imaginary
> components.
>  
> The sizes of the complex types are twice those of their underlying types.  The
> real part of the complex shall appear first in the layout of the types.  The
> format of the real and imaginary parts is the same as for the complex type's
> underlying type.  This should map to most common data representations of complex
> in various languages.
>  
> These types are *not* considered floating point types for the purposes of
> Type::isFloatTy and friends, llvm_anyfloat_ty, etc. in order to limit surprises
> when introducing these types.  New APIs will allow querying and creation of
> complex types:
>  
> bool Type::isComplexTy()          const;
> bool Type::isComplex16Ty()        const;
> bool Type::isComplex32Ty()        const;
> bool Type::isComplex64Ty()        const;
> bool Type::isComplexX86_FP80Ty()  const;
> bool Type::isComplex128Ty()       const;
> bool Type::isComplexPPC_FP128Ty() const;
>  
> The types are a special kind of aggregate, giving them access to the insertvalue
> and extractvalue operations with special notation (see below).
>  
> We can define vectors of complex:
>  
> <8 x c16> - Vector of eight complex of 16-bit float (256 bits total)
> <4 x c32> - Vector of four complex of 32-bit float (256 bits total)
> <4 x c64> - Vector of four complex of 64-bit float (512 bits total)
> ...
>  
> Such vectors may be scalable:
>  
> <vscale x 8 x c16>
> <vscale x 4 x c32>
> <vscale x 1 x c64>
> <vscale x 2 x c64>
> ...
>  
> Analogous ValueTypes will be used by intrinsics.
>  
> def c16       : ValueType<32,  uuu>
> def c32       : ValueType<64,  vvv>
> def c64       : ValueType<128, www>
> def x86c80    : ValueType<160, xxx>
> def c128      : ValueType<256, yyy>
> def ppcc128   : ValueType<256, zzz>
>  
> def v8c16     : ValueType<256, aaa>
> def v4c32     : ValueType<256, bbb>
> def v4c64     : ValueType<512, ccc>
> ...
>  
> def nxv8c16   : ValueType<256, ddd>
> def nxv4c32   : ValueType<256, eee>
> def nxv1c64   : ValueType<128, fff>
> def nxv2c64   : ValueType<256, ggg>
> ...
>  
> def llvm_anycomplex_ty : LLVMType<Any>;
> def llvm_c16_ty        : LLVMType<c16>;
> def llvm_c32_ty        : LLVMType<c32>;
> def llvm_c64_ty        : LLVMType<c64>;
> def llvm_x86c80_ty     : LLVMType<x86c80>;
> def llvm_c128_ty       : LLVMType<c128>;
> def llvm_ppcc128_ty    : LLVMType<ppcc128>;
>  
> def llvm_v8c16_ty      : LLVMType<v8c16>;
> def llvm_v4c32_ty      : LLVMType<v4c32>;
> def llvm_v4c64_ty      : LLVMType<v4c64>;
> ...
>  
> The numbering of the ValueTypes will be determined after discussion.  It may be
> desirable to insert the scalar types before the existing vector types, grouping
> them with the other scalar types, or we may want to put them somewhere else.
> Similarly, the vector types may be grouped with the other vector types or
> somewhere else.
>  
> Literals
> --------
>  
> Literal complex values have special spellings '(' <fp constant> '+'|'-'
> <fp constant> 'i' ')':
>  
> %v1 = c64 ( 5.67 + 1.56i )
> %v2 = c64 ( 55.87 - 4.23i )
> %v3 = c64 ( 55.87 + -4.23i )
> %v4 = c32 ( 8.24e+2 + 0.0i )
> %v5 = c16 ( 0.0 + 0.0i )
>  
> Note that the literal representation requires an explicit specification of the
> imaginary part, even if zero.  A "redundant" <+ negative imaginary> is allowed
> to facilitate reuse of floating point constants.
>  
> Operations
> ----------
>  
> This proposal overloads existing floating point instructions for complex types
> in order to leverage existing expression optimizations:
>  
> c64 %res   = fadd c64 %a, c64 %b
> v8c64 %res = fsub v8c64 %a, v8c64 %b
> c128 %res  = fmul c128 %a, c128 %b
> v4c64 %res = fdiv v4c64 %a, v4c64 %b
>  
> The only valid comparisons of complex values shall be equality:
>  
> i1 %res = eq c32 %a, c32 %b
> v8i1 %res = eq v8c32 %a, v8c32 %b
> i1 %res = ne c64 %a, c64 %b
> v8i1 %res = ne v8c64 %a, v8c64 %b
>  
> select is defined for complex:
>  
> c32 %res = select i1 %cmp, c32 %a, c32 %b
> v4c64 %res = select v4i1 %cmp, v4c64 %a, v4c64 %b
>  
> Complex values may be casted to other complex types:
>  
> c32 %res = fptrunc c64 %a to c32
> c64 %res = fpext c32 %a to c64
>  
> As a special case, vectors of complex may be bitcasted to vectors of their
> underlying type:
>  
> v8f32 %res = bitcast <4 x c32> %a to <8 x float>
>  
> Complex types were defined as aggregates above, but special ones.  One aspect of
> their specialness is allowing bitcasts of vector of complex to equal-width
> vectors of their underlying type.
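>  
> Assuming the bitcast is also allowed in the reverse direction (vector of
> underlying type to vector of complex), a vector of complex can be assembled
> from separate real and imaginary vectors with a shuffle plus a bitcast, as an
> alternative to the czip intrinsic defined below, for example:
>  
> %ilv  = shufflevector <4 x float> %re, <4 x float> %im,
>                       <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
> %cplx = bitcast <8 x float> %ilv to <4 x c32>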
>  
> insertvalue and extractvalue may be used with the special index values "real"
> and "imag":
>  
> f32 %re   = extractvalue c32 %a, real
> c64 %tmp  = insertvalue c64 undef, f64 %r, real
> c64 %cplx = insertvalue c64 %tmp, f64 %i, imag
>  
> The pseudo-value "real" shall evaluate to the integer constant zero and the
> pseudo-value "imag" shall evaluate to the integer constant one, as if
> extractvalue/insertvalue were written with 0/1.  The use of any other index with
> a complex value is undefined.
>  
> We also overload existing intrinsics:
>  
> declare c16      @llvm.sqrt.c16(c16 %val)
> declare c32      @llvm.sqrt.c32(c32 %val)
> declare c64      @llvm.sqrt.c64(c64 %val)
> declare x86_c80  @llvm.sqrt.x86_c80(x86_c80 %val)
> declare c128     @llvm.sqrt.c128(c128 %val)
> declare ppc_c128 @llvm.sqrt.ppc_c128(ppc_c128 %val)
>  
> declare c16      @llvm.pow.c16(c16 %val, c16 %power)
> declare c32      @llvm.pow.c32(c32 %val, c32 %power)
> declare c64      @llvm.pow.c64(c64 %val, c64 %power)
> declare x86_c80  @llvm.pow.x86_c80(x86_c80 %val, x86_c80 %power)
> declare c128     @llvm.pow.c128(c128 %val, c128 %power)
> declare ppc_c128 @llvm.pow.ppc_c128(ppc_c128 %val, ppc_c128 %power)
>  
> declare c16      @llvm.sin.c16(c16 %val)
> declare c32      @llvm.sin.c32(c32 %val)
> declare c64      @llvm.sin.c64(c64 %val)
> declare x86_c80  @llvm.sin.x86_c80(x86_c80 %val)
> declare c128     @llvm.sin.c128(c128 %val)
> declare ppc_c128 @llvm.sin.ppc_c128(ppc_c128 %val)
>  
> declare c16      @llvm.cos.c16(c16 %val)
> declare c32      @llvm.cos.c32(c32 %val)
> declare c64      @llvm.cos.c64(c64 %val)
> declare x86_c80  @llvm.cos.x86_c80(x86_c80 %val)
> declare c128     @llvm.cos.c128(c128 %val)
> declare ppc_c128 @llvm.cos.ppc_c128(ppc_c128 %val)
>  
> declare c16      @llvm.log.c16(c16 %val)
> declare c32      @llvm.log.c32(c32 %val)
> declare c64      @llvm.log.c64(c64 %val)
> declare x86_c80  @llvm.log.x86_c80(x86_c80 %val)
> declare c128     @llvm.log.c128(c128 %val)
> declare ppc_c128 @llvm.log.ppc_c128(ppc_c128 %val)
>  
> declare half      @llvm.fabs.c16(c16 %val)
> declare float     @llvm.fabs.c32(c32 %val)
> declare double    @llvm.fabs.c64(c64 %val)
> declare x86_fp80  @llvm.fabs.x86_c80(x86_c80 %val)
> declare fp128     @llvm.fabs.c128(c128 %val)
> declare ppc_fp128 @llvm.fabs.ppc_c128(ppc_c128 %val)
>  
> Conversion to/from half-precision overloads the existing intrinsics.
>  
> llvm.convert.to.c16.* - Overloaded intrinsic to convert to c16.
>  
> declare c16 @llvm.convert.to.c16.c32(c32 %val)
> declare c16 @llvm.convert.to.c16.c64(c64 %val)
>  
> llvm.convert.from.c16.* - Overloaded intrinsic to convert from c16.
>  
> declare c32 @llvm.convert.from.c16.c32(c16 %val)
> declare c64 @llvm.convert.from.c16.c64(c16 %val)
>  
> In addition, new intrinsics will be used for complex-specific operations:
>  
> llvm.cconj.* - Overloaded intrinsic to compute the conjugate of a
>                complex value
>  
> declare c16      @llvm.cconj.c16(c16 %val)
> declare c32      @llvm.cconj.c32(c32 %val)
> declare c64      @llvm.cconj.c64(c64 %val)
> declare x86_c80  @llvm.cconj.x86_c80(x86_c80 %val)
> declare c128     @llvm.cconj.c128(c128 %val)
> declare ppc_c128 @llvm.cconj.ppc_c128(ppc_c128 %val)
>  
> llvm.czip.* - Overloaded intrinsic to create a vector of complex from two
>               vectors of floating-point type (not all variants shown)
>  
> declare v4c32 @llvm.czip.v4c32(v4f32 %real, v4f32 %imag)
> declare v4c64 @llvm.czip.v4c64(v4f64 %real, v4f64 %imag)
>  
> llvm.extractreal.* - Overloaded intrinsic to create a vector of floating-point
>                      type from the real portions of a vector of complex (not all
>                      variants shown)
>  
> declare v4f32 @llvm.extractreal.v4c32(v4c32 %val)
> declare v4f64 @llvm.extractreal.v4c64(v4c64 %val)
>  
> llvm.extractimag.* - Overloaded intrinsic to create a vector of floating-point
>                      type from the imaginary portions of a vector of complex
>                      (not all variants shown)
>  
> declare v4f32 @llvm.extractimag.v4c32(v4c32 %val)
> declare v4f64 @llvm.extractimag.v4c64(v4c64 %val)
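>  
> For example, a vector of complex can be split into its real and imaginary
> planes and recombined, leaving the value unchanged:
>  
> v4f32 %re   = call v4f32 @llvm.extractreal.v4c32(v4c32 %v)
> v4f32 %im   = call v4f32 @llvm.extractimag.v4c32(v4c32 %v)
> v4c32 %back = call v4c32 @llvm.czip.v4c32(v4f32 %re, v4f32 %im)   ; %back == %v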
>  
> Masked intrinsics are also overloaded.  The complex types are considered a
> single logical entity and thus the mask bits correspond to the complex value as
> a whole, not the individual real and imaginary parts:
>  
> llvm.masked.load.* - Overloaded intrinsic to load complex under mask
> (not all variants shown)
>  
> declare v4c32 @llvm.masked.load.v4c32.p0v4c32(<4 x c32>* %ptr,
>                                               i32 %alignment,
>                                               <4 x i1> %mask,
>                                               <4 x c32> %passthrough)
>  
> declare v8c64 @llvm.masked.load.v8c64.p0v8c64(<8 x c64>* %ptr,
>                                               i32 %alignment,
>                                               <8 x i1> %mask,
>                                               <8 x c64> %passthrough)
>  
> llvm.masked.store.* - Overloaded intrinsic to store complex under mask (not all
>                       variants shown)
>  
> declare void @llvm.masked.store.v4c32.p0v4c32(<4 x c32> %val,
>                                               <4 x c32>* %ptr,
>                                               i32 %alignment,
>                                               <4 x i1> %mask)
>  
> declare void @llvm.masked.store.v8c64.p0v8c64(<8 x c64> %val,
>                                               <8 x c64>* %ptr,
>                                               i32 %alignment,
>                                               <8 x i1> %mask)
>  
> llvm.masked.gather.* - Overloaded intrinsic to gather complex under mask (not
>                        all variants shown)
>  
> declare v4c32 @llvm.masked.gather.v4c32.p0v4c32(<4 x c32 *> %ptrs,
>                                                 i32 %alignment,
>                                                 <4 x i1> %mask,
>                                                 <4 x c32> %passthrough)
>  
> declare v8c64 @llvm.masked.gather.v8c64.p0v8c64(<8 x c64*> %ptrs,
>                                                 i32 %alignment,
>                                                 <8 x i1> %mask,
>                                                 <8 x c64> %passthrough)
>  
> llvm.masked.scatter.* - Overloaded intrinsic to scatter complex under mask (not
>                         all variants shown)
>  
> declare void @llvm.masked.scatter.v4c32.p0v4c32(<4 x c32> %val,
>                                                 <4 x c32*> %ptrs,
>                                                 i32 %alignment,
>                                                 <4 x i1> %mask)
>  
> declare void @llvm.masked.scatter.v8c64.p0v8c64(<8 x c64> %val,
>                                                 <8 x c64*> %ptrs,
>                                                 i32 %alignment,
>                                                 <8 x i1> %mask)
>  
> llvm.masked.expandload.* - Overloaded intrinsic to expandload complex under mask
>                            (not all variants shown)
>  
> declare v4c32 @llvm.masked.expandload.v4c32.p0v4c32(c32* %ptr,
>                                                     <4 x i1> %mask,
>                                                     <4 x c32> %passthrough)
>  
> declare v8c64 @llvm.masked.expandload.v8c64.p0v8c64(c64* %ptr,
>                                                     <8 x i1> %mask,
>                                                     <8 x c64> %passthrough)
>  
> llvm.masked.compressstore.* - Overloaded intrinsic to compressstore complex
>                               under mask (not all variants shown)
>  
> declare void @llvm.masked.compressstore.v4c32.p0v4c32(<4 x c32> %val,
>                                                        c32* %ptr,
>                                                        <4 x i1> %mask)
>  
> declare void @llvm.masked.compressstore.v8c64.p0v8c64(<8 x c64> %val,
>                                                       c64* %ptr,
>                                                       <8 x i1> %mask)
>  
> Conclusion
> ----------
>  
> This proposal introduces new complex types, overloads existing floating-point
> instructions and intrinsics for common complex operations, and adds new
> intrinsics for complex-specific operations.
>  
> Goals of this work include better reasoning about complex operations within
> LLVM, leading to better optimization, reporting and overall user experience.
>  
> This is a draft and subject to change.
>  
> [1] http://lists.llvm.org/pipermail/llvm-dev/2019-July/133558.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2019-August/134815.html
> [3] http://lists.llvm.org/pipermail/llvm-dev/2019-April/131516.html
> [4] http://lists.llvm.org/pipermail/llvm-dev/2019-April/131523.html
> [5] http://lists.llvm.org/pipermail/llvm-dev/2010-December/037072.html
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
