[cfe-dev] RFC: Add New Set of Vector Math Builtins

Mon Sep 27 11:50:17 PDT 2021

Hi,

I would like to provide a convenient way for our users to perform additional operations like min, max, abs and round on vector (and possibly matrix) types. To do so, I’d like to propose adding a new set of builtins that operate on vector-like types. The new builtins can be used to perform each operation element-wise and as a reduction over the elements of a vector or a matrix. The proposal also includes arithmetic reductions. Those are performed pairwise, rather than sequentially, to ensure they can be lowered directly to vector instructions on targets like AArch64 and X86.

I considered overloading the existing math builtins, but think we can provide a better and more consistent user experience with a new set of builtins. The drawbacks of overloading the math builtins are discussed at the end.

Below is the proposed specification for the element-wise and reduction builtins. They will be lowered to the corresponding LLVM intrinsics. Note that not all possible builtins are listed here. Once we agree on the general scheme, it should be easy to add additional builtins.

Specification of the element-wise builtins with 1 operand:

Provided builtins:
	• __builtin_elementwise_abs
	• __builtin_elementwise_ceil
	• __builtin_elementwise_floor
	• __builtin_elementwise_rint
	• __builtin_elementwise_round
	• __builtin_elementwise_trunc

---- 
T __builtin_elementwise_<name>(T x)

T must be one of the following types:
	• an integer type (as in C2x 6.2.5p19), but excluding enumerated types and _Bool
	• the standard floating types float or double
	• a half-precision floating point type, if one is supported on the target
	• a vector or matrix type.

For scalar types, consider the operation applied to a vector with a single element. For matrix types, consider the matrix as a vector formed by concatenating its columns for the definitions below.

Returns: A vector Res equivalent to applying fn element-wise to the input, where fn depends on (name, element type of T):

	• (abs, floating point type) → return the absolute value of a floating-point number x
	• (abs, integer type) → (x < 0) ? x * -1 : x
	• (ceil, floating point type) → return the smallest integral value greater than or equal to x
	• (ceil, integer type) → invalid
	• (floor, floating point type) → return the largest integral value less than or equal to x
	• (floor, integer type) → invalid
	• (rint, floating point type) → return the integral value nearest to x (according to the prevailing rounding mode) in floating-point format
	• (rint, integer type) → invalid
	• (round, floating point type) → return the integral value nearest to x rounding half-way cases away from zero, regardless of the current rounding direction
	• (round, integer type) → invalid
	• (trunc, floating point type) → return the integral value nearest to but no larger in magnitude than x
	• (trunc, integer type) → invalid

Special values:
Unless specified otherwise, fn(±0) = ±0 and fn(±infinity) = ±infinity

T Res
for (int I = 0; I < NumElements; ++I)
  Res[I] = fn(x[I])
----
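As a rough illustration of the loop above, here is a plain-C sketch that models the integer case of abs on a fixed-size array. It does not use the proposed builtins; the names int_vec4 and elementwise_abs_model are made up for this example. Note that the formula overflows for the most negative integer value, and the text above does not say how that case should be handled.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 4-element integer vector type. */
typedef struct { int v[4]; } int_vec4;

/* Scalar model of the element-wise abs for integer elements:
   applies fn(x) = (x < 0) ? x * -1 : x to every element.
   The most negative int is not handled here (x * -1 would overflow). */
int_vec4 elementwise_abs_model(int_vec4 x) {
  int_vec4 res;
  for (size_t i = 0; i < 4; ++i)
    res.v[i] = (x.v[i] < 0) ? x.v[i] * -1 : x.v[i];
  return res;
}
```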

Specification of the element-wise builtins with 2 operands:

Provided builtins:
	• __builtin_elementwise_max
	• __builtin_elementwise_min

---- 
T __builtin_elementwise_<name>(T x, T y)

T must be one of the following types:
	• an integer type (as in C2x 6.2.5p19), but excluding enumerated types and _Bool
	• the standard floating types float or double
	• a half-precision floating point type, if one is supported on the target
	• a vector or matrix type.

For scalar types, consider the operation applied to a vector with a single element. For matrix types, consider the matrix as a vector formed by concatenating its columns for the definitions below.

Returns: A vector Res equivalent to applying fn(x, y) element-wise to the inputs, where fn depends on the name:

	• min → return x or y, whichever is smaller
	• max → return x or y, whichever is larger

Special values:
Unless otherwise specified, the following holds. If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN.

T Res
for (int I = 0; I < NumElements; ++I)
  Res[I] = fn(x[I], y[I])
----
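To make the NaN rule concrete, here is a small plain-C model of the min case. It is only a sketch of the semantics described above, not the builtin itself; float_vec4, min_model and elementwise_min_model are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 4-element float vector type. */
typedef struct { float v[4]; } float_vec4;

/* fn for min. A NaN is detected with x != x, so no libm call is needed. */
float min_model(float x, float y) {
  if (x != x) return y;  /* x is NaN: return the other argument (NaN if both are) */
  if (y != y) return x;  /* only y is NaN: return x */
  return (y < x) ? y : x;
}

/* Applies min_model to each pair of elements. */
float_vec4 elementwise_min_model(float_vec4 x, float_vec4 y) {
  float_vec4 res;
  for (size_t i = 0; i < 4; ++i)
    res.v[i] = min_model(x.v[i], y.v[i]);
  return res;
}
```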

Specification of reduction builtins:

Provided builtins:
	• __builtin_reduce_min
	• __builtin_reduce_max
	• __builtin_reduce_add
	• __builtin_reduce_and
	• __builtin_reduce_or
	• __builtin_reduce_xor

---- 
ET __builtin_reduce_<name>(VT a)

VT must be a vector or matrix type with element type ET. For matrix types, consider the matrix as a vector formed by concatenating its columns for the definitions below.

Returns: A scalar Res equivalent to applying fn as a pairwise tree reduction to the input, where fn(x, y) depends on the name:

	• min → return x or y, whichever is smaller. If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN.
	• max → return x or y, whichever is larger. If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN.
	• add → +
	• and → & (integer (element) types only)
	• or → | (integer (element) types only)
	• xor → ^ (integer (element) types only)
----
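The difference between a sequential and a pairwise tree reduction can be sketched in plain C as follows. This is only a model: the spec above does not say which elements are paired, so combining adjacent elements first is an assumption, and reduce_add_sequential, reduce_add_pairwise and MAX_ELEMS are names invented for this example.

```c
#include <stddef.h>

enum { MAX_ELEMS = 16 };  /* arbitrary cap for this sketch */

/* Sequential order: ((a[0] + a[1]) + a[2]) + ...
   Requires n >= 1. */
float reduce_add_sequential(const float *a, size_t n) {
  float acc = a[0];
  for (size_t i = 1; i < n; ++i)
    acc += a[i];
  return acc;
}

/* Pairwise tree order (assumed pairing): (a[0] + a[1]) + (a[2] + a[3]), ...
   Requires 1 <= n <= MAX_ELEMS. Works on a copy so the input is unchanged. */
float reduce_add_pairwise(const float *a, size_t n) {
  float buf[MAX_ELEMS];
  for (size_t i = 0; i < n; ++i)
    buf[i] = a[i];
  /* Each pass folds partial sums that are 'stride' elements apart. */
  for (size_t stride = 1; stride < n; stride *= 2)
    for (size_t i = 0; i + stride < n; i += 2 * stride)
      buf[i] += buf[i + stride];
  return buf[0];
}
```

Floating-point addition is not associative, so the two orders can give different results for some inputs. That is why the reduction order has to be part of the specification.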

The intended semantics should allow for the following lowering using LLVM intrinsics (using __builtin_{elementwise,reduce}_min as example):

declare float @llvm.vector.reduce.fmin.v4f32(<4 x float>)
define float @float_min_red(<4 x float> %a) {
  %r = call float @llvm.vector.reduce.fmin.v4f32(<4 x float> %a)
  ret float %r
}

declare <4 x float> @llvm.minnum.v4f32(<4 x float>, <4 x float>)
define <4 x float> @float_min(<4 x float> %a, <4 x float> %b) {
  %r = call <4 x float> @llvm.minnum.v4f32(<4 x float> %a, <4 x float> %b)
  ret <4 x float> %r
}

declare <4 x i32> @llvm.smin.v4i32(<4 x i32>, <4 x i32>)
define <4 x i32> @int_min(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.smin.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

declare i32 @llvm.vector.reduce.smin.v4i32(<4 x i32>)
define i32 @int_min_red(<4 x i32> %b) {
  %r = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> %b)
  ret i32 %r
}
https://llvm.godbolt.org/z/7Yv3v8aP9
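On the C side, a call site for the float_min example might look like the sketch below. It assumes a compiler that reports the proposed __builtin_elementwise_min through __has_builtin and otherwise falls back to a scalar loop that follows the NaN rule from the spec. The float4 typedef and the HAVE_ELEMENTWISE_MIN macro are choices made for this example only.

```c
/* Detect the proposed builtin where the compiler supports __has_builtin. */
#ifdef __has_builtin
#  if __has_builtin(__builtin_elementwise_min)
#    define HAVE_ELEMENTWISE_MIN 1
#  endif
#endif

/* GCC/Clang vector extension: four floats in one vector value. */
typedef float float4 __attribute__((vector_size(4 * sizeof(float))));

float4 float_min(float4 a, float4 b) {
#ifdef HAVE_ELEMENTWISE_MIN
  return __builtin_elementwise_min(a, b);
#else
  /* Fallback model: pick a[i] if b[i] is NaN or a[i] is smaller, else b[i]. */
  float4 r = a;
  for (int i = 0; i < 4; ++i)
    r[i] = (b[i] != b[i] || a[i] < b[i]) ? a[i] : b[i];
  return r;
#endif
}
```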

Alternatives Considered

Instead of adding a set of completely new builtins, we could overload the existing libm-based builtins (__builtin_fminf & co). I think the main issue with that approach is convenience for users. The libm-based builtins encode the type in their name, which means they cover only a small subset of types. For example, AFAIK there’s no builtin for integer min/max or for floating-point max on 16-bit floating-point types. The proposal above provides a set of more user-friendly builtins that work across a large range of types, similar to the existing math operators (i.e. there’s a single __builtin_max which works with both integers and floating point, as well as vector & matrix versions). The reduction versions are also explicitly marked in the name.

Cheers
Florian