[cfe-dev] _Float16 support

Wed Jan 23 15:29:46 PST 2019

Hey Andy,

On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
> I'd like to start a discussion about how clang supports _Float16 for target architectures that don't have direct support for 16-bit floating point arithmetic.

Thanks for bringing this up; we'd also like better support, for SysV
x86-64 specifically.  AArch64 is mostly fine, and ARM is usable with
+fp16.

I'm not sure much of this discussion generalizes across platforms
though (beyond Craig's potential bug fix?).  I guess the
"target-independent" question is: should we allow this kind of
"legalization" in the vreg assignment code at all? (I think that's
where it all comes from: RegsForValue, TLI::get*Register*.)
It's convenient for experimental frontends: you can use weird types
(half, i3, ...) without worrying too much about it, and you usually
get something self-consistent out of the backend.  Eventually, though,
you do need to worry about it and make the calling convention
explicit.  But I guess that's a discussion for the other thread ;)

> The current clang language extensions documentation says, "If half-precision instructions are unavailable, values will be promoted to single-precision, similar to the semantics of __fp16 except that the results will be stored in single-precision." This is somewhat vague (to me) as to what is meant by promotion of values, and the part about results being stored in single-precision isn't what actually happens.
>
> Consider this example:
>
> _Float16 x;
> _Float16 f(_Float16 y, _Float16 z) {
>   x = y * z;
>   return x;
> }
>
> When compiling with “-march=core-avx2” that results (after some trivial cleanup) in this IR:
>
> @x = global half 0xH0000, align 2
> define half @f(half, half) {
>   %3 = fmul half %0, %1
>   store half %3, half* @x
>   ret half %3
> }
>
> That’s not too unreasonable I suppose, except for the fact that it hasn’t taken the lack of target support for half-precision arithmetic into account yet. That will happen in the selection DAG. The assembly code generated looks like this (with my annotations):
>
> f:                                      # @f
> # %bb.0:
>         vcvtps2ph       xmm1, xmm1, 4           # Convert argument 1 from single to half
>         vcvtph2ps       xmm1, xmm1              # Convert argument 1 back to single
>         vcvtps2ph       xmm0, xmm0, 4           # Convert argument 0 from single to half
>         vcvtph2ps       xmm0, xmm0              # Convert argument 0 back to single
>         vmulss          xmm0, xmm0, xmm1        # xmm0 = xmm0*xmm1 (single precision)
>         vcvtps2ph       xmm1, xmm0, 4           # Convert the single precision result to half
>         vmovd           eax, xmm1               # Move the half precision result to eax
>         mov             word ptr [rip + x], ax  # Store the half precision result in the global, x
>         ret                                     # Return the single precision result still in xmm0
> .Lfunc_end0:
>                                         # -- End function
>
> Something odd has happened here, and it may not be obvious what it is. This code begins by converting xmm0 and xmm1 from single to half and then back to single. The first conversion is happening because the back end decided that it needed to change the types of the parameters to single precision but the function body is expecting half precision values. However, since the target can’t perform the required computation with half precision values they must be converted back to single for the multiplication. The single precision result of the multiplication is converted to half precision to be stored in the global value, x, but the result is returned as single precision (via xmm0).
>
> I’m not primarily worried about the extra conversions here. We can’t get rid of them because we can’t prove they aren’t rounding, but that’s a secondary issue. What I’m worried about is that we allowed/required the back end to improvise an ABI to satisfy the incoming IR, and the choice it made is questionable.
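
To make the sequence in the annotated assembly concrete, here is a
rough C model of it.  This is only a sketch: the names
(float_to_half_bits, half_bits_to_float, f_model, x_bits) are made up
for illustration and are not anything in clang or LLVM, and the
conversions are done in software, assuming the default
round-to-nearest-even mode, rather than with vcvtps2ph/vcvtph2ps.

```c
#include <stdint.h>
#include <string.h>

/* binary32 -> binary16 bit pattern, round-to-nearest-even.
   Simplification (assumption): NaN payloads are not preserved. */
static uint16_t float_to_half_bits(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t exp = (x >> 23) & 0xFFu;
    uint32_t man = x & 0x7FFFFFu;

    if (exp == 0xFFu) /* Inf or NaN */
        return (uint16_t)(sign | 0x7C00u | (man ? 0x200u : 0u));

    int32_t e = (int32_t)exp - 127 + 15; /* rebias the exponent */
    if (e >= 31) /* too large for half: overflow to infinity */
        return (uint16_t)(sign | 0x7C00u);

    if (e <= 0) { /* result is a half subnormal or zero */
        if (e < -10)
            return (uint16_t)sign; /* below half of the smallest subnormal */
        man |= 0x800000u; /* make the implicit leading 1 explicit */
        uint32_t shift = (uint32_t)(14 - e); /* 14..24 */
        uint32_t half_man = man >> shift;
        uint32_t rem = man & ((1u << shift) - 1u);
        uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (half_man & 1u)))
            half_man++; /* a carry into the exponent field is correct */
        return (uint16_t)(sign | half_man);
    }

    uint32_t half = ((uint32_t)e << 10) | (man >> 13);
    uint32_t rem = man & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (half & 1u)))
        half++; /* may carry up to infinity, which is the right result */
    return (uint16_t)(sign | half);
}

/* binary16 bit pattern -> binary32 (always exact). */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    uint32_t exp = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t man = (uint32_t)h & 0x3FFu;
    uint32_t x;

    if (exp == 0x1Fu) { /* Inf or NaN */
        x = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0u) { /* normal number */
        x = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0u) { /* signed zero */
        x = sign;
    } else { /* half subnormal: normalize it */
        exp = 113u;
        while (!(man & 0x400u)) {
            man <<= 1;
            exp--;
        }
        x = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}

/* Models the global `_Float16 x` as its raw 16-bit pattern. */
static uint16_t x_bits;

/* Models the emitted code for f(): arguments arrive as floats. */
static float f_model(float y, float z) {
    y = half_bits_to_float(float_to_half_bits(y)); /* vcvtps2ph + vcvtph2ps */
    z = half_bits_to_float(float_to_half_bits(z)); /* vcvtps2ph + vcvtph2ps */
    float product = y * z;                         /* vmulss */
    x_bits = float_to_half_bits(product);          /* vcvtps2ph + vmovd + mov */
    return product; /* returned unrounded, still single precision */
}
```

With y = z = 1.0009765625 (that is, 1 + 2^-10), x_bits ends up as
0x3C02 (1 + 2^-9), while the returned float still carries the extra
2^-20 term, so the value stored in x and the value returned differ.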

As Richard said, an ABI rule emerged from the implementation, and I
believe we should solidify it, so here's a simple strawman proposal:
pass scalars in the low 16 bits of SSE registers, don't change the
memory layout, and pack them in vectors of 16-bit elements.  That
matches the only ISA extension so far (the ph<->ps conversions): it
fits well with those conversions (as opposed to i16 coercion) and
with vectors (as opposed to f32 promotion).  To my knowledge, there
hasn't been any alternative ABI proposal (but I haven't looked in a
year or two).  It's interesting because we technically have no way
of accessing scalars (so we have the same problems as i8/i16 vector
elements, but without the saving grace of matching GPRs as on x86,
or direct copies as on AArch64), and there aren't even any scalar
operations.

Any thoughts?  We can suggest this to x86-psABI if folks think this is
a good idea. (I don't know about other ABIs or other architectures
though).
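
For what it's worth, the register layout in this strawman can be
pictured with a small C model.  This is only a sketch under
assumptions: xmm_model and the helper names are hypothetical, the
upper 112 bits of a scalar are zeroed here purely for definiteness
(the proposal above does not say what they hold), and half values are
carried as raw 16-bit patterns.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit SSE register viewed as eight
   16-bit lanes, i.e. a packed vector of 16-bit elements. */
typedef struct {
    uint16_t lane[8];
} xmm_model;

/* Scalar half: lives in the low 16 bits (lane 0) of the register.
   Zeroing the other lanes is an assumption made for this sketch. */
static xmm_model pass_scalar_half(uint16_t half_bits) {
    xmm_model r;
    memset(&r, 0, sizeof r);
    r.lane[0] = half_bits;
    return r;
}

/* insert_vector_elt: overwrite a single 16-bit lane and leave the
   others alone, which is the effect a pinsrw-style lowering has. */
static xmm_model insert_half_elt(xmm_model v, unsigned idx, uint16_t half_bits) {
    v.lane[idx & 7u] = half_bits;
    return v;
}

/* extract_vector_elt: move one lane's bits into the scalar position. */
static xmm_model extract_half_elt(xmm_model v, unsigned idx) {
    return pass_scalar_half(v.lane[idx & 7u]);
}
```

Nothing here touches the value as a float: in this picture scalars
and vector elements are just 16-bit payloads in a register, and any
arithmetic would still need a conversion to single precision first.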

Concretely, this means little or no change in IRGen.  As for the SDAG
implementation, this is an unusual situation.  I did some
experimentation a long time ago.  We can make the types legal, even
though no operations are.  It's relatively straightforward to promote
all operations (and we made sure that worked years ago for AArch64,
for the pre-v8.2 mode), but vectors are fun, because of build_vector
(where it helps to have the truncating behavior we have for integers,
but for fp), extract_vector_elt (where you need the matching extend),
and insert_vector_elt (which you have to lower using some movd and/or
pinsrw trickery, if you want to avoid the generic slow via-memory
fallback).
Alternatively, we can immediately, in call lowering/register
assignment logic (this covers the SDAG cross-BB vreg assignments Craig
mentions) promote to f32 "via" i16.  I'm afraid I don't remember the
arguments one way or the other, but I can dust off my old patches and
put them up on Phabricator.

-Ahmed

>
> For a point of comparison, I looked at what gcc does. Currently, gcc only allows _Float16 in C, not C++, and if you try to use it with a target that doesn’t have native support for half-precision arithmetic, it tells you “’_Float16’ is not supported on this target.” That seems preferable to making up an ABI on the fly.
>
> I haven’t looked at what happens with clang when compiling for other targets that don’t have native support for half-precision arithmetic, but I would imagine that similar problems exist.
>
> Thoughts?
>
> Thanks,
> Andy