[cfe-dev] _Float16 support

Thu Jan 24 10:57:54 PST 2019

On 24 Jan 2019, at 4:46, Sjoerd Meijer wrote:

> Hello,
>
> I added _Float16 support to Clang and codegen support in the AArch64 
> and ARM backends, but have not looked into x86. Ahmed is right: 
> AArch64 is fine, only a few ACLE intrinsics are missing. ARM has rough 
> edges: scalar codegen should be mostly fine, vector codegen needs some 
> more work.
>
> Implementation for AArch64 was mostly straightforward (it only has 
> hard float ABI, and has half register/type support), but for ARM it 
> was a huge pain to plumb f16 support because of different ABIs 
> (hard/soft), different architecture extensions of FP and FP16 support, 
> and the existence of another half-precision type with different 
> semantics. Sounds like you're doing a similar exercise, and yes, 
> argument passing was one of the trickiest parts.
>
>
>> IR and SelectionDAG representational choices aside, it seems to me
>> that, like GCC, Clang should not be permitting _Float16 on any target
>> that doesn't specify an ABI for it, because otherwise we're just
>> creating future compatibility problems for that target.  I'm surprised
>> and disappointed that it wasn't implemented this way.
>
> Apologies, I missed that.

It's alright, oversights happen (in both patch-writing and review).  Can 
we get a volunteer to do the work to restrict this now?  I'm a little 
crushed.

John.

>
> Sjoerd.
>
> ________________________________
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Kaylor, 
> Andrew via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 24 January 2019 00:23
> To: Ahmed Bougacha; Lu, Hongjiu
> Cc: llvm-dev; cfe-dev at lists.llvm.org
> Subject: Re: [llvm-dev] [cfe-dev] _Float16 support
>
> It seems that there are several issues here:
>
> 1. Should the front end be concerned with whether or not the IR that 
> it is emitting can be translated into a well-defined IR?
> 2. How should the selection DAG handle data types whose representation 
> isn't defined by the ABI we're targeting?
> 3. What should the ABI do with half-precision floats?
>
> Working backward...
>
> The third question here is obviously target specific. I've talked to 
> HJ Lu about this, and he's working on an update to the x86 psABI. I 
> believe that his eventual proposal will follow the lines of what you 
> (Ahmed) suggested below, but I'm not completely proficient at 
> comprehending ABI definitions so there may be some subtlety that I am 
> misunderstanding in what he told me. I also talked to Craig about 
> what would be involved in making the LLVM x86 backend handle 'half' values 
> this way. That involves a good bit of work, but it can be done.
>
> The second question above probably involves a mix of 
> target-independent and target-specific code. Right now the selection 
> DAG code is operating on the assumption that it needs to do 
> *something* with any IR it is given. It tries to make a reasonable 
> choice, and the choice is consistent and predictable but not 
> necessarily what the user expects. It seems like we should at the very 
> least be producing a diagnostic so the user knows what we did (or even 
> just that we did something). Then there are the specific problems 
> Craig has brought up with the way we're currently handling 'half' 
> values. Would defining a legal f16 type take care of those problems?
>
> The first question exposes my lack of understanding of the proper role 
> of the front end. It isn't clear to me what responsibility the front 
> end has for enforcing conformance to the ABI. As a user of the 
> compiler, I would like the compiler to tell me when code I've written 
> can't be represented using the ABI I am targeting. Whether the front 
> end should detect this or the backend, I don't know. I suppose it's 
> also an open question how strictly this should be enforced. Is it a 
> warning that can be elevated to an error at the user's discretion? Is 
> it something that should be blocked by default but enabled by a 
> user-specified option? Should it always be rejected?
>
> -Andy
>
> -----Original Message-----
> From: Ahmed Bougacha <ahmed.bougacha at gmail.com>
> Sent: Wednesday, January 23, 2019 3:30 PM
> To: Kaylor, Andrew <andrew.kaylor at intel.com>
> Cc: cfe-dev at lists.llvm.org; llvm-dev <llvm-dev at lists.llvm.org>; Craig 
> Topper <craig.topper at gmail.com>; Richard Smith <richard at metafoo.co.uk>
> Subject: Re: [cfe-dev] _Float16 support
>
> Hey Andy,
>
> On Tue, Jan 22, 2019 at 10:38 AM Kaylor, Andrew via cfe-dev 
> <cfe-dev at lists.llvm.org> wrote:
>> I'd like to start a discussion about how clang supports _Float16 for 
>> target architectures that don't have direct support for 16-bit 
>> floating point arithmetic.
>
> Thanks for bringing this up;  we'd also like to get better support, 
> for sysv x86-64 specifically - AArch64 is mostly fine, and ARM is 
> usable with +fp16.
>
> I'm not sure much of this discussion generalizes across platforms 
> though (beyond Craig's potential bug fix?).  I guess the 
> "target-independent" question is: should we allow this kind of 
> "legalization" in the vreg assignment code at all? (I think that's 
> where it all comes from: RegsForValue, TLI::get*Register*) It's 
> convenient for experimental frontends: you can use weird types (half, 
> i3, ...) without worrying too much about it, and you usually get 
> something self-consistent out of the backend.  But you eventually need 
> to worry about it and need to make the calling convention explicit.  
> But I guess that's a discussion for the other thread ;)
>
>> The current clang language extensions documentation says, "If 
>> half-precision instructions are unavailable, values will be promoted 
>> to single-precision, similar to the semantics of __fp16 except that 
>> the results will be stored in single-precision." This is somewhat 
>> vague (to me) as to what is meant by promotion of values, and the 
>> part about results being stored in single-precision isn't what 
>> actually happens.
>>
>> Consider this example:
>>
>> _Float16 x;
>> _Float16 f(_Float16 y, _Float16 z) {
>>   x = y * z;
>>   return x;
>> }
>>
>> When compiling with “-march=core-avx2” that results (after some 
>> trivial cleanup) in this IR:
>>
>> @x = global half 0xH0000, align 2
>> define half @f(half, half) {
>>   %3 = fmul half %0, %1
>>   store half %3, half* @x
>>   ret half %3
>> }
>>
>> That’s not too unreasonable I suppose, except for the fact that it 
>> hasn’t taken the lack of target support for half-precision 
>> arithmetic into account yet. That will happen in the selection DAG. 
>> The assembly code generated looks like this (with my annotations):
>>
>> f:                                      # @f
>> # %bb.0:
>>         vcvtps2ph  xmm1, xmm1, 4        # Convert argument 1 from single to half
>>         vcvtph2ps  xmm1, xmm1           # Convert argument 1 back to single
>>         vcvtps2ph  xmm0, xmm0, 4        # Convert argument 0 from single to half
>>         vcvtph2ps  xmm0, xmm0           # Convert argument 0 back to single
>>         vmulss     xmm0, xmm0, xmm1     # xmm0 = xmm0*xmm1 (single precision)
>>         vcvtps2ph  xmm1, xmm0, 4        # Convert the single precision result to half
>>         vmovd      eax, xmm1            # Move the half precision result to eax
>>         mov        word ptr [rip + x], ax  # Store the half precision result in the global, x
>>         ret                             # Return the single precision result still in xmm0
>> .Lfunc_end0:
>>                                         # -- End function
>>
>> Something odd has happened here, and it may not be obvious what it 
>> is. This code begins by converting xmm0 and xmm1 from single to half 
>> and then back to single. The first conversion is happening because 
>> the back end decided that it needed to change the types of the 
>> parameters to single precision but the function body is expecting 
>> half precision values. However, since the target can’t perform the 
>> required computation with half precision values they must be 
>> converted back to single for the multiplication. The single precision 
>> result of the multiplication is converted to half precision to be 
>> stored in the global value, x, but the result is returned as single 
>> precision (via xmm0).
>>
>> I’m not primarily worried about the extra conversions here. We 
>> can’t get rid of them because we can’t prove they aren’t 
>> rounding, but that’s a secondary issue. What I’m worried about is 
>> that we allowed/required the back end to improvise an ABI to satisfy 
>> the incoming IR, and the choice it made is questionable.
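
The behavior described above can be modeled in portable C without any _Float16 support. This is only a rough sketch: it assumes `float` is IEEE-754 binary32, it models just the significand rounding of the ps->ph->ps round trip (not half's exponent range, so inputs are asserted to be zero or within the normal half range), and the names `round_to_half_precision`, `f_result`, and `simulate_f` are made up for illustration. It shows why the round-trip conversions cannot be dropped, and that the value stored in x can differ from the value left in xmm0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Round a float to the nearest value with an 11-bit significand
   (10 stored fraction bits, as in IEEE binary16), ties to even.
   Assumption: only zero and magnitudes in half's normal range are
   modeled; subnormals, overflow, and NaN are out of scope. */
static float round_to_half_precision(float v) {
    float mag = v < 0.0f ? -v : v;
    assert(mag == 0.0f || (mag >= 0x1p-14f && mag <= 65504.0f));

    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);
    /* binary32 has 23 fraction bits; drop the low 13 with
       round-to-nearest-even. */
    uint32_t lsb = (bits >> 13) & 1u;
    bits += 0x0FFFu + lsb;
    bits &= ~UINT32_C(0x1FFF);
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* Hypothetical record of the two results the generated code produces. */
typedef struct {
    float stored_x;       /* what ends up in the global x (rounded to half) */
    float returned_xmm0;  /* what is left in xmm0 (still single precision) */
} f_result;

/* Mirrors the instruction sequence: arguments arrive as single, are
   rounded through half, multiplied in single, and only the stored copy
   is rounded back to half. */
static f_result simulate_f(float y, float z) {
    float y_h = round_to_half_precision(y);
    float z_h = round_to_half_precision(z);
    float product = y_h * z_h;

    f_result r;
    r.stored_x = round_to_half_precision(product);
    r.returned_xmm0 = product;
    return r;
}
```

With y = z = 1 + 2^-10 (exactly representable in half), the single precision product is 1 + 2^-9 + 2^-20, while the value stored to x rounds to 1 + 2^-9, so a caller reading xmm0 sees a different number than one reading x.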
>
> As Richard said, an ABI rule emerged from the implementation, and I 
> believe we should solidify it, so here's a simple strawman proposal:
> pass scalars in the low 16 bits of SSE registers, don't change the 
> memory layout, and pack them in vectors of 16-bit elements.  That 
> matches the only ISA extension so far (ph<>ps conversions), and fits 
> well with that (as opposed to i16 coercion) as well as vectors (as 
> opposed to f32 promotion).  To my knowledge, there hasn't been any 
> alternative ABI proposal (but I haven't looked in 1 or 2 years).  It's 
> interesting because we technically have no way of accessing scalars 
> (so we have the same problems as i8/i16 vector elements, but without 
> the saving grace of having matching GPRs - x86, or direct copies - 
> aarch64), and there are not even any scalar operations.
>
> Any thoughts?  We can suggest this to x86-psABI if folks think this is 
> a good idea. (I don't know about other ABIs or other architectures 
> though).
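
As a rough illustration of the strawman (scalars in the low 16 bits of an SSE register, unchanged 2-byte memory layout, vectors of 16-bit elements), here is a small C model. Everything in it is hypothetical: `xmm_model` is just a 16-byte array standing in for a register, the helper names are invented, and zeroing the upper 112 bits for a scalar is an assumption of the sketch, since the proposal does not say what those bits hold. Half values are carried as raw binary16 bit patterns (0x3C00 is 1.0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a 128-bit SSE register: 16 bytes, byte 0 lowest. */
typedef struct { uint8_t bytes[16]; } xmm_model;

/* Scalar: the binary16 bit pattern occupies the low 16 bits.
   Assumption: the remaining bits are zeroed. */
static xmm_model pass_half_scalar(uint16_t half_bits) {
    xmm_model reg;
    memset(&reg, 0, sizeof reg);
    reg.bytes[0] = (uint8_t)(half_bits & 0xFFu);
    reg.bytes[1] = (uint8_t)(half_bits >> 8);
    return reg;
}

/* Callee side: recover the bit pattern from the low 16 bits. */
static uint16_t read_half_scalar(xmm_model reg) {
    return (uint16_t)(reg.bytes[0] | ((unsigned)reg.bytes[1] << 8));
}

/* Vector: eight 16-bit elements packed contiguously, element 0 lowest,
   matching an in-memory array of 2-byte halves on little-endian x86. */
static xmm_model pack_half_vector(const uint16_t elems[8]) {
    assert(elems != NULL);
    xmm_model reg;
    for (int i = 0; i < 8; ++i) {
        reg.bytes[2 * i]     = (uint8_t)(elems[i] & 0xFFu);
        reg.bytes[2 * i + 1] = (uint8_t)(elems[i] >> 8);
    }
    return reg;
}
```

The point of the model is only that a scalar half and element 0 of a half vector land in the same register bits, which is what makes this layout fit the existing ph<>ps conversions.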
>
> Concretely, this means no/little change in IRGen.  As for the SDAG 
> implementation, this is an unusual situation.  I've done some 
> experimentation a long time ago.  We can make the types legal, even
> though no operations are.  It's relatively straightforward to promote
> all operations (and we made sure that worked years ago for AArch64, 
> for the pre-v8.2 mode), but vectors are fun, because of build_vector 
> (where it helps to have the truncating behavior we have for integers, 
> but for fp), extract_vector_elt (where you need the matching extend), 
> and insert_vector_elt (which you have to lower using some movd and/or 
> pinsrw trickery, if you want to avoid the generic slow via-memory 
> fallback).
> Alternatively, we can immediately, in call lowering/register 
> assignment logic (this covers the SDAG cross-BB vreg assignments Craig
> mentions) promote to f32 "via" i16.  I'm afraid I don't remember the 
> arguments one way or the other, but I can dust off my old patches and 
> put them up on phabricator.
>
>
> -Ahmed
>
>>
>> For a point of comparison, I looked at what gcc does. Currently, gcc 
>> only allows _Float16 in C, not C++, and if you try to use it with a 
>> target that doesn’t have native support for half-precision 
>> arithmetic, it tells you “’_Float16’ is not supported on this 
>> target.” That seems preferable to making up an ABI on the fly.
>>
>> I haven’t looked at what happens with clang when compiling for 
>> other targets that don’t have native support for half-precision 
>> arithmetic, but I would imagine that similar problems exist.
>>
>> Thoughts?
>>
>> Thanks,
>> Andy