[llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Thu Jan 18 05:58:29 PST 2018

I would like to revive this thread, as I am struggling a lot with the FP16
implementation in the ARM backend. My implementation in
https://reviews.llvm.org/D38315 is finished (except one case), but a more
robust alternative implementation was suggested. One can indeed argue that my
current implementation is a bit fragile, because it involves manually patching
up the isel dags for a few cases. The suggestion was to look into CCState and
adjusting of the calling convention lowering, inspired by a recent discussion
on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html.
The benefit of this approach is that I would get most of legalization for free, which is the
fragile bit in my approach at the moment.
Anyway, it's become a long story, but (almost) in chronological order this is
what I've considered. Any hints, tips, suggestions welcome.

Goal of my exercise:
--------------------

1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd,
and also some conversion instructions.
2) Don't regress the "storage-only cases", i.e., when we only have the
conversion instructions available.

The Approach:
-------------

1) First, we make f16 legal only when we have FullFP16 support (i.e. when
Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add:

  if (Subtarget->hasFullFP16()) {
    addRegisterClass(MVT::f16, &ARM::HPRRegClass);
  }

2) This is the first implementation decision, I introduce a new register class
HPR, which is an exact copy of the single-precision register class SPR, except
that it holds f16 types:

def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> {
 ...

I think this makes sense because half-types sit in the lower 16 bits of the
S-registers, and the reason to create a separate HPR register class is to avoid
typing of all rules that currently use SPR, because if we would allow both f16
and f32 in SPR, then there's a choice and rules need to be typed.

3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction
VADDH, which has already a tablegen description but not yet a match rules,
simply becomes:

def VADDH  : AHbI<0b11100, 0b11, 0, 0,
                  (outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm),
                  IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm",
                  [(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>,       // <~~~ new match rule using HPR

This is straightforward business so far, but I already learned the hard way
that the conversion are the tricky ones, so I repeat this for an f16 -> f32
upconvert:

def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm),
                 IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",
                 [(set SPR:$Sd, (fpextend HPR:$Sm))]>,      // <~~~~ new match rule using HPR and SPR
                 Requires<[HasFP16]>,
             Sched<[WriteFPCVT]>;

This new rule matches and fp_extend that consumes an HPR and produces a result
in SPR.

And that's essentially it and all the code changes I want to make.  With these
changes, I want to see an fadd with two f16 opernands being codegen'd, and also
I want to keep the existing tests passing of course.

The problem:
------------

The problem appears to be that introducing HPR looks like a good idea, but it
gives problems when FullFP16 is not supported. This is best illustrated with
this existing test which is a simple upconvert of f16 to f32:

define float @test_extend32(half* %addr) {
  %val16 = load half, half* %addr
  %val32 = fpext half %val16 to float
  ret float %val32
}

It should generate this code::

        ldrh    r0, [r0]          ; integer half word load
        vmov    s0, r0
        vcvtb.f32.f16   s0, s0
        vmov    r0, s0
        bx      lr

when we don't have the Armv8.2-A FP16 instructions available, and thus only
have the conversion instructions.

The problem is in the conversion rules, some rewrite rules to be more specific,
and I think this is one of the culprits:

def : Pat<(f16_to_fp GPR:$a),
          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

This rewrite rule is supposed to first move the GPR reg in to a S-registers:

    vmov    s0, r0

and then to the conversion:

    vcvtb.f32.f16   s0, s0

This rewrite rule gets triggered because the ISEL DAG has indeed this funny
node f16_to_fp (which models this register move and the float-convert):

t0: ch = EntryToken
          t2: i32,ch = CopyFromReg t0, Register:i32 %0
        t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32
      t12: f32 = fp16_to_fp t16     <~~~~~~~~ FUNNY NODE HERE
    t7: i32 = bitcast t12
  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
  t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1

And yes, to make it even funnier, this node has an i32 operand, and that's
because we do the half-float load with an integer load instruction.

And after this rewrite, we end up with this DAG:

  t0: ch = EntryToken
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0
        t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE
      t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg
    t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg
  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
  t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1

It goes all wrong from here, because a f16 value is produced and it is not a
legal type etc.

The reason that this happens must be because VCVTBHS that is used in
the f16_to_fp rewrite rule, is specified to consume a f16 value, so the
CopyToReg moves from GPR to an S-Registers:

def : Pat<(f16_to_fp GPR:$a),
          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

Conclusion:
-----------

This approach is not going to work:
- because there's  rewrite rule X which generates instructions Y that use both
  the SPR and HPR descriptions.
- Y can be instruction selected when FullFP16 is enabled, this works and no
  problems (and we don't need X).
- but X is used when FullFP16 is not supported, and generates Y that uses HPRs
  which is not legal.

Introducing a new HPR register class is a possibility though, but then special
care needs to be taken in a few cases. This is the first alternative I started
exploring.

Alternative 1
--------------

This is the approach I took in D38315, which made f16 also legal for the
storage-only case, but which I am ready to abandon:

- Define the HPR register class,
- Use this and make f16 legal also in the storage-only case,
-- This works for the storage-only case, because data processing instruction
   that have f16 operands will get promoted and fp_extend and fp_round nodes will
   be introduced. Then I manually patch up the f16 loads and stores and a few
   other corner cases by manually lowering them. This however is a fragile
   approach, which was rightfully commented by Oli, also after being inspired by a
   recent discussion about some backends doing their own CCState and function
   argument lowering. This is Alternative 3.
-- Another crucial aspect and benefit of this approach, is that this fixes
   AAPCS compliance issues: while half-precision arguments sit in single-precision
   registers, callees should read (and also write) only the lower 16 bits. I got
   this for free by making f16 legal.

Alternative 2
-------------

This alternative aims at getting something ready soon, it relies on
exploring creating one register class and a Clang that generates the
required up and down converts for the storage-only case:

- We have one SPR registerclass that contains both f16 and f32:

  def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> {
  ...

- Then I will have to type all rules that e.g. use SPR:$a and change it in (f32
  SPR:$a):
-- this will be a massive patch,
-- but hopefully will not have a problem in the rewrite rules.

- Then, I also need to modify a clang workaround, which should inserts the proper up and down
  converts for half types that are generated from the usage of _Float16,
  similarly like we already do for __fp16. This workaround can then later be replaced
  by the CCState implementation.

Alternative 3:
----------------

This is the best approach, where I start looking into the AAPCS compliance
issues first:

- Fix the AAPCS compliance issues by implementing our own CCState and function
  call lowering for f16 arguments.
-- With this we achieve that f16 does not need to be a legal type when in fact
   it shouldn't be, so then the "normal" promotions kick in for illegal f16
   operands and thus I don't need  custom lowering for loads/stores.

but after that I am unsure how to do the match rules: one or two register
classes.  It might be that the difference between Alternative 2 and this one,
is the CCState or Clang workaround. I will investigate a bit more.

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 06 December 2017 08:32
To: Friedman, Eli; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.

I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16

and FP16_TO_FP nodes are created to avoid inefficient code generation. I will

double check if I can't achieve the same without using these nodes (because I

really would like to get completely rid of them).

Cheers,

Sjoerd.

>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>>
>> Custom Lowering
>> -------------------------
>>
>> Making f16 legal and not having native load/stores instructions available,
>> (no FullFP16 support) means custom lowering loads/stores:
>> 1) Since we don't have FP16 load/store instructions available, we create
>>    integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>>    because that "models" creating an integer value, which is what we need
>>    to create a "truncating i16" integer load instructions. Instead, of
>> using
>>    FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>>    to stack loads/stores which I don't want.
>> 2) Custom lowering f16 stores is very similar, and creates truncating
>>    half-word integer stores.
>
>Technically, there are no f16 load/store instructions, yes, but we can
>use NEON vdl1 and vst1 to get something roughly equivalent, right?
>
>You probably want to custom-lower BITCAST instructions; the generic
>sequence emitted by the legalizer is pretty inefficient in most cases.
>
>---
>
>Overall, I think your approach makes sense.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180118/930958ce/attachment-0001.html>