[llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Mon Jan 22 05:57:36 PST 2018

Hi Sam,

Thanks for the suggestions! I can confirm that this works, so this is indeed the

most elegant way to separate the FP16 and FullFP16 rules.

This was the last piece of the puzzle. I can now abandon my old approach,

and will go for the CCState and this tablegen separation approach; thus we get

most of the legalization for free and it is more robust than custom lowering

loads/stores (and some other corner cases).

Thanks,

Sjoerd.

________________________________
From: Sam Parker
Sent: 18 January 2018 14:28:45
To: Sjoerd Meijer; Friedman, Eli; Oliver Stannard; llvm-dev at lists.llvm.org
Cc: nd
Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Hi Sjoerd,

For ISel, I think having a separate register class will give you less headache. I wondering if you could get away with not touching the instructions descriptions at all, instead defining external pattens for the FullFP16 case, like so:

def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins SPR:$Sm),
                 IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",
                 []>,
                 Requires<[HasFP16]>,
             Sched<[WriteFPCVT]>;

def : FP16Pat<(f16_to_fp GPR:$a),
                  (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

def : FullFP16Pat<(f32 (fpextend HPR:$Sm)),
                    (VCVTBHS (COPY_TO_REGLASS HPR:$Sm, SPR)>;

I'm not sure of the COPY_TO_REGLASS semantics, but I would (dangerously) assume that it when it comes to copying the values between registers, it will be noticed that HPR and SPR actually alias each other and so no copy is needed. I hope this approach would allow for a clean separation of the FP16 and FullFP16 implementations and negate the need to manually type cast each register access.

cheers,
sam

Sam Parker

Compilation Tools Engineer | Arm

. . . . . . . . . . . . . . . . . . . . . . . . . . .

Arm.com

________________________________
From: Sjoerd Meijer
Sent: 18 January 2018 13:58:29
To: Friedman, Eli; Sjoerd Meijer; Sam Parker; Oliver Stannard; llvm-dev at lists.llvm.org
Cc: nd
Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends

I would like to revive this thread, as I am struggling a lot with the FP16
implementation in the ARM backend. My implementation in
https://reviews.llvm.org/D38315 is finished (except one case), but a more
robust alternative implementation was suggested. One can indeed argue that my
current implementation is a bit fragile, because it involves manually patching
up the isel dags for a few cases. The suggestion was to look into CCState and
adjusting of the calling convention lowering, inspired by a recent discussion
on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html.
The benefit of this approach is that I would get most of legalization for free, which is the
fragile bit in my approach at the moment.
Anyway, it's become a long story, but (almost) in chronological order this is
what I've considered. Any hints, tips, suggestions welcome.

Goal of my exercise:
--------------------

1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd,
and also some conversion instructions.
2) Don't regress the "storage-only cases", i.e., when we only have the
conversion instructions available.

The Approach:
-------------

1) First, we make f16 legal only when we have FullFP16 support (i.e. when
Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add:

  if (Subtarget->hasFullFP16()) {
    addRegisterClass(MVT::f16, &ARM::HPRRegClass);
  }

2) This is the first implementation decision, I introduce a new register class
HPR, which is an exact copy of the single-precision register class SPR, except
that it holds f16 types:

def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> {
 ...

I think this makes sense because half-types sit in the lower 16 bits of the
S-registers, and the reason to create a separate HPR register class is to avoid
typing of all rules that currently use SPR, because if we would allow both f16
and f32 in SPR, then there's a choice and rules need to be typed.

3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction
VADDH, which has already a tablegen description but not yet a match rules,
simply becomes:

def VADDH  : AHbI<0b11100, 0b11, 0, 0,
                  (outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm),
                  IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm",
                  [(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>,       // <~~~ new match rule using HPR

This is straightforward business so far, but I already learned the hard way
that the conversion are the tricky ones, so I repeat this for an f16 -> f32
upconvert:

def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm),
                 IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",
                 [(set SPR:$Sd, (fpextend HPR:$Sm))]>,      // <~~~~ new match rule using HPR and SPR
                 Requires<[HasFP16]>,
             Sched<[WriteFPCVT]>;

This new rule matches and fp_extend that consumes an HPR and produces a result
in SPR.

And that's essentially it and all the code changes I want to make.  With these
changes, I want to see an fadd with two f16 opernands being codegen'd, and also
I want to keep the existing tests passing of course.

The problem:
------------

The problem appears to be that introducing HPR looks like a good idea, but it
gives problems when FullFP16 is not supported. This is best illustrated with
this existing test which is a simple upconvert of f16 to f32:

define float @test_extend32(half* %addr) {
  %val16 = load half, half* %addr
  %val32 = fpext half %val16 to float
  ret float %val32
}

It should generate this code::

        ldrh    r0, [r0]          ; integer half word load
        vmov    s0, r0
        vcvtb.f32.f16   s0, s0
        vmov    r0, s0
        bx      lr

when we don't have the Armv8.2-A FP16 instructions available, and thus only
have the conversion instructions.

The problem is in the conversion rules, some rewrite rules to be more specific,
and I think this is one of the culprits:

def : Pat<(f16_to_fp GPR:$a),
          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

This rewrite rule is supposed to first move the GPR reg in to a S-registers:

    vmov    s0, r0

and then to the conversion:

    vcvtb.f32.f16   s0, s0

This rewrite rule gets triggered because the ISEL DAG has indeed this funny
node f16_to_fp (which models this register move and the float-convert):

t0: ch = EntryToken
          t2: i32,ch = CopyFromReg t0, Register:i32 %0
        t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32
      t12: f32 = fp16_to_fp t16     <~~~~~~~~ FUNNY NODE HERE
    t7: i32 = bitcast t12
  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
  t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1

And yes, to make it even funnier, this node has an i32 operand, and that's
because we do the half-float load with an integer load instruction.

And after this rewrite, we end up with this DAG:

  t0: ch = EntryToken
            t2: i32,ch = CopyFromReg t0, Register:i32 %0
          t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0
        t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE
      t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg
    t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg
  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
  t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1

It goes all wrong from here, because a f16 value is produced and it is not a
legal type etc.

The reason that this happens must be because VCVTBHS that is used in
the f16_to_fp rewrite rule, is specified to consume a f16 value, so the
CopyToReg moves from GPR to an S-Registers:

def : Pat<(f16_to_fp GPR:$a),
          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;

Conclusion:
-----------

This approach is not going to work:
- because there's  rewrite rule X which generates instructions Y that use both
  the SPR and HPR descriptions.
- Y can be instruction selected when FullFP16 is enabled, this works and no
  problems (and we don't need X).
- but X is used when FullFP16 is not supported, and generates Y that uses HPRs
  which is not legal.

Introducing a new HPR register class is a possibility though, but then special
care needs to be taken in a few cases. This is the first alternative I started
exploring.

Alternative 1
--------------

This is the approach I took in D38315, which made f16 also legal for the
storage-only case, but which I am ready to abandon:

- Define the HPR register class,
- Use this and make f16 legal also in the storage-only case,
-- This works for the storage-only case, because data processing instruction
   that have f16 operands will get promoted and fp_extend and fp_round nodes will
   be introduced. Then I manually patch up the f16 loads and stores and a few
   other corner cases by manually lowering them. This however is a fragile
   approach, which was rightfully commented by Oli, also after being inspired by a
   recent discussion about some backends doing their own CCState and function
   argument lowering. This is Alternative 3.
-- Another crucial aspect and benefit of this approach, is that this fixes
   AAPCS compliance issues: while half-precision arguments sit in single-precision
   registers, callees should read (and also write) only the lower 16 bits. I got
   this for free by making f16 legal.

Alternative 2
-------------

This alternative aims at getting something ready soon, it relies on
exploring creating one register class and a Clang that generates the
required up and down converts for the storage-only case:

- We have one SPR registerclass that contains both f16 and f32:

  def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> {
  ...

- Then I will have to type all rules that e.g. use SPR:$a and change it in (f32
  SPR:$a):
-- this will be a massive patch,
-- but hopefully will not have a problem in the rewrite rules.

- Then, I also need to modify a clang workaround, which should inserts the proper up and down
  converts for half types that are generated from the usage of _Float16,
  similarly like we already do for __fp16. This workaround can then later be replaced
  by the CCState implementation.

Alternative 3:
----------------

This is the best approach, where I start looking into the AAPCS compliance
issues first:

- Fix the AAPCS compliance issues by implementing our own CCState and function
  call lowering for f16 arguments.
-- With this we achieve that f16 does not need to be a legal type when in fact
   it shouldn't be, so then the "normal" promotions kick in for illegal f16
   operands and thus I don't need  custom lowering for loads/stores.

but after that I am unsure how to do the match rules: one or two register
classes.  It might be that the difference between Alternative 2 and this one,
is the CCState or Clang workaround. I will investigate a bit more.

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 06 December 2017 08:32
To: Friedman, Eli; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.

I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16

and FP16_TO_FP nodes are created to avoid inefficient code generation. I will

double check if I can't achieve the same without using these nodes (because I

really would like to get completely rid of them).

Cheers,

Sjoerd.

>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>>
>> Custom Lowering
>> -------------------------
>>
>> Making f16 legal and not having native load/stores instructions available,
>> (no FullFP16 support) means custom lowering loads/stores:
>> 1) Since we don't have FP16 load/store instructions available, we create
>>    integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>>    because that "models" creating an integer value, which is what we need
>>    to create a "truncating i16" integer load instructions. Instead, of
>> using
>>    FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>>    to stack loads/stores which I don't want.
>> 2) Custom lowering f16 stores is very similar, and creates truncating
>>    half-word integer stores.
>
>Technically, there are no f16 load/store instructions, yes, but we can
>use NEON vdl1 and vst1 to get something roughly equivalent, right?
>
>You probably want to custom-lower BITCAST instructions; the generic
>sequence emitted by the legalizer is pretty inefficient in most cases.
>
>---
>
>Overall, I think your approach makes sense.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180122/74f004ac/attachment-0001.html>