<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>

</head>

<body dir="ltr">

<div id="divtagdefaultwrapper" style="font-size: 12pt; color: rgb(0, 0, 0); font-family: Calibri, Helvetica, sans-serif, "EmojiFont", "Apple Color Emoji", "Segoe UI Emoji", NotoColorEmoji, "Segoe UI Symbol", "Android Emoji", EmojiSymbols;" dir="ltr">

<p style="margin-top:0;margin-bottom:0"></p>

<div>I would like to revive this thread, as I am struggling a lot with the FP16<br>

implementation in the ARM backend. My implementation in<br>

https://reviews.llvm.org/D38315 is finished (except one case), but a more<br>

robust alternative implementation was suggested. One can indeed argue that my<br>

current implementation is a bit fragile, because it involves manually patching<br>

up the isel dags for a few cases. The suggestion was to look into CCState and<br>

adjusting of the calling convention lowering, inspired by a recent discussion<br>

on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html.<br>

The benefit of this approach is that I would get most of legalization for free, which is the<br>

fragile bit in my approach at the moment.<br>

Anyway, it's become a long story, but (almost) in chronological order this is<br>

what I've considered. Any hints, tips, suggestions welcome.<br>

<br>

Goal of my exercise:<br>

--------------------<br>

<br>

1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd,<br>

and also some conversion instructions.<br>

2) Don't regress the "storage-only cases", i.e., when we only have the<br>

conversion instructions available.<br>

<br>

The Approach:<br>

-------------<br>

<br>

1) First, we make f16 legal only when we have FullFP16 support (i.e. when<br>

Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add:<br>

<br>

  if (Subtarget->hasFullFP16()) {<br>

    addRegisterClass(MVT::f16, &ARM::HPRRegClass);<br>

  }<br>

<br>

2) This is the first implementation decision, I introduce a new register class<br>

HPR, which is an exact copy of the single-precision register class SPR, except<br>

that it holds f16 types:<br>

<br>

def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> {<br>

 ...<br>

<br>

I think this makes sense because half-types sit in the lower 16 bits of the<br>

S-registers, and the reason to create a separate HPR register class is to avoid<br>

typing of all rules that currently use SPR, because if we would allow both f16<br>

and f32 in SPR, then there's a choice and rules need to be typed.<br>

<br>

3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction<br>

VADDH, which has already a tablegen description but not yet a match rules,<br>

simply becomes:<br>

<br>

def VADDH  : AHbI<0b11100, 0b11, 0, 0,<br>

                  (outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm),<br>

                  IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm",<br>

                  [(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>,       // <~~~ new match rule using HPR<br>

<br>

This is straightforward business so far, but I already learned the hard way<br>

that the conversion are the tricky ones, so I repeat this for an f16 -> f32<br>

upconvert:<br>

<br>

def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm),<br>

                 IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",<br>

                 [(set SPR:$Sd, (fpextend HPR:$Sm))]>,      // <~~~~ new match rule using HPR and SPR<br>

                 Requires<[HasFP16]>,<br>

             Sched<[WriteFPCVT]>;<br>

<br>

</div>

<p></p>

<p style="margin-top:0;margin-bottom:0"></p>

<div>This new rule matches and fp_extend that consumes an HPR and produces a result<br>

in SPR.<br>

<br>

And that's essentially it and all the code changes I want to make.  With these<br>

changes, I want to see an fadd with two f16 opernands being codegen'd, and also<br>

I want to keep the existing tests passing of course.<br>

<br>

The problem:<br>

------------<br>

<br>

The problem appears to be that introducing HPR looks like a good idea, but it<br>

gives problems when FullFP16 is not supported. This is best illustrated with<br>

this existing test which is a simple upconvert of f16 to f32:<br>

<br>

define float @test_extend32(half* %addr) {<br>

  %val16 = load half, half* %addr<br>

  %val32 = fpext half %val16 to float<br>

  ret float %val32<br>

}<br>

<br>

It should generate this code::<br>

<br>

        ldrh    r0, [r0]          ; integer half word load<br>

        vmov    s0, r0<br>

        vcvtb.f32.f16   s0, s0<br>

        vmov    r0, s0<br>

        bx      lr<br>

<br>

when we don't have the Armv8.2-A FP16 instructions available, and thus only<br>

have the conversion instructions.<br>

<br>

The problem is in the conversion rules, some rewrite rules to be more specific,<br>

and I think this is one of the culprits:<br>

<br>

def : Pat<(f16_to_fp GPR:$a),<br>

          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;<br>

<br>

This rewrite rule is supposed to first move the GPR reg in to a S-registers:<br>

<br>

    vmov    s0, r0<br>

<br>

and then to the conversion:<br>

<br>

    vcvtb.f32.f16   s0, s0<br>

<br>

This rewrite rule gets triggered because the ISEL DAG has indeed this funny<br>

node f16_to_fp (which models this register move and the float-convert):<br>

<br>

t0: ch = EntryToken<br>

          t2: i32,ch = CopyFromReg t0, Register:i32 %0<br>

        t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32<br>

      t12: f32 = fp16_to_fp t16     <~~~~~~~~ FUNNY NODE HERE<br>

    t7: i32 = bitcast t12<br>

  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7<br>

  t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1<br>

<br>

</div>

<div>And yes, to make it even funnier, this node has an i32 operand, and that's<br>

because we do the half-float load with an integer load instruction.<br>

<br>

And after this rewrite, we end up with this DAG:<br>

<br>

  t0: ch = EntryToken<br>

            t2: i32,ch = CopyFromReg t0, Register:i32 %0<br>

          t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0<br>

        t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE<br>

      t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg<br>

    t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg<br>

  t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7<br>

  t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1<br>

<br>

It goes all wrong from here, because a f16 value is produced and it is not a<br>

legal type etc.<br>

<br>

The reason that this happens must be because VCVTBHS that is used in<br>

the f16_to_fp rewrite rule, is specified to consume a f16 value, so the<br>

CopyToReg moves from GPR to an S-Registers:<br>

<br>

def : Pat<(f16_to_fp GPR:$a),<br>

          (VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;<br>

<br>

Conclusion:<br>

-----------<br>

<br>

This approach is not going to work:<br>

- because there's  rewrite rule X which generates instructions Y that use both<br>

  the SPR and HPR descriptions.<br>

- Y can be instruction selected when FullFP16 is enabled, this works and no<br>

  problems (and we don't need X).<br>

- but X is used when FullFP16 is not supported, and generates Y that uses HPRs<br>

  which is not legal.<br>

<br>

Introducing a new HPR register class is a possibility though, but then special<br>

care needs to be taken in a few cases. This is the first alternative I started<br>

exploring.</div>

<div><br>

</div>

<div>

<div>Alternative 1<br>

--------------<br>

<br>

This is the approach I took in D38315, which made f16 also legal for the<br>

storage-only case, but which I am ready to abandon:<br>

<br>

- Define the HPR register class,<br>

- Use this and make f16 legal also in the storage-only case,<br>

-- This works for the storage-only case, because data processing instruction<br>

   that have f16 operands will get promoted and fp_extend and fp_round nodes will<br>

   be introduced. Then I manually patch up the f16 loads and stores and a few<br>

   other corner cases by manually lowering them. This however is a fragile<br>

   approach, which was rightfully commented by Oli, also after being inspired by a<br>

   recent discussion about some backends doing their own CCState and function<br>

   argument lowering. This is Alternative 3.<br>

-- Another crucial aspect and benefit of this approach, is that this fixes<br>

   AAPCS compliance issues: while half-precision arguments sit in single-precision<br>

   registers, callees should read (and also write) only the lower 16 bits. I got<br>

   this for free by making f16 legal.<br>

<br>

<br>

Alternative 2<br>

-------------<br>

<br>

This alternative aims at getting something ready soon, it relies on<br>

exploring creating one register class and a Clang that generates the<br>

required up and down converts for the storage-only case:<br>

<br>

- We have one SPR registerclass that contains both f16 and f32:<br>

<br>

  def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> {<br>

  ...<br>

<br>

- Then I will have to type all rules that e.g. use SPR:$a and change it in (f32<br>

  SPR:$a):<br>

-- this will be a massive patch,<br>

-- but hopefully will not have a problem in the rewrite rules.<br>

<br>

- Then, I also need to modify a clang workaround, which should inserts the proper up and down<br>

  converts for half types that are generated from the usage of _Float16,<br>

  similarly like we already do for __fp16. This workaround can then later be replaced<br>

  by the CCState implementation.</div>

<div><br>

</div>

<div>

<div>Alternative 3:<br>

----------------<br>

<br>

This is the best approach, where I start looking into the AAPCS compliance<br>

issues first:<br>

<br>

- Fix the AAPCS compliance issues by implementing our own CCState and function<br>

  call lowering for f16 arguments.<br>

-- With this we achieve that f16 does not need to be a legal type when in fact<br>

   it shouldn't be, so then the "normal" promotions kick in for illegal f16<br>

   operands and thus I don't need  custom lowering for loads/stores.<br>

<br>

but after that I am unsure how to do the match rules: one or two register<br>

classes.  It might be that the difference between Alternative 2 and this one,<br>

is the CCState or Clang workaround. I will investigate a bit more.</div>

<br>

</div>

<br>

<br>

</div>

<br>

<p></p>

<br>

<br>

<div style="color: rgb(0, 0, 0);">

<hr style="display:inline-block;width:98%" tabindex="-1">

<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" color="#000000" face="Calibri, sans-serif"><b>From:</b> llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev@lists.llvm.org><br>

<b>Sent:</b> 06 December 2017 08:32<br>

<b>To:</b> Friedman, Eli; llvm-dev@lists.llvm.org<br>

<b>Subject:</b> Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends</font>

<div> </div>

</div>

<div dir="ltr">

<div id="x_divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:rgb(0,0,0); font-family:Calibri,Helvetica,sans-serif,"EmojiFont","Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols">

<p style="margin-top:0; margin-bottom:0">Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.</p>

<p style="margin-top:0; margin-bottom:0">I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16

<br>

</p>

<p style="margin-top:0; margin-bottom:0">and FP16_TO_FP nodes are created to avoid inefficient code generation. I will</p>

<p style="margin-top:0; margin-bottom:0">double check if I can't achieve the same without using these nodes (because I</p>

<p style="margin-top:0; margin-bottom:0">really would like to get completely rid of them).<br>

</p>

<p style="margin-top:0; margin-bottom:0"><br>

</p>

<p style="margin-top:0; margin-bottom:0">Cheers,</p>

<p style="margin-top:0; margin-bottom:0">Sjoerd.<br>

</p>

<p style="margin-top:0; margin-bottom:0"><br>

</p>

<pre>>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:

>><i>

</i>>><i> Custom Lowering

</i>><i>> -------------------------

</i>>><i>

</i>>><i> Making f16 legal and not having native load/stores instructions available,

</i>>><i> (no FullFP16 support) means custom lowering loads/stores:

</i>>><i> 1) Since we don't have FP16 load/store instructions available, we create

</i>>><i>    integer half-word loads. I unfortunately need the FP16_TO_FP node here,

</i>>><i>    because that "models" creating an integer value, which is what we need

</i>>><i>    to create a "truncating i16" integer load instructions. Instead, of 

</i>>><i> using

</i>>><i>    FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation

</i>>><i>    to stack loads/stores which I don't want.

</i>>><i> 2) Custom lowering f16 stores is very similar, and creates truncating

</i>>><i>    half-word integer stores.

</i>>

>Technically, there are no f16 load/store instructions, yes, but we can 

>use NEON vdl1 and vst1 to get something roughly equivalent, right?

>

>You probably want to custom-lower BITCAST instructions; the generic 

>sequence emitted by the legalizer is pretty inefficient in most cases.

>

>---

>

>Overall, I think your approach makes sense.

</pre>

</div>

</div>

</div>

</div>

</body>

</html>