[llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Sjoerd Meijer via llvm-dev
llvm-dev at lists.llvm.org
Thu Jan 18 05:58:29 PST 2018
I would like to revive this thread, as I am struggling a lot with the FP16
implementation in the ARM backend. My implementation in
https://reviews.llvm.org/D38315 is finished (except one case), but a more
robust alternative implementation was suggested. One can indeed argue that my
current implementation is a bit fragile, because it involves manually patching
up the isel dags for a few cases. The suggestion was to look into CCState and
adjusting of the calling convention lowering, inspired by a recent discussion
on the list here: http://lists.llvm.org/pipermail/llvm-dev/2018-January/120098.html.
The benefit of this approach is that I would get most of legalization for free, which is the
fragile bit in my approach at the moment.
Anyway, it's become a long story, but (almost) in chronological order this is
what I've considered. Any hints, tips, suggestions welcome.
Goal of my exercise:
--------------------
1) Add match rules for some Armv8.2-A FP16 instructions, e.g.: fsub, fadd,
and also some conversion instructions.
2) Don't regress the "storage-only cases", i.e., when we only have the
conversion instructions available.
The Approach:
-------------
1) First, we make f16 legal only when we have FullFP16 support (i.e. when
Armv8.2-A FP16 is supported), so in ARMISelLowering.cpp we add:
if (Subtarget->hasFullFP16()) {
addRegisterClass(MVT::f16, &ARM::HPRRegClass);
}
2) This is the first implementation decision, I introduce a new register class
HPR, which is an exact copy of the single-precision register class SPR, except
that it holds f16 types:
def HPR : RegisterClass<"ARM", [f16], 32, (sequence "S%u", 0, 31)> {
...
I think this makes sense because half-types sit in the lower 16 bits of the
S-registers, and the reason to create a separate HPR register class is to avoid
typing of all rules that currently use SPR, because if we would allow both f16
and f32 in SPR, then there's a choice and rules need to be typed.
3) Next, add match rules for some Armv8.2-A FP16 instructions. Instruction
VADDH, which has already a tablegen description but not yet a match rules,
simply becomes:
def VADDH : AHbI<0b11100, 0b11, 0, 0,
(outs HPR:$Sd), (ins HPR:$Sn, HPR:$Sm),
IIC_fpALU16, "vadd", ".f16\t$Sd, $Sn, $Sm",
[(set HPR:$Sd, (fadd HPR:$Sn, HPR:$Sm))]>, // <~~~ new match rule using HPR
This is straightforward business so far, but I already learned the hard way
that the conversion are the tricky ones, so I repeat this for an f16 -> f32
upconvert:
def VCVTBHS: ASuI<0b11101, 0b11, 0b0010, 0b01, 0, (outs SPR:$Sd), (ins HPR:$Sm),
IIC_fpCVTSH, "vcvtb", ".f32.f16\t$Sd, $Sm",
[(set SPR:$Sd, (fpextend HPR:$Sm))]>, // <~~~~ new match rule using HPR and SPR
Requires<[HasFP16]>,
Sched<[WriteFPCVT]>;
This new rule matches and fp_extend that consumes an HPR and produces a result
in SPR.
And that's essentially it and all the code changes I want to make. With these
changes, I want to see an fadd with two f16 opernands being codegen'd, and also
I want to keep the existing tests passing of course.
The problem:
------------
The problem appears to be that introducing HPR looks like a good idea, but it
gives problems when FullFP16 is not supported. This is best illustrated with
this existing test which is a simple upconvert of f16 to f32:
define float @test_extend32(half* %addr) {
%val16 = load half, half* %addr
%val32 = fpext half %val16 to float
ret float %val32
}
It should generate this code::
ldrh r0, [r0] ; integer half word load
vmov s0, r0
vcvtb.f32.f16 s0, s0
vmov r0, s0
bx lr
when we don't have the Armv8.2-A FP16 instructions available, and thus only
have the conversion instructions.
The problem is in the conversion rules, some rewrite rules to be more specific,
and I think this is one of the culprits:
def : Pat<(f16_to_fp GPR:$a),
(VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;
This rewrite rule is supposed to first move the GPR reg in to a S-registers:
vmov s0, r0
and then to the conversion:
vcvtb.f32.f16 s0, s0
This rewrite rule gets triggered because the ISEL DAG has indeed this funny
node f16_to_fp (which models this register move and the float-convert):
t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t16: i32,ch = load<LD2[%addr], zext from i16> t0, t2, undef:i32
t12: f32 = fp16_to_fp t16 <~~~~~~~~ FUNNY NODE HERE
t7: i32 = bitcast t12
t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
t10: ch = ARMISD::RET_FLAG t9, Register:i32 %r0, t9:1
And yes, to make it even funnier, this node has an i32 operand, and that's
because we do the half-float load with an integer load instruction.
And after this rewrite, we end up with this DAG:
t0: ch = EntryToken
t2: i32,ch = CopyFromReg t0, Register:i32 %0
t16: i32,ch = t2LDRHi12<Mem:LD2[%addr]> t2, TargetConstant:i32<0>, TargetConstant:i32<14>, Register:i32 %noreg, t0
t20: f16 = COPY_TO_REGCLASS t16, TargetConstant:i32<1> <~~~~~~~~~~~~~ PROBLEM HERE
t12: f32 = VCVTBHS t20, TargetConstant:i32<14>, Register:i32 %noreg
t7: i32 = VMOVRS t12, TargetConstant:i32<14>, Register:i32 %noreg
t9: ch,glue = CopyToReg t0, Register:i32 %r0, t7
t10: ch = tBX_RET TargetConstant:i32<14>, Register:i32 %noreg, Register:i32 %r0, t9, t9:1
It goes all wrong from here, because a f16 value is produced and it is not a
legal type etc.
The reason that this happens must be because VCVTBHS that is used in
the f16_to_fp rewrite rule, is specified to consume a f16 value, so the
CopyToReg moves from GPR to an S-Registers:
def : Pat<(f16_to_fp GPR:$a),
(VCVTBHS (COPY_TO_REGCLASS GPR:$a, SPR))>;
Conclusion:
-----------
This approach is not going to work:
- because there's rewrite rule X which generates instructions Y that use both
the SPR and HPR descriptions.
- Y can be instruction selected when FullFP16 is enabled, this works and no
problems (and we don't need X).
- but X is used when FullFP16 is not supported, and generates Y that uses HPRs
which is not legal.
Introducing a new HPR register class is a possibility though, but then special
care needs to be taken in a few cases. This is the first alternative I started
exploring.
Alternative 1
--------------
This is the approach I took in D38315, which made f16 also legal for the
storage-only case, but which I am ready to abandon:
- Define the HPR register class,
- Use this and make f16 legal also in the storage-only case,
-- This works for the storage-only case, because data processing instruction
that have f16 operands will get promoted and fp_extend and fp_round nodes will
be introduced. Then I manually patch up the f16 loads and stores and a few
other corner cases by manually lowering them. This however is a fragile
approach, which was rightfully commented by Oli, also after being inspired by a
recent discussion about some backends doing their own CCState and function
argument lowering. This is Alternative 3.
-- Another crucial aspect and benefit of this approach, is that this fixes
AAPCS compliance issues: while half-precision arguments sit in single-precision
registers, callees should read (and also write) only the lower 16 bits. I got
this for free by making f16 legal.
Alternative 2
-------------
This alternative aims at getting something ready soon, it relies on
exploring creating one register class and a Clang that generates the
required up and down converts for the storage-only case:
- We have one SPR registerclass that contains both f16 and f32:
def SPR : RegisterClass<"ARM", [f16, f32], 32, (sequence "S%u", 0, 31)> {
...
- Then I will have to type all rules that e.g. use SPR:$a and change it in (f32
SPR:$a):
-- this will be a massive patch,
-- but hopefully will not have a problem in the rewrite rules.
- Then, I also need to modify a clang workaround, which should inserts the proper up and down
converts for half types that are generated from the usage of _Float16,
similarly like we already do for __fp16. This workaround can then later be replaced
by the CCState implementation.
Alternative 3:
----------------
This is the best approach, where I start looking into the AAPCS compliance
issues first:
- Fix the AAPCS compliance issues by implementing our own CCState and function
call lowering for f16 arguments.
-- With this we achieve that f16 does not need to be a legal type when in fact
it shouldn't be, so then the "normal" promotions kick in for illegal f16
operands and thus I don't need custom lowering for loads/stores.
but after that I am unsure how to do the match rules: one or two register
classes. It might be that the difference between Alternative 2 and this one,
is the CCState or Clang workaround. I will investigate a bit more.
________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Sjoerd Meijer via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 06 December 2017 08:32
To: Friedman, Eli; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Half-Precision Support in the Arm Backends
Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.
I am custom lowering the bitcasts, that's now the only place where FP_TO_FP16
and FP16_TO_FP nodes are created to avoid inefficient code generation. I will
double check if I can't achieve the same without using these nodes (because I
really would like to get completely rid of them).
Cheers,
Sjoerd.
>On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>>
>> Custom Lowering
>> -------------------------
>>
>> Making f16 legal and not having native load/stores instructions available,
>> (no FullFP16 support) means custom lowering loads/stores:
>> 1) Since we don't have FP16 load/store instructions available, we create
>> integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>> because that "models" creating an integer value, which is what we need
>> to create a "truncating i16" integer load instructions. Instead, of
>> using
>> FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>> to stack loads/stores which I don't want.
>> 2) Custom lowering f16 stores is very similar, and creates truncating
>> half-word integer stores.
>
>Technically, there are no f16 load/store instructions, yes, but we can
>use NEON vdl1 and vst1 to get something roughly equivalent, right?
>
>You probably want to custom-lower BITCAST instructions; the generic
>sequence emitted by the legalizer is pretty inefficient in most cases.
>
>---
>
>Overall, I think your approach makes sense.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180118/930958ce/attachment-0001.html>
More information about the llvm-dev
mailing list