[llvm-dev] TypePromoteFloat loses intermediate rounding operations

Tue Dec 10 15:34:58 PST 2019

Thanks Eli.

I forgot to bring up the strict FP questions which I was working on when I
found this. If we're in a strict FP function, do the fp_to_f16/f16_to_fp
emitted by promoting load/store/bitcast need to be strict versions of
fp_to_f16/f16_to_fp. And if so where do we get the chain, especially for
the bitcast case which isn't a chained node.

~Craig

On Tue, Dec 10, 2019 at 3:18 PM Eli Friedman <efriedma at quicinc.com> wrote:

> We could fix the legalization without touching the other handling by just
> inserting an fp_to_f16/f16_to_fp pair after each arithmetic operation that
> requires it.  One advantage to that approach is that it’s easier to take
> the obvious shortcut for fast-math.
>
>
>
> The “promote-to-larger” strategy doesn’t really round correctly in
> general, but it works for specific pairs of operator/operation.  For
> example, for f16 fadd in the default rounding mode, “a+b” is exactly
> equivalent to “(_Float16)((float)a+(float)b)”.  Not sure if this works for
> all f16 operations, and not sure how much we care if it doesn’t.
>
> There aren’t any calling convention implications here for ARM targets; not
> sure about other targets.  On 32-bit ARM, clang explicitly coerces half
> values to a legal type.  And half is always legal on AArch64 (unless you
> force soft-float, but at that point we don’t care).
>
>
>
> -Eli
>
>
>
> *From:* Craig Topper <craig.topper at gmail.com>
> *Sent:* Tuesday, December 10, 2019 12:18 PM
> *To:* llvm-dev <llvm-dev at lists.llvm.org>; Eli Friedman <
> efriedma at quicinc.com>; Tim Northover <t.p.northover at gmail.com>
> *Subject:* [EXT] TypePromoteFloat loses intermediate rounding operations
>
>
>
> For the following C code
>
>
>
> __fp16 x, y, z, w;
>
>
>
> void foo() {
>
> x = y + z;
>
> x = x + w;
>
> }
>
>
>
> clang produces IR that extends each operand to float and then truncates to
> half before assigning to x. Like this
>
>
>
> define dso_local void @foo() #0 !dbg !18 {
>
> %1 = load half, half* @y, align 2, !dbg !21
>
> %2 = fpext half %1 to float, !dbg !21
>
> %3 = load half, half* @z, align 2, !dbg !22
>
> %4 = fpext half %3 to float, !dbg !22
>
> %5 = fadd float %2, %4, !dbg !23
>
> %6 = fptrunc float %5 to half, !dbg !21
>
> store half %6, half* @x, align 2, !dbg !24
>
> %7 = load half, half* @x, align 2, !dbg !25
>
> %8 = fpext half %7 to float, !dbg !25
>
> %9 = load half, half* @w, align 2, !dbg !26
>
> %10 = fpext half %9 to float, !dbg !26
>
> %11 = fadd float %8, %10, !dbg !27
>
> %12 = fptrunc float %11 to half, !dbg !25
>
> store half %12, half* @x, align 2, !dbg !28
>
> ret void, !dbg !29
>
> }
>
>
>
> InstCombine then comes along and gets rid of all of the fpext and fptrunc.
> Leaving
>
>
>
> define dso_local void @foo() local_unnamed_addr #0 !dbg !18 {
>
> %1 = load half, half* @y, align 2, !dbg !21, !tbaa !22
>
> %2 = load half, half* @z, align 2, !dbg !26, !tbaa !22
>
> %3 = fadd half %1, %2, !dbg !21
>
> %4 = load half, half* @w, align 2, !dbg !27, !tbaa !22
>
> %5 = fadd half %3, %4, !dbg !28
>
> store half %5, half* @x, align 2, !dbg !29, !tbaa !22
>
> ret void, !dbg !30
>
> }
>
>
>
>
>
> Then SelectionDAG type legalization comes along and creates this as the
> final assembly
>
>
>
> pushq %rax
>
> .cfi_def_cfa_offset 16
>
> movzwl y(%rip), %edi
>
> callq __gnu_h2f_ieee
>
> movss %xmm0, 4(%rsp) # 4-byte Spill
>
> movzwl z(%rip), %edi
>
> callq __gnu_h2f_ieee
>
> addss 4(%rsp), %xmm0 # 4-byte Folded Reload
>
> movss %xmm0, 4(%rsp) # 4-byte Spill
>
> movzwl w(%rip), %edi
>
> callq __gnu_h2f_ieee
>
> addss 4(%rsp), %xmm0 # 4-byte Folded Reload
>
> callq __gnu_f2h_ieee
>
> movw %ax, x(%rip)
>
> popq %rax
>
>
>
>
>
> I assumed SelectionDAG should produce something equivalent to the original
> clang code with 4 total extends to f32 and 2 truncates. Instead we got 3
> extends and 1 truncate. So we lost the intermediate rounding between the 2
> adds that was in the original clang IR.
>
>
>
> I believe this occurs because the TypePromoteFloat legalization converts
> all arithmetic operations to their f32 equivalents, but does not place
> conversions to/from half around them. Instead fp_to_f16 and f16_to_fp nodes
> are only generated at loads, stores, bitcasts, and a probably a few other
> places. Basically only the place where the 16-bit size is needed to make
> the operation possible. Basically what we have is a very similar
> implementation to promoting integers, but that doesn't work for FP because
> we lose out on intermediate rounding.
>
>
>
> It seems like what we should instead do is insert fp16_to_fp and
> fp_to_fp16 in the libcall and arithmetic op handling. And use i16 to
> connect the legalized pieces together. Similar to how we use integer types
> when softening operations. I'm not sure if there would still be rounding
> issues with this, but it seems closer to matching the IR.
>
>
>
> Unfortunately, I think this would have the side effect of changing half
> arguments and return types to i16 instead of float, which would be an ABI
> change. At least on some targets __fp16 can't be used as an argument or
> return type so maybe that won't be a real problem?
>
>
>
> Anyone else have any thoughts on this?
>
>
>
> ~Craig
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20191210/38a1f013/attachment.html>