[llvm-bugs] [Bug 27222] New: Inefficient code for fp16 vectors

via llvm-bugs llvm-bugs at lists.llvm.org
Tue Apr 5 11:25:29 PDT 2016


https://llvm.org/bugs/show_bug.cgi?id=27222

            Bug ID: 27222
           Summary: Inefficient code for fp16 vectors
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: pirama at google.com
                CC: llvm-bugs at lists.llvm.org, srhines at google.com
    Classification: Unclassified

We generate inefficient code for half vectors on some architectures.  Consider
the following IR:

define void @add_h(<4 x half>* %a, <4 x half>* %b) {
entry:
  %x = load <4 x half>, <4 x half>* %a, align 8
  %y = load <4 x half>, <4 x half>* %b, align 8
  %0 = fadd <4 x half> %x, %y
  store <4 x half> %0, <4 x half>* %a
  ret void
}

LLVM currently splits and scalarizes these vectors.  In other words, it splits
the <4 x half> into four individual half values and operates on them one at a
time.  This prevents the backend from selecting vector load and vector
conversion instructions.  The generated code consists of repeated 16-bit loads,
conversions to fp32, scalar additions, conversions back to fp16, and 16-bit
stores.
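
For illustration, here is a rough standalone IR sketch of what each lane
effectively becomes once the legalizer scalarizes the vector and promotes half
to float.  The actual transformation happens on SelectionDAG nodes, not on IR,
and the function and value names below are made up for the example:

define void @add_h_lane(half* %pa, half* %pb) {
entry:
  %x = load half, half* %pa, align 2
  %y = load half, half* %pb, align 2
  ; half is promoted: extend to f32, operate, truncate back to f16
  %xf = fpext half %x to float
  %yf = fpext half %y to float
  %sum = fadd float %xf, %yf
  %res = fptrunc float %sum to half
  store half %res, half* %pa, align 2
  ret void
}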

Here's the code generated for ARM32:
        ldrh    r4, [r1, #6]
        ldrh    r3, [r0, #6]
        ldrh    r12, [r1]
        ldrh    r2, [r0, #4]
        ldrh    lr, [r0, #2]
        vmov    s0, r4
        ldrh    r4, [r1, #2]
        ldrh    r1, [r1, #4]
        vmov    s2, r3
        ldrh    r3, [r0]
        vmov    s6, r2
        vmov    s10, lr
        vmov    s12, r12
        vcvtb.f32.f16   s0, s0
        vcvtb.f32.f16   s2, s2
        vadd.f32        s0, s2, s0
        vmov    s4, r1
        vmov    s8, r4
        vmov    s14, r3
        vcvtb.f32.f16   s4, s4
        vcvtb.f32.f16   s6, s6
        vcvtb.f32.f16   s2, s8
        vcvtb.f32.f16   s8, s10
        vcvtb.f32.f16   s10, s12
        vcvtb.f32.f16   s12, s14
        vcvtb.f16.f32   s0, s0
        vadd.f32        s4, s6, s4
        vadd.f32        s2, s8, s2
        vadd.f32        s6, s12, s10
        vmov    r1, s0
        vcvtb.f16.f32   s4, s4
        vcvtb.f16.f32   s0, s2
        vcvtb.f16.f32   s2, s6
        strh    r1, [r0, #6]
        vmov    r1, s4
        strh    r1, [r0, #4]
        vmov    r1, s0
        strh    r1, [r0, #2]
        vmov    r1, s2
        strh    r1, [r0]

In comparison, the same IR is compiled to the following for AArch64:
        ldr             d0, [x1]
        ldr             d1, [x0]
        fcvtl   v0.4s, v0.4h
        fcvtl   v1.4s, v1.4h
        fadd    v0.4s, v1.4s, v0.4s
        fcvtn   v0.4h, v0.4s
        str             d0, [x0]
        ret
.Lfunc_end0:

This happens on architectures whose LLVM backends don't natively support
half (such as x86, x86_64, and ARM32).
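
For comparison, the efficient AArch64 sequence above corresponds to the
widened form below, written out as IR.  This is only a sketch of the lowering
one would like to see on the affected targets (the function name is made up,
and whether the ARM32/x86 backends handle IR written this way any better is
exactly the question at hand):

define void @add_h_widened(<4 x half>* %a, <4 x half>* %b) {
entry:
  %x = load <4 x half>, <4 x half>* %a, align 8
  %y = load <4 x half>, <4 x half>* %b, align 8
  ; whole-vector extend, add, truncate -- matches fcvtl/fadd/fcvtn above
  %xf = fpext <4 x half> %x to <4 x float>
  %yf = fpext <4 x half> %y to <4 x float>
  %sum = fadd <4 x float> %xf, %yf
  %res = fptrunc <4 x float> %sum to <4 x half>
  store <4 x half> %res, <4 x half>* %a, align 8
  ret void
}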
