<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - Inefficient code for fp16 vectors"
href="https://llvm.org/bugs/show_bug.cgi?id=27222">27222</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Inefficient code for fp16 vectors
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>pirama@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org, srhines@google.com
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>We generate inefficient code for half vectors on some architectures. Consider
the following IR:

define void @add_h(<4 x half>* %a, <4 x half>* %b) {
entry:
  %x = load <4 x half>, <4 x half>* %a, align 8
  %y = load <4 x half>, <4 x half>* %b, align 8
  %0 = fadd <4 x half> %x, %y
  store <4 x half> %0, <4 x half>* %a
  ret void
}

LLVM currently splits and scalarizes these vectors. In other words, it breaks
the <4 x half> into four individual half values and operates on each of them
separately. This prevents the backend from selecting vector load and vector
conversion instructions. The generated code has repeated 16-bit loads,
conversions to fp32, additions, conversions back to fp16, and 16-bit stores.
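
Conceptually, the scalarized form corresponds to IR like the following (a
hand-written illustration of the per-lane splitting, not actual compiler
output; only lane 0 is shown):

  %x0 = extractelement <4 x half> %x, i32 0   ; lane 0 of %x
  %y0 = extractelement <4 x half> %y, i32 0   ; lane 0 of %y
  %x0.f32 = fpext half %x0 to float           ; widen fp16 -> fp32
  %y0.f32 = fpext half %y0 to float
  %sum0 = fadd float %x0.f32, %y0.f32         ; scalar fp32 add
  %r0 = fptrunc float %sum0 to half           ; narrow fp32 -> fp16
  ; ...the same sequence is repeated for lanes 1, 2 and 3...
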
Here's the code generated for ARM32:
ldrh r4, [r1, #6]
ldrh r3, [r0, #6]
ldrh r12, [r1]
ldrh r2, [r0, #4]
ldrh lr, [r0, #2]
vmov s0, r4
ldrh r4, [r1, #2]
ldrh r1, [r1, #4]
vmov s2, r3
ldrh r3, [r0]
vmov s6, r2
vmov s10, lr
vmov s12, r12
vcvtb.f32.f16 s0, s0
vcvtb.f32.f16 s2, s2
vadd.f32 s0, s2, s0
vmov s4, r1
vmov s8, r4
vmov s14, r3
vcvtb.f32.f16 s4, s4
vcvtb.f32.f16 s6, s6
vcvtb.f32.f16 s2, s8
vcvtb.f32.f16 s8, s10
vcvtb.f32.f16 s10, s12
vcvtb.f32.f16 s12, s14
vcvtb.f16.f32 s0, s0
vadd.f32 s4, s6, s4
vadd.f32 s2, s8, s2
vadd.f32 s6, s12, s10
vmov r1, s0
vcvtb.f16.f32 s4, s4
vcvtb.f16.f32 s0, s2
vcvtb.f16.f32 s2, s6
strh r1, [r0, #6]
vmov r1, s4
strh r1, [r0, #4]
vmov r1, s0
strh r1, [r0, #2]
vmov r1, s2
strh r1, [r0]

In comparison, the same IR compiles to the following on AArch64:
ldr d0, [x1]
ldr d1, [x0]
fcvtl v0.4s, v0.4h
fcvtl v1.4s, v1.4h
fadd v0.4s, v1.4s, v0.4s
fcvtn v0.4h, v0.4s
str d0, [x0]
ret
.Lfunc_end0:

This happens on architectures whose LLVM backends don't natively support half
(such as x86, x86_64, and ARM32).</pre>
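<p>The two listings above can be regenerated with llc along these lines (a
sketch: fp16-vec.ll is a placeholder file name holding the IR above, and the
exact -mattr needed for the vcvtb forms may differ by subtarget):</p>
<pre># ARM32: +fp16 enables the half-precision conversion instructions (vcvtb)
llc -O2 -mtriple=armv7-linux-gnueabihf -mattr=+fp16 fp16-vec.ll -o -

# AArch64: fcvtl/fcvtn are available by default
llc -O2 -mtriple=aarch64-linux-gnu fp16-vec.ll -o -</pre>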
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>