<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Poor code generation for NEON unaligned loads/stores"
href="https://bugs.llvm.org/show_bug.cgi?id=39414">39414</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Poor code generation for NEON unaligned loads/stores
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>7.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Windows NT
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: ARM
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>fabiang@radgametools.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Sample repro on Godbolt:
<a href="https://godbolt.org/z/itWjaD">https://godbolt.org/z/itWjaD</a>
Here's the (C++) code verbatim:
----
#include <arm_neon.h>
typedef uint32_t U32unalign __attribute__((aligned(1)));
void f_unalign(void *p, uint8x8_t v)
{
vst1_lane_u32((U32unalign *) p, vreinterpret_u32_u8(v), 0);
}
void f_align(void *p, uint8x8_t v)
{
vst1_lane_u32((uint32_t *) p, vreinterpret_u32_u8(v), 0);
}
uint8x8_t g_unalign(const void *p)
{
return vld1_dup_u32((U32unalign *)p);
}
uint8x8_t g_align(const void *p)
{
return vld1_dup_u32((uint32_t *)p);
}
----
clang -target armv7a-none-eabi -O2 -S produces:
----
f_unalign(void*, __simd64_uint8_t):
vmov d16, r2, r3
vmov.32 r1, d16[0]
strb r1, [r0]
lsr r2, r1, #24
strb r2, [r0, #3]
lsr r2, r1, #16
lsr r1, r1, #8
strb r2, [r0, #2]
strb r1, [r0, #1]
bx lr
f_align(void*, __simd64_uint8_t):
vmov d16, r2, r3
vst1.32 {d16[0]}, [r0:32]
bx lr
g_unalign(void const*):
ldrb r1, [r0]
ldrb r2, [r0, #1]
ldrb r3, [r0, #2]
ldrb r0, [r0, #3]
orr r1, r1, r2, lsl #8
orr r0, r3, r0, lsl #8
orr r0, r1, r0, lsl #16
vdup.32 d16, r0
vmov r0, r1, d16
bx lr
g_align(void const*):
vld1.32 {d16[]}, [r0:32]
vmov r0, r1, d16
bx lr
----
The code I would expect to see for the unaligned variants is:
----
f_unalign(void*, __simd64_uint8_t):
vmov d16, r2, r3
vst1.32 {d16[0]}, [r0] ; same as f_align except for :32 alignment specifier
bx lr
g_unalign(void const*):
vld1.32 {d16[]}, [r0] ; same here
vmov r0, r1, d16
bx lr
----
This is especially awkward with the NEON intrinsics because the single-lane
variants of vst1.32 (or vst1.16) serve two purposes:
- Storing a single 32-bit (or 16-bit) lane, which the intrinsics express directly.
- Storing 32 (or 16) contiguous bits of a vector with narrower elements, which
the intrinsics can't really express without awkward workarounds like the
"U32unalign" construction above.</pre>
</div>
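<div>
<pre>For comparison, the usual portable way to say "access 32 contiguous bits with no
alignment guarantee" in C++ is a memcpy of a scalar. The following is only a
minimal scalar sketch of that idea: the helper names store_u32_unaligned and
load_u32_unaligned are hypothetical and not part of the original repro, and
whether clang folds this into a single vst1.32/vld1.32 when combined with
vget_lane_u32/vdup_n_u32 on armv7a was not checked here.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original report): store 32 bits to an
// address that carries no alignment guarantee. memcpy makes no alignment
// assumption about either pointer.
inline void store_u32_unaligned(void *p, std::uint32_t v)
{
    std::memcpy(p, &v, sizeof v);
}

// Hypothetical helper (not from the original report): load 32 bits from an
// address that carries no alignment guarantee.
inline std::uint32_t load_u32_unaligned(const void *p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}
```

On the NEON side one would presumably pass vget_lane_u32(vreinterpret_u32_u8(v), 0)
to the store helper; that step is an assumption and is left out so the sketch
builds without arm_neon.h.</pre>
</div>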
</p>
</body>
</html>