<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - ARMv7a: inefficient code generated from memcpy + bswap builtin"
href="https://bugs.llvm.org/show_bug.cgi?id=51621">51621</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>ARMv7a: inefficient code generated from memcpy + bswap builtin
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: ARM
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sami.liedes@iki.fi
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org, smithp352@googlemail.com, Ties.Stuij@arm.com
</td>
</tr></table>
<p>
<div>
<pre>Also applies to: armv7-a clang 11.0.1
Godbolt: <a href="https://godbolt.org/z/9f1GKf5qe">https://godbolt.org/z/9f1GKf5qe</a>
Consider this code:
----------
#include <stdint.h>

uint32_t read_unaligned_memcpy_bswap_32(const uint8_t *buf, int offset) {
    uint32_t val;
    __builtin_memcpy(&val, buf+offset, 4);
    return __builtin_bswap32(val);
}

uint32_t read_unaligned_shift_add_32(const uint8_t *buf, int offset) {
    return (((uint32_t)buf[offset]) << 24) +
           (((uint32_t)buf[offset+1]) << 16) +
           (((uint32_t)buf[offset+2]) << 8) +
           (((uint32_t)buf[offset+3]) << 0);
}
----------
On many architectures, e.g. ARMv8, these produce identical and efficient code.
On ARMv7a, the __builtin_bswap32 version produces what looks like *worse* code
than the shift+add version (although I admit I don't know the architecture
well enough to be sure, at the least the result has 14 instructions as
opposed to 8):
read_unaligned_memcpy_bswap_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        orr     r1, r1, r2, lsl #8
        orr     r0, r3, r0, lsl #8
        mov     r2, #16711680
        orr     r0, r1, r0, lsl #16
        mov     r1, #65280
        and     r1, r1, r0, lsr #8
        and     r2, r2, r0, lsl #8
        orr     r1, r1, r0, lsr #24
        orr     r0, r2, r0, lsl #24
        orr     r0, r0, r1
        bx      lr

read_unaligned_shift_add_32(unsigned char const*, int):
        ldrb    r1, [r0, r1]!
        ldrb    r2, [r0, #1]
        ldrb    r3, [r0, #2]
        ldrb    r0, [r0, #3]
        lsl     r2, r2, #16
        orr     r1, r2, r1, lsl #24
        orr     r1, r1, r3, lsl #8
        orr     r0, r1, r0
        bx      lr
The same applies to the 16-bit version (see the Godbolt link for the code),
but the difference is much less dramatic (also, for the 16-bit bswap version,
trunk generates one more instruction than 11.0.1; I don't know how significant
that is).</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>