[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Sjoerd Meijer via llvm-dev
llvm-dev at lists.llvm.org
Mon Jul 20 02:15:08 PDT 2020
Hello Daniel,
LLVM and GCC's optimisation levels are not really equivalent. In Clang, -Os makes a performance/code-size trade-off; in GCC, -Os minimises code size, which is closer to Clang's -Oz. I haven't looked into the details yet, but does changing -Os to -Oz in the godbolt link give the codegen you're looking for?
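For what it's worth, the flag pairing that should be closest to apples-to-apples looks like this (source file name assumed; these are just the commands from the godbolt link with Clang's -Os swapped for -Oz):

```shell
# Clang's -Oz minimises code size, like GCC's -Os;
# Clang's -Os still trades some size for performance.
clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer -S memcpy_alt1.c
arm-none-eabi-gcc -march=armv6-m -Os -S memcpy_alt1.c
```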
Cheers,
Sjoerd.
________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel Way via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 20 July 2020 06:54
To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [ARM] Should Use Load and Store with Register Offset
Hello LLVM Community (specifically anyone working with ARM Cortex-M),
While trying to compile the Newlib C library, I found that Clang 10 was generating slightly larger binaries than the libc from the prebuilt gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy, strcpy, etc.) and noticed that LLVM tends not to generate load/store instructions with a register offset (e.g. the ldr Rd, [Rn, Rm] form), preferring the immediate-offset form instead.
When copying a contiguous sequence of bytes, this results in additional instructions to advance the base addresses. https://godbolt.org/z/T1xhae
void* memcpy_alt1(void* dst, const void* src, size_t len) {
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        ((char*)dst)[i] = ((const char*)src)[i];
    return save;
}
clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
memcpy_alt1:
        push    {r4, lr}
        cmp     r2, #0
        beq     .LBB0_3
        mov     r3, r0
.LBB0_2:
        ldrb    r4, [r1]
        strb    r4, [r3]
        adds    r1, r1, #1
        adds    r3, r3, #1
        subs    r2, r2, #1
        bne     .LBB0_2
.LBB0_3:
        pop     {r4, pc}
arm-none-eabi-gcc -march=armv6-m -Os
memcpy_alt1:
        movs    r3, #0
        push    {r4, lr}
.L2:
        cmp     r3, r2
        bne     .L3
        pop     {r4, pc}
.L3:
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .L2
Because this code appears in a loop that could be copying hundreds of bytes, I want to add an optimization that prioritizes load/store instructions with register offsets when the same offset register is used multiple times. I have not worked on LLVM before, so I'd like advice about where to start.
* The generated code is correct, just sub-optimal, so is it appropriate to submit a bug report?
* Is anyone already tackling this change, or is there someone with more experience interested in collaborating?
* Is this optimization better performed early, during instruction selection, or late, in C++ (i.e. ARMLoadStoreOptimizer.cpp)?
* What is the potential to harm other parts of the codegen, specifically for other Arm targets? I'm working with Armv6-M, but Armv7-M offers base-register updating in a single instruction. I don't want to break other useful optimizations.
So far, I am reading through the LLVM documentation to see where a change could be applied. I have also:
* Compiled with -S -emit-llvm (see Godbolt link)
There is an identifiable pattern where a getelementptr instruction is followed by a load or store. When multiple getelementptr instructions use the same virtual register as the offset, maybe this should match a tLDRr or tSTRr.
* Ran llc with --print-machineinstrs
It appears that tLDRBi and tSTRBi are selected very early and never replaced by the equivalent tLDRBr and tSTRBr instructions.
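For anyone wanting to reproduce the observation, the commands look roughly like this (file names illustrative):

```shell
# Emit LLVM IR to inspect the getelementptr + load/store pattern
clang --target=armv6m-none-eabi -Os -S -emit-llvm memcpy_alt1.c -o memcpy_alt1.ll

# Watch instruction selection pick tLDRBi/tSTRBi, and check whether any
# later machine pass (e.g. ARMLoadStoreOptimizer) rewrites them
llc -mtriple=armv6m-none-eabi -print-machineinstrs memcpy_alt1.ll -o /dev/null
```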
Thank you,
Daniel Way