[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Daniel Way via llvm-dev
llvm-dev at lists.llvm.org
Sun Jul 19 22:54:53 PDT 2020
Hello LLVM Community (specifically anyone working with ARM Cortex-M),
While trying to compile the Newlib C library, I found that Clang 10 was
generating slightly larger binaries than the libc from the prebuilt
gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy,
strcpy, etc.) and noticed that LLVM does not tend to generate load/store
instructions with a register offset (e.g. ldr Rd, [Rn, Rm] form) and
instead prefers the immediate offset form.
When copying a contiguous sequence of bytes, this results in additional
instructions to update the base addresses: each pointer is incremented
separately instead of sharing a single index register.
https://godbolt.org/z/T1xhae
#include <stddef.h>

void* memcpy_alt1(void* dst, const void* src, size_t len) {
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        ((char*)dst)[i] = ((const char*)src)[i];
    return save;
}
clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
memcpy_alt1:
    push {r4, lr}
    cmp r2, #0
    beq .LBB0_3
    mov r3, r0
.LBB0_2:
    ldrb r4, [r1]
    strb r4, [r3]
    adds r1, r1, #1
    adds r3, r3, #1
    subs r2, r2, #1
    bne .LBB0_2
.LBB0_3:
    pop {r4, pc}
arm-none-eabi-gcc -march=armv6-m -Os
memcpy_alt1:
    movs r3, #0
    push {r4, lr}
.L2:
    cmp r3, r2
    bne .L3
    pop {r4, pc}
.L3:
    ldrb r4, [r1, r3]
    strb r4, [r0, r3]
    adds r3, r3, #1
    b .L2
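To make the difference between the two listings concrete, the C sketch
below spells out each loop shape by hand. The function names are
hypothetical and this is only an illustration: the optimizer may
canonicalize one form into the other, so writing the source this way is
not guaranteed to change the code clang emits.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of clang's loop: both pointers are advanced on every iteration,
 * mirroring "ldrb r4, [r1]; strb r4, [r3]; adds r1, r1, #1; adds r3, r3, #1"
 * plus a separate down-counter ("subs r2, r2, #1; bne"). */
void *copy_pointer_bump(void *dst, const void *src, size_t len) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(len == 0 || (dst != NULL && src != NULL));
    while (len--) {
        *d = *s;   /* immediate offset #0 from the current pointer */
        ++d;       /* one increment per base pointer */
        ++s;
    }
    return dst;
}

/* Shape of GCC's loop: the bases never change and one index register is
 * shared by the load and the store ("ldrb r4, [r1, r3]; strb r4, [r0, r3]"). */
void *copy_shared_index(void *dst, const void *src, size_t len) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(len == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < len; ++i)
        d[i] = s[i];   /* base + index; a single increment of i per iteration */
    return dst;
}
```

Both functions copy the same bytes; they differ only in how each address
is formed, which is the choice the proposed optimization would make.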
Because this code appears in a loop that could be copying hundreds of
bytes, I would like to add an optimization that prefers load/store
instructions with a register offset when the same offset is used by
multiple loads or stores.
I have not worked on LLVM before, so I'd like advice about where to start.
- The generated code is correct, just sub-optimal, so is it appropriate
to submit a bug report?
- Is anyone already tackling this change or is there someone with more
experience interested in collaborating?
- Is this optimization better performed early, during instruction
selection, or late, in a C++ pass such as ARMLoadStoreOptimizer.cpp?
- What is the potential for harm to other parts of the code generator,
specifically for other ARM targets? I'm working with ARMv6-M, but ARMv7-M
can update the base register in the same instruction as the load or
store (post-indexed addressing), and I don't want to break that or other
useful optimizations.
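One way to frame the question about other ARM targets is a rough
per-iteration cost model. The C sketch below is hypothetical and
simplified (the enum, the function names, and the instruction counts are
assumptions, not anything in LLVM): it counts only loop-overhead
instructions in a rotated loop where every access uses the same index
advancing by 1, and it ignores register pressure and scheduling. Under
these assumptions the register-offset form wins on ARMv6-M once two or
more bases share the index, while on a target with post-indexed writeback
such as ARMv7-M, folding the increment into the load/store is cheaper
still, so a v6-M change should not override it there.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical addressing strategies for a byte-copy style loop in which
 * every memory access uses the same index, advancing by 1 per iteration. */
enum addr_strategy {
    ADDR_POINTER_BUMP,  /* ldrb r4, [rN] then adds rN, rN, #1 for each base      */
    ADDR_SHARED_INDEX,  /* ldrb r4, [rN, rIdx]; one adds rIdx, rIdx, #1 in total */
    ADDR_POST_INDEXED   /* ldrb r4, [rN], #1 (writeback; Thumb-2 only, e.g. v7-M) */
};

/* Loop-overhead instructions per iteration (everything except the loads
 * and stores themselves), assuming a rotated loop with the test at the
 * bottom. nbases is the number of base pointers sharing the index. */
unsigned overhead_per_iteration(enum addr_strategy s, unsigned nbases) {
    assert(nbases > 0);
    switch (s) {
    case ADDR_POINTER_BUMP:
        return nbases + 2;  /* nbases pointer adds + subs counter + bne */
    case ADDR_SHARED_INDEX:
        return 3;           /* adds index + cmp index, len + bne */
    case ADDR_POST_INDEXED:
        return 2;           /* increments folded into ldrb/strb; subs + bne */
    }
    return 0; /* not reached for valid enum values */
}

/* Pick the cheapest strategy the target supports under this model.
 * has_writeback is true for targets such as ARMv7-M whose ldrb/strb can
 * update the base register; ARMv6-M (Thumb-1) has no such form. */
enum addr_strategy choose_strategy(unsigned nbases, bool has_writeback) {
    enum addr_strategy best = ADDR_POINTER_BUMP;
    if (overhead_per_iteration(ADDR_SHARED_INDEX, nbases) <
        overhead_per_iteration(best, nbases))
        best = ADDR_SHARED_INDEX;
    if (has_writeback &&
        overhead_per_iteration(ADDR_POST_INDEXED, nbases) <
        overhead_per_iteration(best, nbases))
        best = ADDR_POST_INDEXED;
    return best;
}
```

For the memcpy_alt1 listings, the model's 4 overhead instructions for the
pointer-bump form match clang's loop (two adds, subs, bne). GCC's loop
also spends 4 (adds, cmp, bne, and an unconditional b) because its test
is at the top; a rotated version of the same loop would need 3.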
So far, I have been reading through the LLVM documentation to see where
such a change could be applied. I have also:
- Compiled with -S -emit-llvm (see Godbolt link).
There is an identifiable pattern where a getelementptr instruction is
followed by a load or store. When multiple getelementptr instructions use
the same virtual register as their offset, maybe this should match a
tLDRr or tSTRr.
- Ran llc with --print-machineinstrs.
It appears that tLDRBi and tSTRBi are selected very early and are never
replaced by the equivalent t<LDRB|STRB>r instructions.
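The getelementptr observation in the first item above can be sketched as
a toy analysis. In the C code below, memory accesses are reduced to pairs
of value numbers (base, offset), and any access whose offset value is
shared with another access is flagged as a register-offset candidate.
Everything here (the struct, the functions, the "two or more uses"
threshold) is hypothetical; a real implementation would operate on
SelectionDAG nodes or MachineInstrs rather than on this array.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for "getelementptr base, offset" followed by a load/store.
 * Values are identified by small integers, as if they were SSA value numbers. */
struct mem_access {
    int base;     /* value number of the base pointer */
    int offset;   /* value number of the (non-constant) offset */
    int is_store; /* 0 = load, 1 = store */
};

/* Count the accesses in a block that use the offset value `offset`. */
size_t count_offset_uses(const struct mem_access *acc, size_t n, int offset) {
    size_t uses = 0;
    for (size_t i = 0; i < n; ++i)
        if (acc[i].offset == offset)
            ++uses;
    return uses;
}

/* Flag each access whose offset is shared with at least one other access.
 * Under the proposed heuristic, these would be selected as register-offset
 * forms (e.g. tLDRBr/tSTRBr) instead of getting a separately advanced base.
 * Writes 1 or 0 into is_candidate[i]; returns how many were flagged. */
size_t mark_reg_offset_candidates(const struct mem_access *acc, size_t n,
                                  int *is_candidate) {
    size_t marked = 0;
    assert(n == 0 || (acc != NULL && is_candidate != NULL));
    for (size_t i = 0; i < n; ++i) {
        is_candidate[i] = count_offset_uses(acc, n, acc[i].offset) >= 2;
        marked += (size_t)is_candidate[i];
    }
    return marked;
}
```

For memcpy_alt1, the load from src + i and the store to dst + i share the
offset i, so both accesses would be flagged.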
Thank you,
Daniel Way