[llvm-dev] [ARM] Should Use Load and Store with Register Offset

Sun Jul 19 22:54:53 PDT 2020

Hello LLVM Community (specifically anyone working with ARM Cortex-M),

While compiling the Newlib C library, I found that Clang 10 generates
slightly larger binaries than the libc shipped with the prebuilt
gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy,
strcpy, etc.) and noticed that LLVM tends not to generate load/store
instructions with a register offset (e.g. the ldr Rd, [Rn, Rm] form) and
instead prefers the immediate offset form.

When copying a contiguous sequence of bytes, this requires additional
instructions to update each base address. https://godbolt.org/z/T1xhae

#include <stddef.h>

void* memcpy_alt1(void* dst, const void* src, size_t len) {
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        /* cast before indexing: arithmetic on void* is a GNU extension */
        ((char*)dst)[i] = ((const char*)src)[i];
    return save;
}

clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
memcpy_alt1:
        push    {r4, lr}
        cmp     r2, #0
        beq     .LBB0_3
        mov     r3, r0
.LBB0_2:
        ldrb    r4, [r1]
        strb    r4, [r3]
        adds    r1, r1, #1
        adds    r3, r3, #1
        subs    r2, r2, #1
        bne     .LBB0_2
.LBB0_3:
        pop     {r4, pc}

arm-none-eabi-gcc -march=armv6-m -Os
memcpy_alt1:
        movs    r3, #0
        push    {r4, lr}
.L2:
        cmp     r3, r2
        bne     .L3
        pop     {r4, pc}
.L3:
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .L2
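
As a rough tally of the two listings above, the sketch below turns the
instruction counts into byte sizes. The counts are read by hand from the
listings, and the 2-byte size assumes every instruction shown uses a 16-bit
Thumb encoding. The struct and function names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: each instruction in the listings is a 16-bit Thumb encoding. */
enum { THUMB16_BYTES = 2 };

/* Hypothetical record of counts taken by hand from one listing. */
struct listing_stats {
    size_t total_insns;         /* instructions in the whole function */
    size_t insns_per_iteration; /* instructions executed per copied byte */
};

/* clang: push, cmp, beq, mov, ldrb, strb, adds, adds, subs, bne, pop */
static const struct listing_stats clang_listing = { 11, 6 };

/* gcc: movs, push, cmp, bne, pop, ldrb, strb, adds, b */
static const struct listing_stats gcc_listing = { 9, 6 };

/* Code size in bytes under the 16-bit encoding assumption. */
static size_t code_size_bytes(struct listing_stats s) {
    return s.total_insns * THUMB16_BYTES;
}
```

Under those assumptions the clang version is 22 bytes and the gcc version
18 bytes. Both execute six instructions per copied byte, so the difference
visible in these listings is in code size.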

Because this code sits in a loop that could be copying hundreds of bytes,
I want to add an optimization that prefers load/store instructions with
register offsets when the same offset is used by multiple accesses. I have
not worked on LLVM before, so I'd like advice about where to start.
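
To make the two loop shapes concrete at the source level, here is a minimal
C sketch. The function names are hypothetical, and a compiler is free to
turn either shape into the other, so this only illustrates the addressing
patterns in the listings above and is not a way to force one form.

```c
#include <stddef.h>

/* Pointer-bump shape: both pointers advance and a separate counter runs
   down, mirroring the immediate-offset loop (two adds and one subs). */
void *copy_ptr_bump(void *dst, const void *src, size_t len) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len != 0) {
        *d = *s;
        ++d;
        ++s;
        --len;
    }
    return dst;
}

/* Shared-index shape: both bases stay fixed and one index serves as the
   offset for the load and the store, mirroring [r1, r3] and [r0, r3]. */
void *copy_shared_index(void *dst, const void *src, size_t len) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i != len; ++i)
        d[i] = s[i];
    return dst;
}
```

In the shared-index shape the index doubles as the loop counter, which is
why the register-offset listing needs only one adds per byte.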

   - The generated code is correct, just sub-optimal. Is it appropriate to
   submit a bug report?
   - Is anyone already working on this change, or is there someone with
   more experience who is interested in collaborating?
   - Is this optimization better performed early, during instruction
   selection, or late, in C++ (i.e. ARMLoadStoreOptimizer.cpp)?
   - What is the potential to harm other parts of code generation,
   specifically for other ARM targets? I'm working with armv6m, but armv7m
   offers base register updating in a single instruction. I don't want to
   break other useful optimizations.

So far, I am reading through the LLVM documentation to see where a change
could be applied. I have also:

   - Compiled with -S -emit-llvm (see Godbolt link)
   There is an identifiable pattern where a getelementptr instruction is
   followed by a load or store. When multiple getelementptr instructions
   appear with the same virtual register offset, maybe this should match a
   tLDRr or tSTRr.
   - Ran llc with --print-machineinstrs
   It appears that tLDRBi and tSTRBi are selected very early and never
   replaced by the equivalent t<LDRB|STRB>r instructions.

Thank you,

Daniel Way