[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Sjoerd Meijer via llvm-dev
llvm-dev at lists.llvm.org
Mon Jul 20 02:15:08 PDT 2020
Hello Daniel,
LLVM and GCC's optimisation levels are not really equivalent. In Clang, -Os makes a performance/code-size trade-off; in GCC, -Os minimises code size, which is closer to Clang's -Oz. I haven't looked into the details yet, but does changing -Os to -Oz in the godbolt link give the codegen you're looking for?
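For what it's worth, the flag pairing that should be closest to apples-to-apples looks like this (source file name assumed; these are just the commands from the godbolt link with Clang's -Os swapped for -Oz):

```shell
# Clang's -Oz minimises code size, like GCC's -Os;
# Clang's -Os still trades some size for performance.
clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer -S memcpy_alt1.c
arm-none-eabi-gcc -march=armv6-m -Os -S memcpy_alt1.c
```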
Cheers,
Sjoerd.
________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel Way via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 20 July 2020 06:54
To: llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] [ARM] Should Use Load and Store with Register Offset
Hello LLVM Community (specifically anyone working with ARM Cortex-M),
While trying to compile the Newlib C library, I found that Clang 10 was generating slightly larger binaries than the libc from the prebuilt gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy, strcpy, etc.) and noticed that LLVM tends not to generate load/store instructions with a register offset (e.g. the ldr Rd, [Rn, Rm] form), preferring the immediate-offset form instead.
When copying a contiguous sequence of bytes, this results in additional instructions to advance the base addresses. https://godbolt.org/z/T1xhae
void* memcpy_alt1(void* dst, const void* src, size_t len) {
    char* save = (char*)dst;
    for (size_t i = 0; i < len; ++i)
        ((char*)dst)[i] = ((const char*)src)[i];
    return save;
}
clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
memcpy_alt1:
        push    {r4, lr}
        cmp     r2, #0
        beq     .LBB0_3
        mov     r3, r0
.LBB0_2:
        ldrb    r4, [r1]
        strb    r4, [r3]
        adds    r1, r1, #1
        adds    r3, r3, #1
        subs    r2, r2, #1
        bne     .LBB0_2
.LBB0_3:
        pop     {r4, pc}
arm-none-eabi-gcc -march=armv6-m -Os
memcpy_alt1:
        movs    r3, #0
        push    {r4, lr}
.L2:
        cmp     r3, r2
        bne     .L3
        pop     {r4, pc}
.L3:
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .L2
Because this code appears in a loop that could be copying hundreds of bytes, I want to add an optimization that prioritizes load/store instructions with register offsets when the same offset register is used multiple times. I have not worked on LLVM before, so I'd like advice about where to start.
* The generated code is correct, just sub-optimal, so is it appropriate to submit a bug report?
* Is anyone already tackling this change, or is there someone with more experience interested in collaborating?
* Is this optimization better performed early, during instruction selection, or late, in C++ (i.e. ARMLoadStoreOptimizer.cpp)?
* What is the potential to harm other parts of the codegen, specifically for other Arm targets? I'm working with Armv6-M, but Armv7-M offers base-register updating in a single instruction. I don't want to break other useful optimizations.
So far, I am reading through the LLVM documentation to see where a change could be applied. I have also:
* Compiled with -S -emit-llvm (see Godbolt link)
There is an identifiable pattern where a getelementptr instruction is followed by a load or store. When multiple getelementptr instructions use the same virtual register as the offset, maybe this should match a tLDRr or tSTRr.
* Ran llc with --print-machineinstrs
It appears that tLDRBi and tSTRBi are selected very early and never replaced by the equivalent tLDRBr and tSTRBr instructions.
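For anyone wanting to reproduce the observation, the commands look roughly like this (file names illustrative):

```shell
# Emit LLVM IR to inspect the getelementptr + load/store pattern
clang --target=armv6m-none-eabi -Os -S -emit-llvm memcpy_alt1.c -o memcpy_alt1.ll

# Watch instruction selection pick tLDRBi/tSTRBi, and check whether any
# later machine pass (e.g. ARMLoadStoreOptimizer) rewrites them
llc -mtriple=armv6m-none-eabi -print-machineinstrs memcpy_alt1.ll -o /dev/null
```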
Thank you,
Daniel Way