[llvm-dev] [ARM] Should Use Load and Store with Register Offset

Tue Jul 21 20:33:58 PDT 2020

Thank you, Sjoerd.

Your high-level comments are very helpful and much appreciated. I ended up
rebuilding the Newlib-nano source with -Oz instead of -Os and found an
overall improvement in code size. The final size is still larger than that
of the libc from the gcc-arm-none-eabi toolchain. Of course there are a few
caveats to this:

   - Newlib is designed around GCC;
   - I'm not sure I perfectly reproduced the build settings for the
   pre-built toolchain (macros, etc.);
   - and this comparison considers all libc functions, many of which may
   not end up in the final image.

For now, I've submitted *BUG 46801* for the case when -Oz produces more
instructions than -Os. I don't know if it needs to be a priority, but I
thought it should be recorded.
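
For reference, the function behind that report is presumably memcpy_alt2
from the earlier message, whose exact source sits behind the Godbolt link.
The following is only a sketch of a function of that shape, assuming a
byte-wise copy that starts at the high address; the name memcpy_alt2_sketch
and the loop form are illustrative and may differ from what was actually
compiled.

```c
#include <stddef.h>

/* Hypothetical reconstruction, not the original source:
 * copies len bytes from src to dst, highest address first. */
void *memcpy_alt2_sketch(void *dst, const void *src, size_t len) {
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    /* len is tested, then decremented, so the body sees len-1 ... 0. */
    while (len-- > 0)
        d[len] = s[len];

    return dst;
}
```

Compiling a function like this with clang --target=armv6m-none-eabi at -Os
and at -Oz and counting the instructions is one way to check whether the
difference still shows up.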

I may try benchmarking the memcpy implementations as well as a few other
libc functions, but I haven't done this before. Of course, I'll share my
results if I do end up testing.
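
Before measuring on hardware, a rough static estimate of the memcpy_alt1
loops quoted below may be a useful sanity check. This sketch assumes
Cortex-M0+ costs of 1 cycle for an ALU or compare instruction, 2 cycles for
a load or store, 2 cycles for a taken branch and 1 cycle for a conditional
branch that falls through, all with zero-wait-state memory. Those figures
are assumptions to be checked against the TRM, and the helper and array
names are made up for illustration.

```c
#include <stddef.h>

/* Assumed Cortex-M0+ cycle costs (zero wait states); verify against the TRM. */
enum {
    CYC_ALU = 1,          /* adds, subs, cmp */
    CYC_LDST = 2,         /* ldrb, strb */
    CYC_BR_TAKEN = 2,     /* b, or a conditional branch that is taken */
    CYC_BR_NOT_TAKEN = 1  /* conditional branch that falls through */
};

/* Steady-state loop bodies of memcpy_alt1, one entry per instruction. */

/* Clang -Os: ldrb, strb, adds, adds, subs, bne (taken) */
const unsigned clang_os_loop[] = {
    CYC_LDST, CYC_LDST, CYC_ALU, CYC_ALU, CYC_ALU, CYC_BR_TAKEN
};

/* Clang -Oz: cmp, beq (not taken), ldrb, strb, adds, b */
const unsigned clang_oz_loop[] = {
    CYC_ALU, CYC_BR_NOT_TAKEN, CYC_LDST, CYC_LDST, CYC_ALU, CYC_BR_TAKEN
};

/* GCC -Os: cmp, bne (taken), ldrb, strb, adds, b */
const unsigned gcc_os_loop[] = {
    CYC_ALU, CYC_BR_TAKEN, CYC_LDST, CYC_LDST, CYC_ALU, CYC_BR_TAKEN
};

/* Add up the per-instruction costs of one loop iteration. */
unsigned iteration_cycles(const unsigned *costs, size_t count) {
    unsigned total = 0;
    for (size_t i = 0; i < count; ++i)
        total += costs[i];
    return total;
}
```

Under those assumptions the Clang -Os and Clang -Oz loops both come to 9
cycles per byte and the GCC -Os loop to 10, so the size difference would not
necessarily show up as a speed difference. Only a real measurement can
confirm that.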

Thank you for the help.
Daniel Way

On Tue, Jul 21, 2020 at 6:05 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:

> Hi Daniel,
>
> Your observations seem valid to me. Some high-level comments from my side.
>
> As you said, the loops are quite similar. We have also observed that in
> general we generate more code around loops, in the function prologue and
> epilogue, where some data and arguments get moved and reshuffled etc. While
> this is very obvious in these micro-benchmarks, it hasn't bothered us
> enough yet for larger apps where this is less important (or where other
> things are more important). The outlier does indeed look to be Clang -Oz for
> memcpy_alt2, which is perhaps a "code-size bug". As I haven't looked into
> it, it's too early for me to blame this on just the addressing modes as
> there could be several things going on.
>
> Since this is a micro-benchmark, and lowering memcpy is a bit of an art
> ;-), for which a specialised implementation is probably available, you
> might want to look at some other code too that is important to you.
>
> Your remarks about execution times might be right too, and as you said,
> probably best confirmed with benchmark numbers. In our group, we have not
> really looked into performance for the Cortex-M0, probably because it's the
> only v6m core (although the Cortex-M23 and Armv8-M Baseline are very
> similar) and code-size would be more important for us, but there might be
> something to be gained here.
>
> Cheers,
> Sjoerd.
> ------------------------------
> *From:* Daniel Way <p.waydan at gmail.com>
> *Sent:* 21 July 2020 08:12
> *To:* Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> *Cc:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] [ARM] Should Use Load and Store with Register
> Offset
>
> Hello Sjoerd,
>
> Thank you for your response! I was not aware that -Oz is a closer
> equivalent to GCC's -Os. I tried -Oz when compiling with clang and
> confirmed that Clang's generated assembly is equivalent to GCC's for the
> code snippet I posted above.
>
> clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer
> memcpy_alt1:
>         push    {r4, lr}
>         movs    r3, #0
> .LBB0_1:
>         cmp     r2, r3
>         beq     .LBB0_3
>         ldrb    r4, [r1, r3]
>         strb    r4, [r0, r3]
>         adds    r3, r3, #1
>         b       .LBB0_1
> .LBB0_3:
>         pop     {r4, pc}
>
> On the other hand, -O2 in GCC still uses the register-offset load and
> store instructions while Clang -O2 generates the same assembly as -Os:
> immediate-offset (0 offset) load/store followed by incrementing the base
> register addresses.
> I have not tried to benchmark the Clang-generated code; it is possible
> that execution time is bounded by the load and store instructions and
> memory access latency. From an intuitive view, however, both GCC and Clang
> are generating code with 1 load and 1 store, so if Clang inserts two
> additional adds instructions, the binary size is larger, execution *could* be
> slower, and there's no improvement in register utilization over GCC.
>
> I wanted to try a couple other variants of memcpy-like functions. The
> https://godbolt.org/z/d7P6rG link includes memcpy_alt2 which copies data
> from src to dst starting at the high address and memcpy_silly which
> copies src to dst<0-4>. Here is the behavior I have noticed from GCC and
> Clang.
>
> *memcpy_alt2*
>
>    - With -Os, GCC generates just 6 instructions. -O2 generates 7 but
>    reduces branching to once per loop.
>    - Clang with -Os or -O2 does a decent job of using a common register
>    to offset the load and store bases. It adds some overhead, though, by
>    pre-decrementing the base registers. 10 instructions generated.
>    - Clang with -Oz is pathological, generating 13 instructions. It uses
>    register-offset load/store instructions, but uses different registers for
>    the offsets.
>
> *memcpy_silly*
>
>    - I created this case to see if clang would select load/store with a
>    common offset register once enough load instructions were added.
>    - Clang with -Os or -O2 does not seem to care about register-offset
>    load/store and prefers to increment each base register address.
>    - Clang with -Oz performs the optimization I want. It produces the
>    same number of instructions as GCC, and avoids an issue where GCC has to
>    re-read the same value from the stack each time through the loop.
>
>
> I really think that, when limited to the Thumb1 ISA, register-offset load
> and store instructions should be used at -Oz, -Os, and -O2 optimization
> levels. Explicitly incrementing a register holding the base address seems
> unnecessary and wasteful, and I cannot see how it will
> improve execution time in the examples I'm investigating. I'd like to know
> if I'm wrong in assuming that LDR Rd, [Rn, Rm] and LDR Rd, [Rn, #<imm>]
> have the same execution time, but based on the Cortex-M0+ TRM they should
> both require 2 clock cycles.
>
> Best regards,
>
> Daniel Way
>
>
> On Mon, Jul 20, 2020 at 6:15 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> wrote:
>
> Hello Daniel,
>
> LLVM and GCC's optimisation levels are not really equivalent. In Clang,
> -Os makes a performance and code-size trade off. In GCC, -Os is minimising
> code-size, which is equivalent to -Oz with Clang. I haven't looked into
> details yet, but does changing -Os to -Oz in the godbolt link give the
> codegen you're looking for?
>
> Cheers,
> Sjoerd.
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel
> Way via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* 20 July 2020 06:54
> *To:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [ARM] Should Use Load and Store with Register Offset
>
> Hello LLVM Community (specifically anyone working with ARM Cortex-M),
>
> While trying to compile the Newlib C library I found that Clang10 was
> generating slightly larger binaries than the libc from the prebuilt
> gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy,
> strcpy, etc.) and noticed that LLVM does not tend to generate load/store
> instructions with a register offset (e.g. ldr Rd, [Rn, Rm] form) and
> instead prefers the immediate offset form.
>
> When copying a contiguous sequence of bytes, this results in additional
> instructions to modify the base address. https://godbolt.org/z/T1xhae
>
> void* memcpy_alt1(void* dst, const void* src, size_t len) {
>     char* save = (char*)dst;
>     for (size_t i = 0; i < len; ++i)
>         ((char*)dst)[i] = ((const char*)src)[i];
>     return save;
> }
>
> clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
> memcpy_alt1:
>         push    {r4, lr}
>         cmp     r2, #0
>         beq     .LBB0_3
>         mov     r3, r0
> .LBB0_2:
>         ldrb    r4, [r1]
>         strb    r4, [r3]
>         adds    r1, r1, #1
>         adds    r3, r3, #1
>         subs    r2, r2, #1
>         bne     .LBB0_2
> .LBB0_3:
>         pop     {r4, pc}
>
> arm-none-eabi-gcc -march=armv6-m -Os
> memcpy_alt1:
>         movs    r3, #0
>         push    {r4, lr}
> .L2:
>         cmp     r3, r2
>         bne     .L3
>         pop     {r4, pc}
> .L3:
>         ldrb    r4, [r1, r3]
>         strb    r4, [r0, r3]
>         adds    r3, r3, #1
>         b       .L2
>
> Because this code appears in a loop that could be copying hundreds of
> bytes, I want to add an optimization that will prioritize load/store
> instructions with register offsets when the offset is used multiple times.
> I have not worked on LLVM before, so I'd like advice about where to start.
>
>    - The generated code is correct, just sub-optimal, so is it appropriate
>    to submit a bug report?
>    - Is anyone already tackling this change or is there someone with more
>    experience interested in collaborating?
>    - Is this optimization better performed early during instruction
>    selection or late using C++ (e.g. ARMLoadStoreOptimizer.cpp)?
>    - What is the potential to cause harm to other parts of the code gen,
>    specifically for other Arm targets? I'm working with armv6m, but armv7m
>    offers base register updating in a single instruction. I don't want to
>    break other useful optimizations.
>
> So far, I am reading through the LLVM documentation to see where a change
> could be applied. I have also:
>
>    - Compiled with -S -emit-llvm (see Godbolt link)
>    There is an identifiable pattern where a getelementptr instruction is
>    followed by a load or store. When multiple getelementptr instructions
>    appear with the same virtual register offset, maybe this should match a
>    tLDRr or tSTRr.
>    - Ran llc with --print-machineinstrs
>    It appears that tLDRBi and tSTRBi are selected very early and never
>    replaced by the equivalent t<LDRB|STRB>r instructions.
>
> Thank you,
>
> Daniel Way
>
>