[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Daniel Way via llvm-dev
llvm-dev at lists.llvm.org
Tue Jul 21 20:33:58 PDT 2020
Thank you, Sjoerd.
Your high-level comments are very helpful and much appreciated. I ended up
rebuilding the Newlib-nano source with -Oz instead of -Os and found an
overall improvement in code size. The final size is still larger than that of
the gcc-arm-none-eabi toolchain's libc. Of course, there are a few caveats to this:
- Newlib is designed around GCC;
- I'm not sure I perfectly reproduced the build settings for the
pre-built toolchain (macros, etc.);
- and this comparison considers all libc functions, many of which may
not end up in the final image.
For now, I've submitted *BUG 46801* for the case when -Oz produces more
instructions than -Os. I don't know if it needs to be a priority, but
thought it should be recorded.
I may try benchmarking the memcpy implementations as well as a few other
libc functions, but I haven't done this before. Of course, I'll share my
results if I do end up testing.
Thank you for the help.
Daniel Way
On Tue, Jul 21, 2020 at 6:05 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:
> Hi Daniel,
>
> Your observations seem valid to me. Some high-level comments from my side.
>
> As you said, the loops are quite similar. We have also observed that in
> general we generate more code around loops, in the function prologue and
> epilogue, where some data and arguments get moved and reshuffled etc. While
> this is very obvious in these micro-benchmarks, it hasn't bothered us
> enough yet for larger apps where this is less important (or where other
> things are more important). The outlier does indeed look to be Clang -Oz for
> memcpy_alt2, which is perhaps a "code-size bug". As I haven't looked into
> it, it's too early for me to blame this on just the addressing modes as
> there could be several things going on.
>
> Since this is a micro-benchmark, and lowering memcpy is a bit of an art
> ;-), for which a specialised implementation is probably available, you
> might want to look at some other codes too that are important for you.
>
> Your remarks about execution times might be right too, and as you said,
> probably best confirmed with benchmark numbers. In our group, we have not
> really looked into performance for the Cortex-M0, probably because it's the
> only v6-M core (although the Cortex-M23 and Armv8-M Baseline are very
> similar) and code size would be more important for us, but there might be
> something to be gained here.
>
> Cheers,
> Sjoerd.
> ------------------------------
> *From:* Daniel Way <p.waydan at gmail.com>
> *Sent:* 21 July 2020 08:12
> *To:* Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> *Cc:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] [ARM] Should Use Load and Store with Register
> Offset
>
> Hello Sjoerd,
>
> Thank you for your response! I was not aware that -Oz is a closer
> equivalent to GCC's -Os. I tried -Oz when compiling with clang and
> confirmed that Clang's generated assembly is equivalent to GCC's for the
> code snippet I posted above.
>
> clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer
> memcpy_alt1:
>         push    {r4, lr}
>         movs    r3, #0
> .LBB0_1:
>         cmp     r2, r3
>         beq     .LBB0_3
>         ldrb    r4, [r1, r3]
>         strb    r4, [r0, r3]
>         adds    r3, r3, #1
>         b       .LBB0_1
> .LBB0_3:
>         pop     {r4, pc}
>
> On the other hand, -O2 in GCC still uses the register-offset load and
> store instructions while Clang -O2 generates the same assembly as -Os:
> immediate-offset (0 offset) load/store followed by incrementing the base
> register addresses.
> I have not tried to benchmark the Clang-generated code; it is possible
> that execution time is bounded by the load and store instructions and
> memory access latency. From an intuitive view, however, both GCC and Clang
> are generating code with 1 load and 1 store, so if Clang inserts two
> additional adds instructions, the binary size is larger, execution *could* be
> slower, and there's no improvement in register utilization over GCC.
>
> I wanted to try a couple other variants of memcpy-like functions. The
> https://godbolt.org/z/d7P6rG link includes memcpy_alt2, which copies data
> from src to dst starting at the high address, and memcpy_silly, which
> copies src to dst<0-4>. Here is the behavior I have noticed from GCC and
> Clang.
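> In case the link rots, the two variants look roughly like this (my sketch
> from the description above; the exact code, including memcpy_silly's real
> parameter list, is in the godbolt link):

```c
#include <stddef.h>

/* Sketch of memcpy_alt2: copies from the highest address downward. */
void *memcpy_alt2(void *dst, const void *src, size_t len) {
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t i = len; i > 0; --i)
        d[i - 1] = s[i - 1];
    return dst;
}

/* Sketch of memcpy_silly: one source fanned out to several
 * destinations, so a single offset register could serve every
 * access. The parameter list is illustrative, not the original. */
void memcpy_silly(char *dst0, char *dst1, char *dst2,
                  const char *src, size_t len) {
    for (size_t i = 0; i < len; ++i) {
        dst0[i] = src[i];
        dst1[i] = src[i];
        dst2[i] = src[i];
    }
}
```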
>
> *memcpy_alt2*
>
> - With -Os, GCC generates just 6 instructions. -O2 generates 7 but
> reduces branching to once per loop.
> - Clang with -Os or -O2 does a decent job of using a common register
> to offset the load and store bases. It adds some overhead, though, by
> pre-decrementing the base registers. 10 instructions generated.
> - Clang with -Oz is pathological, generating 13 instructions. It uses
> register-offset load/store instructions, but uses different registers for
> the offsets.
>
> *memcpy_silly*
>
> - I created this case to see if Clang would select load/store with a
> common offset register once enough load instructions were added.
> - Clang with -Os or -O2 does not seem to care about register-offset
> load/store and prefers to increment each base register address.
> - Clang with -Oz performs the optimization I want. It produces the
> same number of instructions as GCC, and avoids an issue where GCC has to
> re-read the same value from the stack each time through the loop.
>
>
> I really think that, when limited to the Thumb-1 ISA, register-offset load
> and store instructions should be used at -Oz, -Os, and -O2 optimization
> levels. Explicitly incrementing a register holding the base address seems
> wasteful, and I cannot see how it improves execution time in the examples
> I'm investigating. I'd like to know
> if I'm wrong in assuming that LDR Rd, [Rn, Rm] and LDR Rd, [Rn, #<imm>]
> have the same execution time, but based on the Cortex-M0+ TRM they should
> both require 2 clock cycles.
>
> Best regards,
>
> Daniel Way
>
>
> On Mon, Jul 20, 2020 at 6:15 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com>
> wrote:
>
> Hello Daniel,
>
> LLVM and GCC's optimisation levels are not really equivalent. In Clang,
> -Os makes a performance/code-size trade-off. In GCC, -Os minimises
> code size, which is equivalent to -Oz with Clang. I haven't looked into the
> details yet, but does changing -Os to -Oz in the godbolt link give the
> codegen you're looking for?
>
> Cheers,
> Sjoerd.
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel
> Way via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* 20 July 2020 06:54
> *To:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [ARM] Should Use Load and Store with Register Offset
>
> Hello LLVM Community (specifically anyone working with ARM Cortex-M),
>
> While trying to compile the Newlib C library, I found that Clang 10 was
> generating slightly larger binaries than the libc from the prebuilt
> gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy,
> strcpy, etc.) and noticed that LLVM does not tend to generate load/store
> instructions with a register offset (e.g. ldr Rd, [Rn, Rm] form) and
> instead prefers the immediate offset form.
>
> When copying a contiguous sequence of bytes, this results in additional
> instructions to modify the base address. https://godbolt.org/z/T1xhae
>
> void* memcpy_alt1(void* dst, const void* src, size_t len) {
>     char* save = (char*)dst;
>     for (size_t i = 0; i < len; ++i)
>         *((char*)dst + i) = *((const char*)src + i);
>     return save;
> }
>
> clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
> memcpy_alt1:
>         push    {r4, lr}
>         cmp     r2, #0
>         beq     .LBB0_3
>         mov     r3, r0
> .LBB0_2:
>         ldrb    r4, [r1]
>         strb    r4, [r3]
>         adds    r1, r1, #1
>         adds    r3, r3, #1
>         subs    r2, r2, #1
>         bne     .LBB0_2
> .LBB0_3:
>         pop     {r4, pc}
>
> arm-none-eabi-gcc -march=armv6-m -Os
> memcpy_alt1:
>         movs    r3, #0
>         push    {r4, lr}
> .L2:
>         cmp     r3, r2
>         bne     .L3
>         pop     {r4, pc}
> .L3:
>         ldrb    r4, [r1, r3]
>         strb    r4, [r0, r3]
>         adds    r3, r3, #1
>         b       .L2
>
> Because this code appears in a loop that could be copying hundreds of
> bytes, I want to add an optimization that will prioritize load/store
> instructions with register offsets when the offset is used multiple times.
> I have not worked on LLVM before, so I'd like advice about where to start.
>
> - The generated code is correct, just sub-optimal, so is it appropriate
> to submit a bug report?
> - Is anyone already tackling this change or is there someone with more
> experience interested in collaborating?
> - Is this optimization better performed early during instruction
> selection, or late in C++ (e.g. ARMLoadStoreOptimizer.cpp)?
> - What is the potential to cause harm to other parts of the codegen,
> specifically for other Arm targets? I'm working with armv6m, but armv7m
> offers base register updating in a single instruction. I don't want to
> break other useful optimizations.
>
> So far, I am reading through the LLVM documentation to see where a change
> could be applied. I have also:
>
> - Compiled with -S -emit-llvm (see Godbolt link)
> There is an identifiable pattern where a getelementptr instruction is
> followed by a load or store. When multiple getelementptr instructions appear
> with the same virtual register offset, maybe this should match a tLDRr or
> tSTRr.
> - Ran llc with --print-machineinstrs
> It appears that tLDRBi and tSTRBi are selected very early and never
> replaced by the equivalent t<LDRB|STRB>r instructions.
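> To make the pattern concrete, here is the kind of loop that produces it,
> with my rough paraphrase of the IR in the comment (not exact -emit-llvm
> output):

```c
#include <stddef.h>

/* Both memory accesses use the same index i. In the IR this shows up
 * as two getelementptrs sharing one offset value, approximately:
 *
 *   %s = getelementptr inbounds i8, i8* %src, i32 %i
 *   %v = load i8, i8* %s
 *   %d = getelementptr inbounds i8, i8* %dst, i32 %i
 *   store i8 %v, i8* %d
 *
 * which is the shape that could match tLDRBr/tSTRBr instead of the
 * tLDRBi/tSTRBi plus explicit adds that get selected today. */
void copy_bytes(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```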
>
> Thank you,
>
> Daniel Way
>
>