[LLVMdev] [RFC] [ARM] v6m: Suggestions for a slightly different set of default optimizer settings.

Sun Jan 11 11:35:43 PST 2015

Hello to all.

Judging from forums and mailing lists, LLVM does not seem to be widely
used for very small ARM v6m targets.

In the last months, I have spent some time analyzing the performance of
LLVM/clang for very small targets. My main objective was to get the best
possible performance from portable (non-assembly) crypto numerics for
Cortex-M0(+) targets.
In the end (the crypto paper is under review and not yet published),
LLVM performed best, outperforming gcc and several versions of armcc by
a *very* significant factor.
In this mail I would like to summarize some of my results. Based on my
analysis, I am convinced that LLVM can be an excellent solution for
small bare-metal targets as well, with only a few changes to the
default settings.

Before suggesting new configurations for v6m, I would first like to
propose a definition of the "typical ARM v6m system" and its
optimization priorities from my perspective.

In my opinion the most important v6m targets will be Cortex-M0 systems.
If somebody is using an M0 and not an M3/M4, the system will be either
very *low cost* or very *low power*. Otherwise thumb2 would be
available. Low cost and low power mean, first of all, "little RAM",
because of the silicon area and the leakage currents.
I assume that the vast majority of these systems will be flash-based or
ROM-based microcontrollers. I have worked on quite a lot of these
systems. In all such projects I have been involved in, the main
bottleneck was RAM and not program memory.
This is why my first patch to LLVM deals with tail call optimization
for thumb1 and its possible benefit with respect to stack usage.
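
To illustrate why tail calls matter for RAM, here is a minimal sketch
in C. The function names and the checksum itself are hypothetical and
are not taken from the patch. The recursive call is the last action of
the function, so a compiler that performs tail call optimization can
turn it into a jump and reuse the current stack frame; without that
optimization, each step may need a new frame.

```c
#include <stdint.h>

/* Hypothetical helper: the recursive call is in tail position, so no
 * work remains in this frame after the call returns. */
static uint32_t checksum_step(const uint8_t *p, uint32_t n, uint32_t acc)
{
    if (n == 0)
        return acc;
    /* Tail call: may be lowered to a branch instead of a call, which
     * keeps stack usage constant instead of growing with n. */
    return checksum_step(p + 1, n - 1, acc * 31u + *p);
}

/* Hypothetical entry point that starts the accumulation at zero. */
uint32_t checksum(const uint8_t *p, uint32_t n)
{
    return checksum_step(p, n, 0u);
}
```

Whether the call is actually emitted as a jump depends on the target,
the optimization level and the compiler version.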

After optimizing for RAM usage, code size and speed are in my opinion
equally important second goals for such small embedded targets. Since
speed often translates directly into power, I think we should not focus
aggressively on code size alone.
In practice, however, my observation with crypto code was that
optimizing for size with -Os not only reduced code size but also gave
significantly better speed than -O2 or -O3 on the Cortex-M0. I did not
figure out exactly why, but I assume that some optimizations meant to
improve ILP scheduling at -O3 actually pessimize for the Cortex-M0.

In my observation on crypto code and signal processing, the main
bottlenecks of v6m are the slow memory interface and the high register
pressure due to effectively only 8 usable registers. Register spills
are thus *very* expensive. Using a frame pointer effectively reduces
the register set to 7 registers, requiring even more spills. This is
extremely costly for thumb1 targets.

So, as a first suggestion, I would like to propose enabling frame
pointer elimination by default in clang for v6m as soon as any
optimization level is chosen. With trunk clang, frame pointer
elimination currently seems to be disabled, even with -O3 or -Os. Was
there a specific reason not to enable it by default?
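
As a minimal sketch of the kind of code where one extra register
matters, consider a 256-bit addition on 32-bit limbs. The function name
add256 and the file name in the comment are hypothetical. Three
pointers, the loop counter, the carry and the temporaries are live at
the same time, which is already close to the 8 low registers. The
command line in the comment assumes clang's existing
-fomit-frame-pointer flag and a thumbv6m-none-eabi target triple.

```c
#include <stdint.h>

/* Possible build line (file name is an assumption):
 *   clang --target=thumbv6m-none-eabi -mcpu=cortex-m0 -Os \
 *         -fomit-frame-pointer -S add256.c -o add256.s
 * Comparing the output with and without -fomit-frame-pointer shows
 * whether a register is reserved as frame pointer. */

/* r = a + b over eight 32-bit limbs (least significant first).
 * Returns the carry out of the top limb (0 or 1). */
uint32_t add256(uint32_t r[8], const uint32_t a[8], const uint32_t b[8])
{
    uint32_t carry = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t s = a[i] + carry;
        uint32_t c1 = s < carry;   /* overflow from adding the carry */
        s += b[i];
        uint32_t c2 = s < b[i];    /* overflow from adding b[i] */
        r[i] = s;
        carry = c1 | c2;           /* at most one of them can be set */
    }
    return carry;
}
```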

Given the extreme register pressure, my analysis of the crypto code
performance showed that instruction scheduling is extremely important.
I observed that LLVM did a significantly better job when using the
-pre-RA-sched=source option. This was the main factor that allowed me
to make LLVM outperform gcc by almost a factor of 2.
Unfortunately, this option is not available at the clang interface
level, and I had to explicitly generate intermediate bitcode files. I
think it would help a lot to expose this feature at the clang level or
to make it part of the default optimization settings -Os, -O2, -O3.
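
The two-step workflow described above can be sketched on a small
kernel. ARMv6-M has no 32x32->64 multiply instruction, so portable
crypto code typically builds one from 16-bit halves, which produces
several partial products that are live at once. The function name and
the file names are hypothetical, and the command lines in the comment
are an assumption about how the steps could look, not a record of the
commands actually used. Clang's generic -mllvm pass-through may be
another route to the option; this is not verified here.

```c
#include <stdint.h>

/* Possible two-step build via bitcode (file names are assumptions):
 *   clang --target=thumbv6m-none-eabi -mcpu=cortex-m0 -Os \
 *         -emit-llvm -c mul.c -o mul.bc
 *   llc -pre-RA-sched=source mul.bc -o mul.s
 */

/* 32x32 -> 64 bit multiply using only 32x32 -> 32 bit multiplies. */
uint64_t mul32x32_64(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t ll = a_lo * b_lo;   /* contributes to bits  0..31 */
    uint32_t lh = a_lo * b_hi;   /* contributes to bits 16..47 */
    uint32_t hl = a_hi * b_lo;   /* contributes to bits 16..47 */
    uint32_t hh = a_hi * b_hi;   /* contributes to bits 32..63 */

    /* Sum of three 16-bit values plus carries; fits easily in 32 bits. */
    uint32_t mid = (ll >> 16) + (lh & 0xFFFFu) + (hl & 0xFFFFu);

    uint32_t lo = (ll & 0xFFFFu) | (mid << 16);
    uint32_t hi = hh + (lh >> 16) + (hl >> 16) + (mid >> 16);
    return ((uint64_t)hi << 32) | lo;
}
```

The order in which the four partial products are computed and consumed
determines how many values are live at once, which is where a
source-order or register-pressure-aware scheduler can make a
difference.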

I noticed that with today's head version, -pre-RA-sched appears to be
hidden from the end user. At first glance, -misched=ilpmin
-enable-misched -misched-regpressure gave similar results. I would like
to suggest using them (or a similar configuration) by default in clang
at higher optimization levels, provided, of course, that these passes
can be considered stable.

As a last point, it would be helpful to enable clang to link the code
itself. I did not manage to do this at first and have so far used gcc
for the final link. For the target's include paths, command line
switches are available. I have not yet found out (and have not yet
spent much time on) how to get linking to work. Perhaps it would be
best to start directly with binutils-gold instead of binutils-ld?

In summary, I am convinced that, with the above issues resolved, LLVM
will be an excellent choice for the very smallest targets as well.
Thanks to the LLVM community for doing an excellent job.

Yours,

Björn

P.S.: Some aspects related to compile time:

Bearing in mind that armv6m targets typically have 32k of program
memory, I expect that embedded software developers will be willing to
tolerate much longer compilation times. Maybe there are expensive
options that I am not aware of and that are currently not enabled by
default for compile-time reasons.

P.P.S.: Ideas for further code improvements of LLVM for small targets:

When comparing our hand-coded assembly version with the best
compiler-generated version, we observed a speed gain of almost a factor
of 2. It might be interesting to find out where the biggest weaknesses
of the compiler-generated code were, in order to find points for
improvement.
For the most important v6m systems, Cortex-M0 / M0+, the main speed
bottlenecks were register pressure and the slow (2-cycle) memory
accesses. Besides special tricks, the asm version gained by changing
internal calling conventions (no callee-saved registers, all registers
saved by the caller), by replacing individual LDR/STR instructions with
LDM/STM sequences operating on more registers, and by using the upper
register half as a spill bank.
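
The cycle arithmetic behind the last two points can be written down as
a small cost model. This is only a sketch: the function names are
hypothetical, and the timings are assumptions (2 cycles per LDR/STR
and 1 cycle per register move as stated in this mail, plus 1 + n
cycles for an LDM/STM of n registers, assuming zero-wait-state
memory).

```c
#include <stdint.h>

/* Assumed Cortex-M0 timings, zero-wait-state memory. */
enum { CYCLES_LDR_STR = 2, CYCLES_MOV = 1 };

/* n values moved with individual LDR or STR instructions. */
uint32_t cycles_single_transfers(uint32_t n)
{
    return n * CYCLES_LDR_STR;
}

/* n values moved with one LDM or STM instruction (assumed 1 + n). */
uint32_t cycles_multiple_transfer(uint32_t n)
{
    return n ? 1u + n : 0u;
}

/* n values parked in or fetched from upper registers with MOV. */
uint32_t cycles_high_reg_moves(uint32_t n)
{
    return n * CYCLES_MOV;
}
```

Under these assumptions, moving four values costs 8 cycles with
individual LDR/STR, 5 cycles with one LDM/STM, and 4 cycles when the
upper registers serve as spill slots.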

Looking at those points, I suppose that the last one could be
implemented in LLVM without too many problems. Basically, the idea is
to use R8, R10, R11, R12 and R13 as temporary spill slots that can be
accessed in only 1 cycle instead of the 2 cycles required for memory
accesses. For our crypto code, we tried hard but in vain to use the
upper registers for anything useful besides spill bank usage.
If LLVM identifies large functions with lots of stack slots, it might
be a good idea to consider adding the upper registers to the spill list
and replacing stack slot accesses with register accesses where
possible.