[LLVMdev] [RFC] [ARM] v6m: Suggestions for a slightly different set of default optimizer settings.
Bjoern Haase
bjoern.m.haase at web.de
Sun Jan 11 11:35:43 PST 2015
Hello to all.
Judging from forums and mailing lists, it seems to me that llvm usage
for very small arm v6m targets is not very common.
Over the last few months, I have spent some time analyzing the
performance of llvm/clang for very small targets. My main objective was
to get the best possible performance from portable (non-assembly) crypto
numerics for cortex-M0(+) targets.
The result (the crypto paper is under review and not yet published) was
that llvm performed best, outperforming gcc and different versions of
armcc by a *very* significant factor.
In this mail I would like to summarize some of my results. Based on my
analysis, I am convinced that LLVM can provide an excellent solution
for small bare-metal targets as well, with only a few changes to the
default settings.
Before suggesting new configurations for v6m, as a first step I'd like
to suggest a definition of the "typical arm v6m system" and its
optimization priorities from my perspective.
In my opinion the most important v6m targets will be cortex M0 systems.
If somebody is using an M0 and not an M3/M4, the system will be either
very *low cost* or very *low power*; otherwise thumb2 would be
available. Low cost and low power mean, first of all, "little RAM",
because of the silicon area and the leakage currents.
I assume that the vast majority of these systems will be flash-based
or ROM-based microcontrollers. I have worked on quite a lot of these
systems. In all such projects where I have been involved, the main
bottleneck was RAM and not program memory.
This is the reason why my first patch for llvm deals with tail call
optimization for thumb1 and its possible benefit with respect to stack
usage.
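As a hypothetical illustration of why tail calls matter for stack usage
(the function names below are made up for this sketch and are not taken
from the patch), compare a recursion whose call is not in tail position
with one whose call is. In the second form the caller has nothing left
to do after the call, so a compiler that supports tail call optimization
may replace the call with a branch and reuse the current stack frame.
Whether it actually does so depends on the target and the optimization
level, and an optimizer may also rewrite the first form on its own.

```c
#include <stdint.h>

/* Hypothetical example, not from the original post. */

/* Call is NOT in tail position: the addition happens after the
 * recursive call returns, so without further transformation each
 * recursion level keeps its own stack frame alive. */
uint32_t sum_plain(uint32_t n)
{
    if (n == 0)
        return 0;
    return n + sum_plain(n - 1);
}

/* Call IS in tail position: the result of the recursive call is
 * returned unchanged, so the call can be turned into a jump and the
 * stack depth can stay constant. */
uint32_t sum_tail(uint32_t n, uint32_t acc)
{
    if (n == 0)
        return acc;
    return sum_tail(n - 1, acc + n);
}
```

Both functions compute the same sum; for example, sum_plain(10) and
sum_tail(10, 0) both yield 55. Only the stack depth needed to get there
may differ.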
After optimizing for RAM usage, code size and speed are, in my opinion,
equally important second goals for such small embedded targets.
Since speed often translates more or less directly into power, I think
we should not aggressively focus on code size only.
In practice, however, my observation with crypto code was that
optimizing for size with -Os not only reduced code size but also gave
significantly better speed than -O2 or -O3 on the cortex-M0. I did
not figure out exactly why, but I assume that some optimizations meant
to improve ILP scheduling at -O3 in fact pessimize code for the
cortex-M0.
In my observation of crypto and signal processing code, the main
bottlenecks of v6m are the slow memory interface and the high register
pressure due to effectively only 8 usable registers. Register spills
are thus *very* expensive. Using a frame pointer effectively reduces the
register set to only 7 registers, requiring even more spills. This is
extremely costly for thumb1 targets.
So, as a first suggestion, I would like to propose enabling frame
pointer elimination by default in clang for v6m as soon as any
optimization level is chosen. With trunk clang, frame pointer
elimination currently seems to be deactivated, even with -O3 or -Os. Was
there a specific reason not to activate this by default?
Given the extreme register pressure, my analysis of the crypto
code performance showed that instruction scheduling is extremely
important. I observed that LLVM did a significantly better job when
using the -pre-RA-sched=source option. This was the main factor that
allowed me to make LLVM outperform gcc by almost a factor of 2.
Unfortunately, this option is not available at the clang interface level,
and I had to explicitly generate intermediate bitcode files. I think
it would help a lot to expose this feature at the clang level or to make
it part of the default optimization settings for -Os, -O2 and -O3.
I observed that with today's head version, -pre-RA-sched appears to be
hidden from the end user. At first glance, -misched=ilpmin
-enable-misched -misched-regpressure gave similar results. I would
like to suggest using them (or a similar configuration) as the default
for clang at higher optimization levels, provided, of course, that these
passes can be considered stable.
As a last point, it would be helpful to enable clang to do the linking
of the code itself. I did not manage to do this at first and have so far
used gcc for the final linking. For the target's include paths,
command line switches are available. I have not yet found out (and have
not yet spent much time on) how to get linking running. Perhaps it would
be best to start directly with binutils-gold instead of binutils-ld?
To summarize, I am convinced that, with the above issues resolved, LLVM
will be an excellent choice for the very smallest targets as well.
Thanks to the LLVM community for doing an excellent job.
Yours,
Björn
P.S.: Some aspects related to compile time:
Bearing in mind that typical armv6m targets will have 32k of
program memory, I expect that embedded software developers will be
willing to tolerate much longer compilation times. Maybe there are
expensive options that I am not aware of and that are currently not
activated by default because of their compile-time cost.
P.P.S.: Ideas for further code improvements of LLVM for small targets:
When comparing our hand-coded assembly version with the best
compiler-generated version, we observed a speed gain of almost a factor
of 2. It might be interesting to find out where the biggest weaknesses
of the compiler-generated code were, in order to identify points for
improvement.
For the most important v6m systems, cortex M0 / M0+, the main speed
bottlenecks were register pressure and the slow (2-cycle) memory
accesses. Besides special tricks, the asm version gained speed
by changing internal calling conventions (no callee-saved registers,
all registers saved by the caller), by replacing individual LDR/STR
instructions with LDM/STM sequences operating on several registers, and
by using the upper register half as a spill bank.
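To make the LDR versus LDM point concrete, here is a minimal cost
sketch. It assumes the commonly documented Cortex-M0 timings of 2 cycles
per single LDR and 1+N cycles for an LDM of N registers with zero wait
states. These figures and the function names are assumptions of this
sketch and are not measurements from the work described above.

```c
#include <stdint.h>

/* Assumed timings (zero wait states): LDR = 2 cycles,
 * LDM of N registers = 1 + N cycles. Hypothetical helper names. */

/* Cycles to load nregs registers with one LDR each. */
uint32_t cycles_individual_ldr(uint32_t nregs)
{
    return 2u * nregs;
}

/* Cycles to load nregs registers with a single LDM. */
uint32_t cycles_single_ldm(uint32_t nregs)
{
    return (nregs == 0) ? 0u : 1u + nregs;
}
```

Under these assumptions, loading four registers costs 8 cycles with
individual LDRs and 5 cycles with one LDM, while for a single register
both forms cost 2 cycles.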
Looking at those points, I suppose that the last one might be
implemented in LLVM without too many problems. Basically, the idea is
to use R8,R10,R11,R12 and R13 as temporary spill slots that can be
accessed in only 1 cycle instead of the 2 cycles required for memory
accesses. For our crypto code, we tried hard, but in vain, to use the
upper registers for anything useful besides a spill bank.
If llvm identifies large functions with lots of stack slots, it might be
a good idea to consider adding the upper registers to the spill list and
replacing stack slot accesses with register accesses where possible.
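The potential gain of such a spill bank can be estimated with a simple
cost model. The sketch below uses only the figures quoted above (2
cycles per stack access, 1 cycle per move to or from an upper register,
five upper registers). The function names and the uniform number of
accesses per slot are simplifying assumptions of this sketch and do not
describe an existing LLVM mechanism.

```c
#include <stdint.h>

/* Figures taken from the text; everything else is hypothetical. */
enum {
    CYCLES_STACK_ACCESS = 2, /* load or store of a stack slot      */
    CYCLES_UPPER_MOVE   = 1, /* move to or from an upper register  */
    UPPER_SPILL_REGS    = 5  /* number of upper registers listed   */
};

/* Spill cost if every spilled value lives in a stack slot. */
uint32_t spill_cycles_stack_only(uint32_t slots, uint32_t accesses)
{
    return slots * accesses * CYCLES_STACK_ACCESS;
}

/* Spill cost if up to UPPER_SPILL_REGS values are kept in upper
 * registers and only the remainder goes to the stack. */
uint32_t spill_cycles_upper_bank(uint32_t slots, uint32_t accesses)
{
    uint32_t in_regs = (slots < UPPER_SPILL_REGS) ? slots
                                                  : UPPER_SPILL_REGS;
    uint32_t on_stack = slots - in_regs;

    return in_regs * accesses * CYCLES_UPPER_MOVE
         + on_stack * accesses * CYCLES_STACK_ACCESS;
}
```

For example, with 8 spilled values that are each accessed 4 times, the
model gives 64 cycles for the stack-only variant and 44 cycles with the
upper-register spill bank.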