[LLVMdev] TableGen syntax for matching a constant load

Sat Feb 26 19:10:25 PST 2011

On Sat, Feb 26, 2011 at 06:40:19PM -0800, Chris Lattner wrote:
> 
> On Feb 26, 2011, at 6:12 PM, Jakob Stoklund Olesen wrote:
> 
> >> 
> >> All these patterns have one important downside. They are suboptimal if
> >> more than one store happens in a row. E.g. the 0 store is better
> >> expressed as xor followed by two register moves, if a register is
> >> available... This is most noticeable when memset() gets inlined.
> > 
> > Note that LLVM's -Os option does not quite mean the same as GCC's flag.
> > It disables optimizations that increase code size without a clear performance gain.
> > It does not try to minimize code size at any cost.
> 
> Jakob is right, but there is a clear market for "smallest at any cost".
> The FreeBSD folks would really like to build their bootloader with
> clang for example :).

Yes, I have the same problem for NetBSD. All but two of the boot loaders
are working. One is currently too large by less than 300 bytes, the
other by 800.

> It should be reasonably easy to add a new "optsize2" function attribute
> to LLVM IR, and have that be set with -Oz (the "optimize for size at
> any cost") flag, which could then enable stuff like this.
> 
> There are lots of other cases where this would be useful, such as
> forced use of "rep; movsb" on x86, which is much smaller than a call
> to memset, but also much slower :).

Agreed. From studying GCC's peephole optimisation list and the
assembler code, I see the following candidates for space saving:

Turning setcc followed by movzbl into xor followed by setcc. This is
#8785 and a general optimisation. I have seen enough code that would
profit from this.
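
As a rough illustration of that rewrite, here is a toy peephole pass in
Python. The tuple representation and the name fold_setcc_movzbl are made
up for this sketch and have nothing to do with how LLVM would do it. The
xor has to be hoisted above the compare because it clobbers the flags,
and the compare must not read %eax. xorl %eax, %eax is 2 bytes against 3
for movzbl %al, %eax, so each hit saves one byte.

```python
# Toy peephole pass; instructions are (mnemonic, operands) tuples in AT&T order.
# Hypothetical representation, only %al/%eax is handled to keep the sketch short.
EAX_ALIASES = ("%al", "%ah", "%ax", "%eax", "%rax")


def fold_setcc_movzbl(insns):
    out = []
    i = 0
    while i < len(insns):
        window = insns[i:i + 3]
        if len(window) == 3:
            (cmp_op, cmp_args), (set_op, set_args), (ext_op, ext_args) = window
            reads_eax = any(a in arg for arg in cmp_args for a in EAX_ALIASES)
            if (cmp_op in ("cmpl", "testl")
                    and set_op.startswith("set")
                    and set_args == ("%al",)
                    and ext_op == "movzbl"
                    and ext_args == ("%al", "%eax")
                    and not reads_eax):
                # Zero the full register first, then compare, then set the low byte.
                out.append(("xorl", ("%eax", "%eax")))
                out.append((cmp_op, cmp_args))
                out.append((set_op, set_args))
                i += 3
                continue
        out.append(insns[i])
        i += 1
    return out
```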

The optimised memory set from this thread. Assigning a scratch register
when multiple instructions want to use the same 32bit immediate would
be useful in other cases too.
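
To put numbers on the trade-off, a small back-of-the-envelope sketch in
Python. It assumes 32bit mode, a base register other than %esp and an
8bit displacement; the helper names are invented for this example.

```python
# Encoded sizes in bytes, assuming 32bit mode, disp8 addressing, base != %esp.
MOV_IMM32_TO_MEM = 7   # C7 /0, ModRM, disp8, imm32
MOV_REG_TO_MEM = 3     # 89 /r, ModRM, disp8
XOR_REG_REG = 2        # 31 /r, ModRM
MOV_IMM32_TO_REG = 5   # B8+r, imm32


def direct_size(stores):
    """Every store carries its own 32bit immediate."""
    return stores * MOV_IMM32_TO_MEM


def scratch_size(stores, value_is_zero):
    """Materialise the value once in a scratch register, then store the register."""
    setup = XOR_REG_REG if value_is_zero else MOV_IMM32_TO_REG
    return setup + stores * MOV_REG_TO_MEM


for n in (1, 2, 4):
    print(n, direct_size(n), scratch_size(n, True), scratch_size(n, False))
```

With these sizes the scratch register wins for zero even with a single
store, and for an arbitrary 32bit immediate from the second store on,
provided a register is free.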

Function prologue and epilogue can often be optimised by adjusting %esp
or %rsp using push/pop. E.g. in 32bit mode, "addl $4, %esp" and "addl
$8, %esp" are more compact as one or two pops into a scratch register.
Push and pop are also a fast path on most CPUs. The same applies to
subtracting and to 64bit mode. Generally, using push/pop for stack
manipulation would be much nicer for code size, but would require
extensive changes to the code generator. I think this accounts for most
of the reason why GCC creates smaller binaries.
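
A minimal sketch of the selection logic in Python, just to show where
the break-even point is. The function name and the choice of %ecx/%rcx
as scratch register are assumptions for the example; a real
implementation would have to prove that the scratch register is dead.

```python
def release_stack(nbytes, bits=32):
    """Pick the smaller way to free nbytes of stack (sketch, scratch reg assumed dead)."""
    word = bits // 8
    if bits == 32:
        suffix, sp, scratch, rex = "l", "%esp", "%ecx", 0
    else:
        suffix, sp, scratch, rex = "q", "%rsp", "%rcx", 1
    # add with imm8 is 3 bytes, with imm32 6 bytes, plus a REX prefix for %rsp.
    add_size = rex + (3 if nbytes <= 127 else 6)
    pops, rest = divmod(nbytes, word)
    # Each pop into a legacy register encodes in a single byte.
    if rest == 0 and 0 < pops < add_size:
        return [f"pop{suffix} {scratch}"] * pops
    return [f"add{suffix} ${nbytes}, {sp}"]


print(release_stack(8))            # two pops, 2 bytes instead of 3
print(release_stack(12))           # three pops would not be smaller, keep the add
print(release_stack(24, bits=64))  # three pops, 3 bytes instead of 4
```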

Using cmp/test against a constant before a conditional branch can often
be optimised if the register is dead. For cmp, a check against -1 or 1
can be replaced with inc/dec and inverting the condition. This saves 2
bytes in 32bit mode and 1 byte in 64bit mode. It applies generally at
all optimisation levels. Compares against 8bit signed immediates for
32bit / 64bit registers can be expressed as add or sub, saving 2 bytes
in all cases.
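
For the inc/dec case, a deliberately narrow sketch in Python. It only
handles je/jne, where the zero flag after dec/inc is the same as after
the cmp, so the branch can stay as it is; other conditions are left
alone here. The function name and the string representation are
invented for the example, and the caller is assumed to know whether the
register is dead after the compare.

```python
def rewrite_cmp(imm, reg, branch, target, reg_is_dead):
    """Replace cmpl $1 / $-1 with decl / incl when the register is dead (sketch)."""
    original = [f"cmpl ${imm}, {reg}", f"{branch} {target}"]
    # dec/inc overwrite the register, so it must not be read afterwards.
    if not reg_is_dead or branch not in ("je", "jne"):
        return original
    if imm == 1:
        return [f"decl {reg}", f"{branch} {target}"]   # reg - 1 == 0 iff reg == 1
    if imm == -1:
        return [f"incl {reg}", f"{branch} {target}"]   # reg + 1 == 0 iff reg == -1
    return original


print(rewrite_cmp(1, "%eax", "je", ".L1", True))
print(rewrite_cmp(-1, "%eax", "jne", ".L2", True))
print(rewrite_cmp(1, "%eax", "je", ".L1", False))
```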

Joerg