[cfe-dev] [llvm-dev] the as-if rule / perf vs. security

Wed Mar 16 11:46:21 PDT 2016

On 16 Mar 2016, at 18:31, Tim Northover via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
>> I'm less sure if the gaps are on the edges.  I'm worried that you might
>> end up crossing some important address boundary if you look at something
>> earlier or later than what the user requested.
> 
> Oh yes, you certainly can't do it on the edges in most cases. It can
> very easily cross a page boundary and segfault your process.

GVN will widen loads and stores up to a power of two, and we’ve had numerous problems with this.  Our architecture supports byte-granularity memory protection, so this widening will often cause a trap (if you try to access byte 4 of a 3-byte array, then you will get bad behaviour).  Even without this, it often causes us to generate very bad code: two adjacent 32-bit stores are replaced by a single 64-bit store, only now the store is not correctly aligned (the compiler knows only that the stores have 4-byte alignment).
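
To make the store case concrete, here is a minimal C sketch of the two forms.  It is a source-level illustration only, not LLVM output or code from any real tree; the helper names (store_pair_narrow, store_pair_wide, is_aligned) are made up for this example, and memcpy stands in for "one store of this width".

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: what the source asked for. Two 32-bit stores to
 * adjacent words; each store needs only 4-byte alignment. */
static void store_pair_narrow(unsigned char *p, uint32_t first, uint32_t second)
{
    memcpy(p, &first, sizeof first);        /* 4-byte store at p     */
    memcpy(p + 4, &second, sizeof second);  /* 4-byte store at p + 4 */
}

/* Hypothetical helper: the widened form. The two words are packed into
 * one 64-bit value and written with a single 8-byte store, which is
 * only naturally aligned if p happens to be 8-byte aligned. */
static void store_pair_wide(unsigned char *p, uint32_t first, uint32_t second)
{
    uint64_t v;
    memcpy(&v, &first, sizeof first);                        /* low address half  */
    memcpy((unsigned char *)&v + 4, &second, sizeof second); /* high address half */
    memcpy(p, &v, sizeof v);                                 /* 8-byte store at p */
}

/* Returns nonzero if p is aligned to `align` bytes (align must be a
 * power of two). */
static int is_aligned(const void *p, uintptr_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Both helpers leave the same eight bytes in memory, which is why the transformation is legal under the as-if rule.  The difference is that a pointer such as base + 4, where base is 8-byte aligned, satisfies is_aligned(p, 4) but not is_aligned(p, 8), so the wide form becomes an unaligned access on a target that cares.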

On MIPS, the store case isn’t too bad if we tell the compiler that the target doesn’t support unaligned accesses, because it then generates a pair of sdl / sdr instructions (which don’t actually save us anything over a pair of sw instructions, and are slower on some implementations).  If we tell the compiler that we do support unaligned loads and stores (which we’d like to, because they’re sufficiently rare that we get better code in most cases if we trap and emulate the few where the compiler assumed alignment that wasn’t really there) then the widened store traps and we emulate it.  This was particularly fun in the FreeBSD kernel, where the code that triggered this ‘optimisation’ was the code that handled unaligned load and store exceptions…

For the load case, it’s significantly worse, because doing a couple of loads from within the same cache line is very cheap, whereas doing a single load and then some masking is more expensive.  On a simple pipeline, it costs more simply because you have more instructions (and a load from L1 is very cheap).  On a more complex pipeline, you’ve turned a sequence of independent operations into a sequence with a dependency, and so you lose out on ILP.  If the two loads span a cache line boundary (as the two stores did in the case that we saw) then you’re now hitting a *really* slow microcode path on a lot of recent processors, when the naïve code generation would have produced a small number of micro-ops.
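
The shape of the load case can be sketched in C as follows.  Again this is only an illustration under stated assumptions: the function names are invented, memcpy stands in for a single 64-bit load, and a real compiler is free to turn either function into the other, so this shows the dependency structure rather than what any particular backend emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero on a little-endian host. */
static int is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Naive form: two independent 32-bit loads. Neither load depends on
 * the other, so both can issue in parallel. */
static uint32_t sum_narrow(const uint32_t p[2])
{
    uint32_t a = p[0];
    uint32_t b = p[1];
    return a + b;
}

/* Widened form: one 64-bit load, then a mask and a shift to recover
 * the two words. Both extractions depend on the single wide load, and
 * there are more instructions overall. */
static uint32_t sum_wide(const uint32_t p[2])
{
    uint64_t w;
    uint32_t lo_half;
    uint32_t hi_half;
    uint32_t a;
    uint32_t b;

    memcpy(&w, p, sizeof w);                 /* single 8-byte load */
    lo_half = (uint32_t)(w & 0xFFFFFFFFu);   /* masking            */
    hi_half = (uint32_t)(w >> 32);           /* shifting           */

    /* Which half holds p[0] depends on the host byte order. */
    a = is_little_endian() ? lo_half : hi_half;
    b = is_little_endian() ? hi_half : lo_half;
    return a + b;
}
```

The two functions return the same value for any input; the point is only that sum_wide serialises the work behind one load, and that load is the one that can end up straddling a cache line.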

In short, I am very nervous of adding more of these optimisations, because the ones that we do today are not always sensible.

David