[llvm-dev] Load combine pass

David Chisnall via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 30 01:18:09 PDT 2016

On 29 Sep 2016, at 18:56, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
> That makes sense, but what do you think of Artur's suggestion of
> catching only the obvious patterns?  That is, catching only cases like
>  i16* ptr = ...
>  i32 val = ptr[0] | (ptr[1] << 16);
> ==> // subject to endianness
>  i16* ptr = ...
>  i32 val = *(i32*) ptr;
> To me that seems like a win (or at least, not a loss) on any
> architecture.  However, I will admit that I've only ever worked on x86
> so I have a lot of blind spots here.

This is one of the cases that actually is a loss on many MIPS platforms, unless ptr[0] is 4-byte aligned.  If it is only 2-byte aligned (which is where I’ve seen this pattern crop up in the hot loops of benchmark code, which in turn is the only place I’ve looked seriously at why we’re generating bad code), we end up widening two 2-byte-aligned 2-byte loads into a single 2-byte-aligned 4-byte load, and the latter is the more expensive sequence.
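The quoted pattern and its widened form can be written as a small C sketch (the helper names are mine, not from any patch; memcpy is used so the wide access stays well-defined even at 2-byte alignment, which is exactly the case that regresses on MIPS):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Before: two 2-byte loads combined with shift-and-or (little-endian layout). */
static uint32_t combine_narrow(const uint16_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 16);
}

/* After: one 4-byte load.  memcpy keeps this legal when p is only 2-byte
 * aligned -- and that is the case where MIPS must expand it into an
 * unaligned-load sequence costing more than the two aligned loads did. */
static uint32_t combine_wide(const uint16_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

On a little-endian target the two forms agree; on big-endian the combine would need the opposite shift, which is the “subject to endianness” caveat in the quoted example.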

Even on x86, the cost shows up dynamically: on a lot of x86 implementations, dispatching an unaligned 4-byte load hits a slower microcode path than dispatching a pair of aligned 2-byte loads.  If the two 2-byte loads are adjacent and 4-byte aligned (and therefore within the same cache line), then you’ll often see micro-op fusion combine them back into a single 4-byte load, depending on the surrounding instructions.

Even then, it’s only a win on x86 if val is the only use of the result.  We were widening loads that had independent uses, so we ended up loading into one register and then splitting out portions of the result, which put a lot more pressure on the register rename unit[1].  This has a decidedly non-linear performance outcome: often it’s fine, but if it’s in a loop and it crosses a threshold then you see about a 50% performance decrease (on recent x86 microarchitectures).
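A minimal sketch of the independent-uses case (illustrative code, not the original benchmark): once the loads are combined, every consumer of a half must unpack it from the wide register, adding shift/mask work and keeping an extra value live:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Each half feeds its own use: two cheap 2-byte loads, no unpacking. */
static uint32_t sum_halves_narrow(const uint16_t *p) {
    return (uint32_t)p[0] + (uint32_t)p[1];
}

/* Widened form: one 4-byte load, then shift/mask to split the halves
 * back out.  The wide value stays live across both uses, which is the
 * extra register-rename pressure described above. */
static uint32_t sum_halves_wide(const uint16_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return (v & 0xFFFFu) + (v >> 16);
}
```

(The sum is the same on either endianness, since addition commutes; the extra instructions in the widened form are the point.)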


[1] A few years ago, I would have said cache usage was the most important thing to optimise for as a first pass.  With modern superscalar chips, register rename pressure is now even more important.  This was really driven home to me a few months ago when I made a loop run twice as fast on the two most recent Intel microarchitectures by inserting an xor %rax, %rax at the top - explicitly killing the false dependency between loop iterations made more of a difference than all of the other optimisations that we’d done, combined.
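A sketch of the kind of fix the footnote describes (the loop and names are hypothetical, not the actual benchmark): sete writes only the low byte of its destination, so without the zeroing xor the rest of the register carries a stale value — and hence a false dependency — from the previous iteration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hot loop: count bytes equal to `key`. */
static uint64_t count_eq(const uint8_t *p, size_t n, uint8_t key) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
        uint64_t hit;
        __asm__("xorl %k0, %k0\n\t"  /* zero the whole register first:
                                        kills the loop-carried false
                                        dependency on its old value    */
                "cmpb %2, %1\n\t"
                "sete %b0"           /* writes only the low byte       */
                : "=&r"(hit)
                : "r"(p[i]), "r"(key)
                : "cc");
        total += hit;
#else
        total += (p[i] == key);      /* portable fallback */
#endif
    }
    return total;
}
```

Without the xor, each iteration’s sete has a read-after-write hazard on the previous iteration’s full-width register value even though the code never uses it.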
