[llvm-dev] Load combine pass

Simon Dardis via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 30 03:01:16 PDT 2016



> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of David
> Chisnall via llvm-dev
> Sent: 30 September 2016 09:18
> To: Sanjoy Das
> Cc: llvm-dev; Sanjoy Das
> Subject: Re: [llvm-dev] Load combine pass
> 
> On 29 Sep 2016, at 18:56, Sanjoy Das <sanjoy at playingwithpointers.com>
> wrote:
> >
> > That makes sense, but what do you think of Artur's suggestion of
> > catching only the obvious patterns?  That is, catching only cases like
> >
> >  i16* ptr = ...
> >  i32 val = ptr[0] | (ptr[1] << 16);
> >
> ==> // subject to endianness
> >
> >  i16* ptr = ...
> >  i32 val = *(i32*) ptr;
> >
> > To me that seems like a win (or at least, not a loss) on any
> > architecture.  However, I will admit that I've only ever worked on x86
> > so I have a lot of blind spots here.
> 
> This is one of the cases that actually is a loss on many MIPS platforms, unless
> ptr[0] is 4-byte aligned.  If it is 2-byte aligned (which is the only place I’ve
> seen this crop up in hot loops of benchmark code, which is the only place
> where I’ve looked seriously to see why we’re generating bad code), we end
> up widening two 2-byte, 2-byte-aligned loads to a single 4-byte, 2-byte-aligned
> load.  The latter is more expensive.
> 
> Even on x86, this holds dynamically: you’ll hit a slower microcode path on a
> lot of x86 implementations if you’re dispatching an unaligned 4-byte load
> than if you dispatch a pair of aligned 2-byte loads.  If the two two-byte loads
> are adjacent and are 4-byte aligned (and therefore within the same cache
> line), then you’ll often see micro-op fusion combine them back into a 4-byte
> load, depending on the surrounding instructions.
> 
> Even then, it’s only a win on x86 if val is the only use of the result.  We were
> widening loads that had independent uses, so we ended up loading into one
> register and then splitting out portions of the result, which put a lot more
> pressure on the register rename unit[1].  This has a decidedly non-linear
> performance outcome: often it’s fine, but if it’s in a loop and it crosses a
> threshold then you see about a 50% performance decrease (on recent x86
> microarchitectures).
> 
> David
> 

To expand on David's point, alignment matters heavily on MIPS. Prior to MIPSR6, an
unaligned load has to be split into two instructions (load word left / load word right).
On MIPSR6 it is a single instruction, but if the load crosses certain boundaries, such as
a cache line or a page, some implementations may raise an address exception, and the
operating system then emulates the faulting load. (MIPSR6 supports unaligned accesses
for user code, but the exact combination of hardware and OS support is implementation
dependent.)
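
As a rough sketch of the pattern in question (the names, the little-endian layout and
the memcpy standing in for the misaligned wide access are assumptions on my part, not
code from the thread, written as C++ rather than IR):

  #include <cstdint>
  #include <cstring>

  // Two naturally aligned 2-byte loads, as written in the source.
  std::uint32_t narrow(const std::uint16_t *p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 16);
  }

  // What a load combine would produce: one 4-byte access whose alignment
  // is only whatever 'p' guarantees (possibly just 2 bytes).
  std::uint32_t widened(const std::uint16_t *p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
  }

  // On pre-R6 MIPS the widened form must be expanded to an lwl/lwr pair
  // when 'p' is not 4-byte aligned; on MIPSR6 a misaligned lw may trap
  // and be fixed up by the OS on some implementations.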

For cases where the widened load is sufficiently aligned but the values have separate
uses, MIPS64 may also need to re-canonicalize the component values. E.g. when combining
two i32 loads into an i64 load: if those i32s are then used arithmetically, bit 31 has to
be replicated into bits 32 to 63. This is a hard ISA requirement for MIPS64, which has no
subregister accesses. (32-bit arithmetic operations on MIPS64 expect a 32-bit value sign
extended to 64 bits; otherwise the result is undefined.) This is also bad for code
density, since the normal 32-bit loads would have performed the sign extension anyway.
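
A small, hypothetical example of that cost; the instructions in the comments are only
indicative of what a MIPS64 compiler might emit:

  #include <cstdint>

  // Two adjacent 32-bit loads feeding 32-bit arithmetic.  Each lw
  // sign-extends its result for free, so both operands are already in
  // the canonical form MIPS64 32-bit arithmetic expects.
  std::int32_t sum_pair(const std::int32_t *p) {
    std::int32_t a = p[0];   // lw   $t0, 0($a0)
    std::int32_t b = p[1];   // lw   $t1, 4($a0)
    return a + b;            // addu $v0, $t0, $t1
  }

  // If a combiner widens this to one 64-bit load, the two halves have to
  // be re-extended first (roughly an sll by 0 for the low half and a
  // dsra32 by 0 for the high half), so the "optimized" form is larger
  // and no faster.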

In summary, I think load combining would need some sort of target cost model or target
profitability query to determine whether combining a given set of loads feeding a given
chain of operations is actually an optimization.
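
For concreteness, one hypothetical shape such a query could take; this is a sketch only,
not an existing TargetTransformInfo hook, and the names and default heuristic are purely
illustrative:

  // Hypothetical profitability hook for a load-combine pass.
  struct LoadCombineCandidate {
    unsigned NumLoads;         // number of loads being merged
    unsigned WideBits;         // width of the combined load, in bits
    unsigned KnownAlignBytes;  // provable alignment of the combined access
    bool HasIndependentUses;   // the pieces are still used separately
  };

  class TargetLoadCombineInfo {
  public:
    virtual ~TargetLoadCombineInfo() = default;
    // Return true only if the merged load is expected to be no slower than
    // the original sequence on this target, including any alignment
    // penalty, re-canonicalization code and extra register pressure.
    virtual bool isLoadCombineProfitable(const LoadCombineCandidate &C) const {
      return C.KnownAlignBytes * 8 >= C.WideBits && !C.HasIndependentUses;
    }
  };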

Thanks,
Simon

> [1] A few years ago, I would have said cache usage was the most important
> thing to optimise for as a first pass.  With modern superscalar chips, register
> rename pressure is now even more important.  This was really driven home
> to me a few months ago when I made a loop run twice as fast on the two
> most recent Intel microarchitectures by inserting an xor %rax, %rax at the top
> - explicitly killing the false dependency between loop iterations made more
> of a difference than all of the other optimisations that we’d done, combined.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

