[llvm-dev] Load combine pass
David Chisnall via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 29 03:10:13 PDT 2016
On 29 Sep 2016, at 01:25, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
> Hi David,
> David Chisnall via llvm-dev wrote:
> > On 28 Sep 2016, at 16:50, Philip Reames via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >> At this point, my general view is that widening transformations of any kind should be done very late. Ideally, this is something the backend would do, but doing it as a CGP like fixup pass over the IR is also reasonable.
> > I’m really glad to see that this is gone in GVN - it will reduce our
> > diffs a lot when we do the next import. The GVN load widening is not
> > sound in the presence of hardware-enforced spatial memory safety, so
> > we ended up with the compiler inserting things that caused hardware
> > bounds checks to fail and had to disable it in a few places.
> > If you’re reintroducing it, please can we have a backend option to
> > specify whether it’s valid to widen loads beyond the extents (for
> > example, for us it is not valid to widen an i16 load followed by an i8
> > load to an i32 load). Ideally, we’d also want the back end to supply
> Don't you have to mark the functions you generate as
> "Attribute::SanitizeAddress"? We should definitely make the
> speculative form of this transform (i.e. "load i32, load i16" -->
> "load i64") predicated on Attribute::SanitizeAddress.
Nope, we’re not using the address sanitiser. Our architecture supports byte-granularity bounds checking in hardware.
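As a rough illustration of why widening past the extents matters under byte-granularity bounds, here is a toy model in C. It is only a sketch: the function name `access_in_bounds` and the idea of treating an object as a plain (size, offset, width) triple are assumptions made for this example, not a description of how the hardware actually performs the check. An i16 load followed by an i8 load touches three bytes of a three-byte object, whereas the widened i32 load touches four.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of a byte-granularity bounds check: an access of
 * `width` bytes starting at `offset` is valid only if every byte it
 * touches lies inside an object of `obj_size` bytes.
 * Written to avoid overflow in offset + width.
 */
bool access_in_bounds(size_t obj_size, size_t offset, size_t width) {
    return offset <= obj_size && width <= obj_size - offset;
}
```

With a three-byte object, the i16 load at offset 0 and the i8 load at offset 2 both pass this check, while a single four-byte load at offset 0 does not.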
Note that even without this, for pure MIPS code without our extensions, load widening generates significantly worse code than leaving the loads alone. I’m actually finding it difficult to come up with a microarchitecture where a 16-bit load followed by an 8-bit load from the same cache line would perform worse than a 32-bit load, a mask and a shift. In an in-order design, the widened form is more instructions to do the same work, and therefore slower. In an out-of-order design, the two loads within the cache line will likely be dispatched simultaneously, and you’ll have less pressure on the register rename engine.
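To make the comparison concrete, here is a minimal C sketch of the two forms. The `struct rec` layout, the function names, and the explicit padding byte are all invented for illustration; the padding is there only so that the 32-bit read stays inside the object, and the extraction in `sum_wide` assumes a little-endian target. It shows the shape of the code, not a measurement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: a 16-bit field followed by an 8-bit field. */
struct rec {
    uint16_t a;
    uint8_t  b;
    uint8_t  pad; /* exists only so the 32-bit read below is in bounds */
};

static_assert(sizeof(struct rec) == 4, "example assumes a 4-byte layout");

/* Narrow form: two independent loads, no extra ALU work to extract. */
uint32_t sum_narrow(const struct rec *r) {
    uint16_t a = r->a; /* 16-bit load */
    uint8_t  b = r->b; /* 8-bit load  */
    return (uint32_t)a + b;
}

/* Widened form: one 32-bit load, then mask and shift to pull the
 * fields back out. Assumes little-endian byte order. */
uint32_t sum_wide(const struct rec *r) {
    uint32_t w;
    memcpy(&w, r, sizeof w);        /* 32-bit load      */
    uint32_t a = w & 0xFFFFu;       /* mask             */
    uint32_t b = (w >> 16) & 0xFFu; /* shift, then mask */
    return a + b;
}
```

Both functions compute the same value on a little-endian machine, but the widened form trades one load for several dependent ALU operations, which is the extra work described above.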
It seems like something that will only give wins if it exposes optimisation opportunities to other transforms and, as such, should probably be exposed as an analysis so that the later transforms can do the combine if it actually makes sense to do so.