[PATCH] D25350: [X86] Enable interleaved memory accesses by default

Thu Oct 20 11:14:33 PDT 2016

mkuper added a comment.

Thanks for investigating this, Simon!

In https://reviews.llvm.org/D25350#575481, @RKSimon wrote:

> 1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.

That seems fairly bad.
Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

> 2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).

Is this AVX or AVX2? That is, do we get this only because integer ops have to be performed on xmm registers, or is it inherent in the resulting shuffle sequence?

> This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)

It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

> and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.

This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with as few regressions as possible. :-)
If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you OK with this going in as is? If not, what's the alternative?

https://reviews.llvm.org/D25350