[PATCH] D25350: [X86] Enable interleaved memory accesses by default

Thu Oct 20 09:01:24 PDT 2016

RKSimon added a comment.

In https://reviews.llvm.org/D25350#574974, @mkuper wrote:

> Simon, any news on your end?

Looking through the before and after code, we're seeing 2 types of diff:

1 - We've lost a number of cases where we previously vectorized horizontal reduction clamp + sum patterns. These typically loaded 16 sparse integers as 4 x v4i32 via vpinsrd buildvector sequences and then performed the clamps (pminsd/pmaxsd) and hadds. These are now fully scalarized.

2 - Where interleaving kicks in, it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.). This should be fixable in the backend with a mix of further shuffle improvements (PR21281 and PR21138 come to mind) and possibly by splitting a ymm load into 2 xmm loads when the load's only uses are extracting the low / high xmm subvectors.
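
For illustration, the two source patterns above might look roughly like the following. This is only a sketch under assumptions: it is not the code that was actually compared, and the function names, the stride parameter and the clamp bounds are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical example of case 1: load 16 sparse (strided) integers,
// clamp each one to [lo, hi], and sum the results (a horizontal reduction).
int32_t clamp_sum(const int32_t *src, int stride, int32_t lo, int32_t hi) {
  int32_t sum = 0;
  for (int i = 0; i < 16; ++i) {
    int32_t v = src[i * stride];        // sparse load
    v = std::min(std::max(v, lo), hi);  // clamp (max, then min)
    sum += v;                           // reduction
  }
  return sum;
}

// Hypothetical example of case 2: a stride-2 loop over interleaved (a, b)
// pairs, the kind of access pattern an interleaved load group can cover.
void sum_pairs(const int32_t *in, int32_t *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = in[2 * i] + in[2 * i + 1];  // even and odd elements
  }
}
```

Compiling something like this with and without the patch and comparing the generated assembly is one way to check whether a given loop hits either of the two cases.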

https://reviews.llvm.org/D25350