[PATCH] D25350: [X86] Enable interleaved memory accesses by default
Simon Pilgrim via llvm-commits
llvm-commits at lists.llvm.org
Thu Oct 20 09:01:24 PDT 2016
RKSimon added a comment.
In https://reviews.llvm.org/D25350#574974, @mkuper wrote:
> Simon, any news on your end?
Looking through the before and after code, we're seeing 2 types of diff:
1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadds. These are now fully scalarized.
2 - Where interleaving is kicking in, it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.). This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind) and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.
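For anyone wanting to poke at this, the two source shapes look roughly like the sketch below. It is a reduced illustration, not the actual benchmark code: the function names, the stride parameter and the clamp bounds are all made up for the example, and whether a given compiler build vectorizes either loop will depend on the target flags and cost model.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Diff type 1 (hypothetical reduction): 16 sparse integer loads,
// each clamped to [lo, hi] and then summed. The stride is an
// assumption standing in for whatever makes the loads non-contiguous.
int32_t clamp_sum_sparse(const int32_t *src, std::size_t stride,
                         int32_t lo, int32_t hi) {
  int32_t sum = 0;
  for (std::size_t i = 0; i != 16; ++i) {
    int32_t v = src[i * stride];
    v = std::min(std::max(v, lo), hi); // max/min pair forms the clamp
    sum += v;                          // horizontal sum of the clamped values
  }
  return sum;
}

// Diff type 2 (hypothetical reduction): a stride-2 interleaved group.
// Both members of each pair are read, so the two strided loads can be
// served by one wide load plus shuffles to separate even/odd elements.
void sum_pairs(const int32_t *in, int32_t *out, std::size_t n) {
  for (std::size_t i = 0; i != n; ++i)
    out[i] = in[2 * i] + in[2 * i + 1];
}
```

Building something like this with AVX2 enabled and comparing the assembly with and without the patch should make it easier to see whether the clamps stay as pminsd/pmaxsd and how many vextractf128/vinsertf128 ops the interleaved loop picks up.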