[PATCH] D25350: [X86] Enable interleaved memory accesses by default

Thu Oct 20 12:06:30 PDT 2016

RKSimon added a comment.

>> 1 - We've lost a number of cases where we had vectorized horizontal reduction clamp + sum patterns. These were typically loading 16 sparse integers as 4 x v4i32 in vpinsrd buildvector sequences and then performing the clamps (pminsd/pmaxsd) + hadd's. These are fully scalarized now.
> 
> That seems fairly bad.
>  Do you have a reproducer? This didn't seem to break our existing horizontal reduction lit tests.

We should be able to create one and address this as a follow up issue.
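
As a starting point for such a reproducer, the source pattern looks roughly like the C++ below. This is only a sketch of the shape of the code, not the actual failing case: the function name clampedSparseSum, the index array and the clamp bounds are made up for illustration, and I haven't checked whether this exact form shows the regression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical reduced form of the pattern: load 16 integers from sparse
// (non-contiguous) indices, clamp each to [lo, hi], and sum the results.
int32_t clampedSparseSum(const int32_t *base, const int32_t (&idx)[16],
                         int32_t lo, int32_t hi) {
  assert(lo <= hi && "clamp range must be non-empty");
  int32_t sum = 0;
  for (int i = 0; i != 16; ++i) {
    int32_t v = base[idx[i]];          // sparse load (vpinsrd buildvector when vectorized)
    v = std::min(std::max(v, lo), hi); // clamp (pmaxsd / pminsd when vectorized)
    sum += v;                          // horizontal reduction
  }
  return sum;
}
```

Compiling something like this for an AVX1 target with and without interleaving enabled should show whether the reduction stays vectorized.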

>> 2 - Where interleaving is kicking in it always uses 256-bit vector types, and the code spends a huge amount of time performing cross-lane shuffles (vextractf128/vinsertf128 etc.).
> 
> Is this AVX or AVX2? I mean, do we get this just because of having to perform integer ops on xmms, or is this just part of the resulting shuffle sequence?

This is AVX1 on a Jaguar CPU, so internally it's a 128-bit ALU that double pumps ymm instructions. It can be sensitive to large amounts of dependent ymm code like this.

>> This should be improvable in the backend with a mixture of more shuffle improvements (PR21281 and PR21138 come to mind)
> 
> It seems like PR21281 was mostly resolved. I'll need to look at PR21138.

I have a possible shuffle patch that covers both of these, but haven't had time to finish it. It's a rewrite of lowerVectorShuffleByMerging128BitLanes that acts a bit like lowerShuffleAsRepeatedMaskAndLanePermute but in reverse (multiple input lane permute followed by repeated mask).
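
To make the "lane permute followed by repeated mask" idea concrete, here is a minimal standalone C++ sketch. It is not the LLVM code and not the unfinished patch. It assumes a simplified case: an 8 x 32-bit shuffle of two 256-bit inputs where each output 128-bit lane reads from exactly one input lane. The struct and function names are hypothetical.

```cpp
#include <array>
#include <cassert>

// Hypothetical model: mask values 0..7 select from the first input, 8..15
// from the second, and -1 means undef. Each 128-bit lane holds 4 elements,
// so the two inputs together provide input lanes 0..3.
struct LanePermuteThenRepeat {
  std::array<int, 2> srcLane;  // input lane feeding each output lane, -1 if unused
  std::array<int, 4> repeated; // in-lane mask shared by both output lanes, -1 if undef
};

// Returns true if the mask can be done as a 128-bit lane permute followed by
// one in-lane mask repeated across both output lanes.
bool decomposeMask(const std::array<int, 8> &mask, LanePermuteThenRepeat &out) {
  out.srcLane.fill(-1);
  out.repeated.fill(-1);
  for (int i = 0; i != 8; ++i) {
    int m = mask[i];
    if (m < 0)
      continue; // undef matches anything
    assert(m < 16 && "mask index out of range");
    int outLane = i / 4, slot = i % 4;
    int inLane = m / 4, offset = m % 4;
    // Every element of an output lane must come from the same input lane.
    if (out.srcLane[outLane] >= 0 && out.srcLane[outLane] != inLane)
      return false;
    out.srcLane[outLane] = inLane;
    // The in-lane offsets must agree between the two output lanes.
    if (out.repeated[slot] >= 0 && out.repeated[slot] != offset)
      return false;
    out.repeated[slot] = offset;
  }
  return true;
}
```

For example, the mask {5, 4, 7, 6, 13, 12, 15, 14} decomposes into input lanes {1, 3} followed by the repeated in-lane mask {1, 0, 3, 2}. The real lowering would also need to handle output lanes that blend two input lanes, which this sketch rejects.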

>> and also possibly splitting a ymm load into 2 if the only use of the load is to extract the low / high xmm subvectors.
> 
> This is a bit weird - I'm not sure I'd expect this to fire in this kind of situation.

Sorry, I meant such a change could fix the regression - and possibly allow a great deal more shuffle folding. It'll be a fine balance as to when to let it fire though.

> In any case, how do you think we can move forward with this? I'd really like to get this in (because of cases like the 60% improvement in denbench), but, obviously, with a minimum amount of regressions. :-)
>  If you can provide reproducers for the CG issues, I'll look into fixing them before enabling this. Otherwise, are you ok with this going on as is? If not, what's the alternative?

Yes, I think I'm happy for this to go ahead. We can work on the regression areas afterward; most can be solved during lowering and are existing issues. It's just that interleaving makes them a little more obvious!

https://reviews.llvm.org/D25350