[PATCH] D30200: [SLP] Fix for PR31880: shuffle and vectorize repeated scalar ops on extracted elements

Thu Aug 10 00:01:33 PDT 2017

dorit added a comment.

Hi Alexey,

With this patch I'm seeing cases where SLP no longer vectorizes code that was vectorized before.
SLP detects that it is possible that the extracts will be optimized into a shuffle,
so it uses the cost of the shuffle in its profitability evaluation.
However the cost of the shuffle comes out very high, so SLP is abandoned…

In the specific case at hand the scenario is the following:
With the patch, SLP is asking about getShuffleCost(Permute_Single_Src, 16xi64).
So for AVX2 we fall back to return 12*Base::getShuffleCost(Permute_Two_Src,4xi64),
and since that cost is not modeled this in turn returns 12*getPermuteShuffleOverhead, which comes out to = 12*8 = 96 ….
In the end the overall cost of the shuffle (after subtracting 16 inserts) is 80, 
compared to the cost of the sequence before the patch, which was 16 (using the result of getGatherCost).
96 is certainly an excessive cost, and should be fixed in the target. But regardless...:

I wonder if it may make sense to also have SLP examine both costs (gatherCost as before and ShuffleCost) and select the minimum one?

Another concern is that *maybe* SLP is not asking the exact right question when it calls getShuffleCost(Permute_Single_Src, 16xi64); 
Apparently InstCombine does reach the decision to optimize the sequence into shuffles, so this may suggest that the question that SLP asks TTI when it tries to predict what instCombine will do, does not match the question that InstCombine will ask TTI...?

In any case, I'm working on fixing the cost that X86TTI returns, which fixes (or works around?) the problem for AVX2, but not yet for SSE2 on Silvermont (still under investigation).
But maybe the other directions of fixes in the SLPVectorizer suggested above should be considered as well...

Thanks,
Dorit

Repository:
  rL LLVM

https://reviews.llvm.org/D30200