[PATCH] Add support to recognize non SIMD kind of parallelism in SLPVectorizer

Thu Jun 19 12:27:04 PDT 2014

> On Jun 19, 2014, at 2:23 AM, Karthik Bhat <kv.bhat at samsung.com> wrote:
> 
> Hi Arnold,
> Thanks for the pointers. Updated the cost model for SK_Alternate Shuffles.
> The cost model is as follows:
> 1) In BasicTTI, created a function getAltShuffleOverhead.
> As you mentioned, the conservative cost of a shuffle is the cost of extracting each element from its index in the source vector plus the cost of inserting it into the result vector, summed over all elements of the result vector.
> Since we alternately pick elements from the two vectors, which are of the same type (i.e. element 0 of the 1st vector, element 1 of the 2nd vector, element 2 of the 1st vector, and so on), the function just runs a loop that computes the cost of extracting the element at each index and adds it to the cost of inserting the element at that index.
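
A minimal sketch of the conservative estimate described above, written as standalone C++ rather than against the real BasicTTI interfaces. The name altShuffleOverhead and the two cost callbacks are hypothetical stand-ins for whatever the patch uses to query per-lane extract and insert costs; only the shape of the loop is taken from the description.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callback: cost of extracting (or inserting) the element at
// lane `index`. In the real patch this would come from the target's cost
// hooks; here it is just a function the caller supplies.
using ElementCostFn = std::function<unsigned(unsigned index)>;

// Conservative overhead of an alternate shuffle producing `numElts` lanes.
// Both sources have the same type, so the extract cost for lane i does not
// depend on which of the two sources the lane is taken from.
unsigned altShuffleOverhead(unsigned numElts,
                            const ElementCostFn &extractCost,
                            const ElementCostFn &insertCost) {
  assert(extractCost && insertCost && "cost callbacks must be callable");
  unsigned cost = 0;
  for (unsigned i = 0; i < numElts; ++i) {
    // One extract from a source vector plus one insert into the result.
    cost += extractCost(i) + insertCost(i);
  }
  return cost;
}
```

With a unit cost for every extract and insert, a 4-lane shuffle comes out at 4 * (1 + 1) = 8, which is why the target-specific tables are worth having where a single instruction can do the job.
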
> 
> We have overridden getShuffleCost for X86 and ARM and created two tables, NEONAltShuffleTable and X86AltShuffleTable, with more accurate shuffle costs. The cost here represents the number of instructions required to generate the shuffled vector.
> 
> Please let me know your inputs on this.
> 
> For <8 x i16> as well, we should be generating the correct mask, since the logic to create the mask is generic: we take the vector length (8 in this case) and run a loop from 0 to length, alternately selecting the loop index (i) and lengthOfVector + i. So in this case we will be generating the sequence (0, 8+1, 2, 8+3, 4, 8+5, 6, 8+7), i.e. <0, 9, 2, 11, 4, 13, 6, 15>.
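
The mask rule quoted above can be sketched in a few lines of standalone C++. This is an illustration of the described loop, not the code from the patch; makeAlternateMask is a hypothetical name, and the mask is returned as plain integers rather than LLVM constants.

```cpp
#include <cassert>
#include <vector>

// Build the alternate-shuffle mask for a vector with `vf` lanes.
// Indices 0..vf-1 select from the first source, vf..2*vf-1 from the second,
// so even lanes take element i of the first vector and odd lanes take
// element i of the second vector (encoded as vf + i).
std::vector<unsigned> makeAlternateMask(unsigned vf) {
  assert(vf >= 2 && "a vector needs at least two lanes");
  std::vector<unsigned> mask;
  mask.reserve(vf);
  for (unsigned i = 0; i < vf; ++i)
    mask.push_back(i % 2 == 0 ? i : vf + i);
  return mask;
}
```

For vf = 8 this yields 0, 9, 2, 11, 4, 13, 6, 15, matching the <8 x i16> mask given above; nothing in the loop depends on the element type.
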

I was not worried about the mask generated by the SLP vectorizer - my handwritten mask in the first example I gave was wrong - I was pointing out my mistake :).

For the <8 x i16> case, my comment was just about the cost model side of things: that we should make sure we return reasonable estimates on x86 and ARM, which you have done.

> 
> I tried to test it on a sample test, but it exits before entering buildTree in vectorizeStoreChain as VF > ChainLen. I will try to come up with a working test to check <8 x i16> and <16 x i8>, but I feel we should be generating the correct mask in both these cases as per our logic.

I don’t understand this part. A test for <8 x i16> should just look like one for <4 x float> except for the number of operations (and the integer type).

I think you can remove the tbaa tags since they should not be required for your examples to work.

LGTM.

Thanks for working on this!