[PATCH] D24681: Optimize patterns of vectorized interleaved memory accesses for X86.

Fri Sep 30 14:03:18 PDT 2016

Farhana added inline comments.

> DavidKreitzer wrote in X86InterleavedAccess.cpp:87
> Wouldn't it be simpler to create a bitcast to a pointer to an array of the smaller vector type? That would avoid the need for further bitcasts, saving 4 LLVM IR instructions. For example, in the 4 interleaved double case, the bitcast would look like this:
> 
>   %bc = bitcast <16 x double>* %ptr to [4 x <4 x double>]*
> 
> And then your GEPs would look like this:
> 
>   %gep0 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 0
>   %gep1 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 1
>   %gep2 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 2
>   %gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3

Good suggestion!

> DavidKreitzer wrote in X86InterleavedAccess.cpp:119
> You seem to be making unwarranted assumptions about the order of the Shuffles array. Your code is only correct if
> 
> Indices[0] = 3
> Indices[1] = 2
> Indices[2] = 1
> Indices[3] = 0
> 
> I see nothing in the interleaved access pass that guarantees this ordering.
> 
> A better way to write this function would be to populate a new array of shuffles where NewShuffles[0] corresponds to an index of 0, NewShuffles[1] corresponds to an index of 1, etc. Then you can do the shuffle replacement like this:
> 
>   for (unsigned i = 0; i < Shuffles.size(); i++) {
>     unsigned Index = Indices[i];
>     Shuffles[i]->replaceAllUsesWith(NewShuffles[Index]);
>   }

Well, vectorizer guarantees the order, but certainly I should not have relied on that since any other optimizations can mess up the order.

> DavidKreitzer wrote in X86InterleavedAccess.cpp:146
> Do you really need this test for unique indices? I see no reason not to handle multiple shuffles with the same index.
> 
> I also don't see a need for all possible indices to be covered, which you are implicitly requiring by forcing all the indices to be unique and checking that Shuffles.size() == 4 in isSupported. If you remove this requirement, you will automatically handle more cases. For example, suppose we vectorize a loop that accesses only fields a, b, and d of an array of this type of structure:
> 
>   struct {
>     double a, b, c, d;
>   };
> 
> Your shuffle sequence will still work well for this case. One of the output shuffles will just go unused.
> 
> This suggestion has implications on the isSupported routine. You would have to replace the Shuffles.size() == 4 check with Factor == 4.

What about the case where we only have only one strided_load and generating 4 extracts would be more profitable than generating 8 vector instructions, right?

> RKSimon wrote in x86-interleaved-access.ll:1
> Is this actually true? The checks below don't look like what the script would generate.

Hi Simon,
I am not sure whether I understand your concern. Which checks are you talking about?
Farhana

https://reviews.llvm.org/D24681