[PATCH] D24681: Optimize patterns of vectorized interleaved memory accesses for X86.

Thu Oct 6 17:49:10 PDT 2016

Farhana added inline comments.

> DavidKreitzer wrote in X86InterleavedAccess.cpp:146
> Even in the case of only one strided load, I think you will find that the CG will generate good code from your expansion sequence. It is true that 5 of the 8 shuffle instructions you are generating are unnecessary in that case, but those shuffle instructions should be optimized away as dead.
> 
> Try it out, and perhaps add a unit test for this situation, and possibly for some of the other unusual situations like multiple shuffles with the same index.

I know CG will get rid of the other shuffles, I am not talking about this particular case. In general, I think we have to incorporate some concept of cost in order to support any number of accesses.

> delena wrote in X86InterleavedAccess.cpp:89
> inbounds GEP?

Since GEP does not come in a fixed order before the load-instruction, it requires traversing the instructions. I have a plan to add it in the next change-set. I added a todo comment to the code.

Also, vectorizer does not generate inboundsGEP today, so there were not much incentive to add that in this change-set.

> delena wrote in X86InterleavedAccess.cpp:143
> It is not a good name for function. I think that you don't need additional function call here at all.

You are right in the current context. My plan is to define a class to encapsulate all the information and allow data sharing where I will have two main functions one for load generation and the other one for shuffle-generation. In order to keep the follow-up change-set with minimal changes I decided to create a function here. I hope it's ok to do so.

> DavidKreitzer wrote in X86InterleavedAccess.cpp:35
> Please remove this check for Shuffles.size() != 4. I don't think you need it.
> 
> Also, I think Zvi is right that you need to check the load size here. Even in the previous version of the code, there is nothing to guard against the load being larger than you expect. So (as you pointed out), the "unexpected load size" assertion might fire on valid LLVM IR input.
> 
> And in the new version of the code where you are no longer checking that a shuffle exists for each possible Index (0, 1, 2, and 3), there is the oddball possibility that the load is smaller than you expect. Probably what you should be checking here is that LoadSize >= Factor * ShuffleVecSize since that is the size of the expanded set of loads.

Enforcing LoadSize >= Factor * ShuffleVecSize makes the support of any combination of shuffles kind of pointless, It will only support the shuffles starting with 0 and 3. That is the reason I did not want to support it in this change-set.

My plan was to support it completely by generating maskload in the follow-up change-set, also avoid generating unnecessary shuffles.

> RKSimon wrote in x86-interleaved-access.ll:1
> The 'AVX-NEXT'/'AVX1-NEXT'/'AVX2-NEXT' checks - the update script would generate quite a bit more than what is shown below.

If I understand your comment correctly, you are saying the optimization will generate more instructions than it is checking for. Yes, it only checks for the must instructions, because the rest can be optimized away depending on the uses.

> DavidKreitzer wrote in x86-interleaved-access.ll:25
> There should be a vunpckhpd here too, right? Is there a reason you are not checking for it?

Right. The first four are the stepping stones, which guarantee the rest. I did not add it in order to keep the file size small. I know this is going to be populated with lot of tests.

https://reviews.llvm.org/D24681