[PATCH] D24681: Optimize patterns of vectorized interleaved memory accesses for X86.

Mon Oct 3 13:37:49 PDT 2016

DavidKreitzer added inline comments.

> Farhana wrote in X86InterleavedAccess.cpp:87
> Good suggestion!

Thanks for the fix. I'm okay with the way you did this, but note that by modelling the address as

  %gep3 = getelementptr <4 x double>, <4 x double>* %bc, i32 3

instead of

  %gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3

the size of the underlying memory reference is no longer explicit in the GEP. You can imagine a scenario where this could have a negative impact on subsequent optimization.

> Farhana wrote in X86InterleavedAccess.cpp:146
> What about the case where we only have only one strided_load and generating 4 extracts would be more profitable than generating 8 vector instructions, right?

Even in the case of only one strided load, I think you will find that the CG will generate good code from your expansion sequence. It is true that 5 of the 8 shuffle instructions you are generating are unnecessary in that case, but those shuffle instructions should be optimized away as dead.

Try it out, and perhaps add a unit test for this situation, and possibly for some of the other unusual situations like multiple shuffles with the same index.

> X86InterleavedAccess.cpp:35
> +  if (!SubTarget.hasAVX() || ShuffleVecSize != 256 ||
> +      DL.getTypeSizeInBits(ShuffleEltTy) != 64 || Shuffles.size() != 4 ||
> +      Factor != 4)

Please remove this check for Shuffles.size() != 4. I don't think you need it.

Also, I think Zvi is right that you need to check the load size here. Even in the previous version of the code, there is nothing to guard against the load being larger than you expect. So (as you pointed out), the "unexpected load size" assertion might fire on valid LLVM IR input.

And in the new version of the code where you are no longer checking that a shuffle exists for each possible Index (0, 1, 2, and 3), there is the oddball possibility that the load is smaller than you expect. Probably what you should be checking here is that LoadSize >= Factor * ShuffleVecSize since that is the size of the expanded set of loads.

https://reviews.llvm.org/D24681