[PATCH] D24681: Optimize patterns of vectorized interleaved memory accesses for X86.

Fri Sep 30 09:59:07 PDT 2016

DavidKreitzer requested changes to this revision.
DavidKreitzer added a comment.
This revision now requires changes to proceed.

Hi Farhana,

The overall architecture of the optimization looks good to me, but a few fixes are needed, most notably a fix for the correctness problem related to the ordering of the Shuffles array.

Thanks,
-Dave

> X86InterleavedAccess.cpp:22
> +/// Returns true if the interleaved access group represented by the shuffles
> +/// are supported for the subtarget. Returns false otherwise.
> +static bool isSupported(const X86Subtarget &SubTarget,

are --> is

> X86InterleavedAccess.cpp:87
> +
> +  Value *ScalarBasePtr = Builder.CreateBitCast(
> +    LI->getPointerOperand(), ScalarPtrTy);

Wouldn't it be simpler to create a bitcast to a pointer to an array of the smaller vector type? That would avoid the need for further bitcasts, saving 4 LLVM IR instructions. For example, in the 4 interleaved double case, the bitcast would look like this:

  %bc = bitcast <16 x double>* %ptr to [4 x <4 x double>]*

And then your GEPs would look like this:

  %gep0 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 0
  %gep1 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 1
  %gep2 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 2
  %gep3 = getelementptr [4 x <4 x double>], [4 x <4 x double>]* %bc, i32 0, i32 3

> X86InterleavedAccess.cpp:119
> +  ShuffleMask = makeArrayRef( IntMask3, 4);
> +  Shuffles[3]->replaceAllUsesWith(
> +               Builder.CreateShuffleVector(IntrVec1, IntrVec2, ShuffleMask));

You seem to be making unwarranted assumptions about the order of the Shuffles array. Your code is only correct if

Indices[0] = 3
Indices[1] = 2
Indices[2] = 1
Indices[3] = 0

I see nothing in the interleaved access pass that guarantees this ordering.

A better way to write this function would be to populate a new array of shuffles where NewShuffles[0] corresponds to an index of 0, NewShuffles[1] corresponds to an index of 1, etc. Then you can do the shuffle replacement like this:

  for (unsigned i = 0; i < Shuffles.size(); i++) {
    unsigned Index = Indices[i];
    Shuffles[i]->replaceAllUsesWith(NewShuffles[Index]);
  }

> X86InterleavedAccess.cpp:144
> +  assert(Shuffles.size() == Indices.size() &&
> +                      "Unmatched number of shufflevectors and indices");
> +

The indentation is off here. Can you please run clang-format on this file? To make things easy on your reviewers, please do this in a separate upload without ANY other changes.

> X86InterleavedAccess.cpp:146
> +
> +  SmallSetVector<unsigned, 4> UniqueIndices;
> +  for (unsigned Index : Indices)

Do you really need this test for unique indices? I see no reason not to handle multiple shuffles with the same index.

I also don't see a need for all possible indices to be covered, which you are implicitly requiring by forcing all the indices to be unique and checking that Shuffles.size() == 4 in isSupported. If you remove this requirement, you will automatically handle more cases. For example, suppose we vectorize a loop that accesses only fields a, b, and d of an array of this type of structure:

  struct {
    double a, b, c, d;
  };

Your shuffle sequence will still work well for this case. One of the output shuffles will just go unused.

This suggestion has implications on the isSupported routine. You would have to replace the Shuffles.size() == 4 check with Factor == 4.

https://reviews.llvm.org/D24681