[PATCH] D36960: [X86][LLVM]Expanding Supports lowerInterleavedLoad() in X86InterleavedAccess (VF{8|16|32} stride 3).

Wed Aug 23 11:15:44 PDT 2017

zvi added inline comments.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:169
+  Value *VecBasePtr;
+  unsigned int NumOfLoad = NumSubVectors;
+  if (DL.getTypeSizeInBits(VecTy) == 768) {
----------------
NumOfLoads or NumLoads

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:346
+//  {<[0|3|6|1|4|7|2|5]-[8|11|14|9|12|15|10|13]>}
+static void createVPShuftkMask(int VF, int Stride, int VectorWide,
+                               SmallVectorImpl<uint32_t> &Mask) {
----------------
Please check function name

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:349
+  int Lane = (VectorWide / 128 > 0) ? VectorWide / 128 : 1;
+  for (int LaneCount = 0; LaneCount < Lane; LaneCount++) {
+    for (int i = 0; i < VF / Lane; ++i) {
----------------
drop the brackets?

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:349
+  int Lane = (VectorWide / 128 > 0) ? VectorWide / 128 : 1;
+  for (int LaneCount = 0; LaneCount < Lane; LaneCount++) {
+    for (int i = 0; i < VF / Lane; ++i) {
----------------
zvi wrote:
> drop the brackets?
Maybe swap names of variables Lane and LaneCount?

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:350
+  for (int LaneCount = 0; LaneCount < Lane; LaneCount++) {
+    for (int i = 0; i < VF / Lane; ++i) {
+      Mask.push_back((i * Stride) % (VF / Lane) + (VF / Lane) * LaneCount);
----------------
int i = 0, e = VF / Lane; i != e; ++i

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:359
+//  mask[index]%Stride. Where '%' is a modulo operator. We're interested of the
+//  wide of the group. The Group defines the interesting group. For example, the
+//  interested group is marked between '[...]'
----------------
wide -> width or size

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:363
+//  groupShiftAmount(3,1,{0|3|6|[1|4|7]|2|5}) -> 3
+static int groupShiftAmount(unsigned Stride, int Group,
+                            SmallVectorImpl<uint32_t> &Mask) {
----------------
The return value may indeed by used as a shift amount, but if this function returns the element count of the i-th group, maybe better name it something like 'getGroupElemCount'

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:364
+static int groupShiftAmount(unsigned Stride, int Group,
+                            SmallVectorImpl<uint32_t> &Mask) {
+  int Index, Count = 0, GroupWide = 0;
----------------
ArrayRef<uint32_t> Mask

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:365
+                            SmallVectorImpl<uint32_t> &Mask) {
+  int Index, Count = 0, GroupWide = 0;
+  int GroupNum = Mask[Mask.size() - 1] % Stride;
----------------
Initialize Index here

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:365
+                            SmallVectorImpl<uint32_t> &Mask) {
+  int Index, Count = 0, GroupWide = 0;
+  int GroupNum = Mask[Mask.size() - 1] % Stride;
----------------
zvi wrote:
> Initialize Index here
GroupWide -> GroupWidth

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:368
+
+  for (Index = Mask.size() - 1; Index >= 0; Index--) {
+    if (GroupNum != Mask[Index] % Stride) {
----------------
I might be missing something, but is group number 0 defined to be the group starting at the end of Mask?

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:393
+static void createAlignMask(int VF, int Align,
+                              SmallVectorImpl<uint32_t> &Mask, int VectorWide,
+                              bool AlignBegin = false) {
----------------
Please check identation

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:394
+                              SmallVectorImpl<uint32_t> &Mask, int VectorWide,
+                              bool AlignBegin = false) {
+  int Lane = (VectorWide / 128 > 0) ? (VectorWide / 128) : 1;
----------------
Please document the arguments

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:395
+                              bool AlignBegin = false) {
+  int Lane = (VectorWide / 128 > 0) ? (VectorWide / 128) : 1;
+  int DiffToJump = AlignBegin ? Align : (VF / Lane - Align);
----------------
std::max(VectorWidth / 128, 1)

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:399
+  for (int i = 0 ; i < (VF / Lane); i++) {
+      int RealIndex = i + DiffToJump;
+      if (Lane > 1 && RealIndex > (VF / Lane) - 1)
----------------
identation

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:441
+
+  createAlignMask(VecElems, Offset[0] + Offset[1], VPAlign2, VectorWide, true);
+  createAlignMask(VecElems, Offset[1], VPAlign3, VectorWide, true);
----------------
So the last argument is always passed with the value true to all calls of this function?

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:453
+
+  Type *Ty = Vec[0]->getType();
+  for (int i = 0; i < 3; i++)
----------------
You could already create the Undef value instead of only its type here.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:478
+
+  Value *TempVec = Builder.CreateShuffleVector(Vec[1], Vec[1], VPAlign3);
+  TransposedMatrix[0] = Builder.CreateShuffleVector(Vec[0], Vec[0], VPAlign2);
----------------
The second vector argument should be an undef, and the shuffle mask should select only elements from the first vector operand.  This will eventually happen in InstCombine or DAGCombine, but why not avoid this compile-time overhead?

https://reviews.llvm.org/D36960