[PATCH] [AArch64] Fix a bug generating incorrect instruction when building small vector.

Mon Jul 21 23:50:32 PDT 2014

Hi Kevin,

I'm not quite convinced yet:

================
Comment at: lib/Target/AArch64/AArch64ISelLowering.cpp:4192
@@ +4191,3 @@
+    if (SrcEltTy.bitsLT(SmallestEltTy)) {
+      ResMultiplier *= SmallestEltTy.getSizeInBits() / SrcEltTy.getSizeInBits();
+      SmallestEltTy = SrcEltTy;
----------------
It's actually far easier to just set ResMultiplier once outside the loop: you know SmallestEltTy and VT.getVectorElementType(). Those give the correct ratio.

================
Comment at: lib/Target/AArch64/AArch64ISelLowering.cpp:4269-4270
@@ -4248,2 +4268,4 @@
     }
+    for (int j = 0; j != ResMultiplier - 1; ++j)
+      Mask.push_back(-1);
   }
----------------
Kevin Qin wrote:
> Tim Northover wrote:
> > I don't think this is generic enough. Again in the situation with two mismatched sources (so take a v4i32 VT being produced from a v4i16 and a v8i8 for example).
> > 
> > Anything coming from the v4i16 needs to set *two* elements of Mask properly (and then 2 undefs).
> > 
> > I'm not sure, but a more thorough restructuring of this loop might make the result clearer too. Don't be afraid to move things about if you see anything.
> For your example, v4i32 will bitcast to v16i8, which only 0, 8, 16, 24 are valid lane number, and all other lanes should be undef. The valid lane number is decided by ResMultiplier only, and independent to sources. So I think this code can cover all scenarios including two sources are mismatched(The mismatching of  two sources will reflect on their OffsetMultipliers, and finally change the Mask number). But I can make a change for code style.
OK, I don't think I was clear enough. Let's look at a concrete example. Take LHS = v4i16, RHS = v8i8, result = v2i32 and imagine something like

    (BUILD_VECTOR (i32 extract_element LHS, 2),
                  (i32 extract_element RHS, 0))

I think this should lead to v8i8 as the basic shuffle type (we agree there), with indices <4, 5, u, u, 8, u, u, u>. I think the current loop discards the 5, though.

http://reviews.llvm.org/D4385