[PATCH] [AArch64] Fix a bug generating incorrect instruction when building small vector.

Tue Jul 22 01:25:52 PDT 2014

See comments below.

Regards,
Kevin

================
Comment at: lib/Target/AArch64/AArch64ISelLowering.cpp:4192
@@ +4191,3 @@
+    if (SrcEltTy.bitsLT(SmallestEltTy)) {
+      ResMultiplier *= SmallestEltTy.getSizeInBits() / SrcEltTy.getSizeInBits();
+      SmallestEltTy = SrcEltTy;
----------------
Tim Northover wrote:
> It's actually far easier to just set ResMultiplier once outside the loop: you know SmallestEltTy and VT.getVectorElementType(). Those give the correct ratio.
I got myself limited to the previous solution.... Get it out of the loop is a better idea. 

================
Comment at: lib/Target/AArch64/AArch64ISelLowering.cpp:4269-4270
@@ -4248,2 +4268,4 @@
     }
+    for (int j = 0; j != ResMultiplier - 1; ++j)
+      Mask.push_back(-1);
   }
----------------
Tim Northover wrote:
> Kevin Qin wrote:
> > Tim Northover wrote:
> > > I don't think this is generic enough. Again in the situation with two mismatched sources (so take a v4i32 VT being produced from a v4i16 and a v8i8 for example).
> > > 
> > > Anything coming from the v4i16 needs to set *two* elements of Mask properly (and then 2 undefs).
> > > 
> > > I'm not sure, but a more thorough restructuring of this loop might make the result clearer too. Don't be afraid to move things about if you see anything.
> > For your example, v4i32 will bitcast to v16i8, which only 0, 8, 16, 24 are valid lane number, and all other lanes should be undef. The valid lane number is decided by ResMultiplier only, and independent to sources. So I think this code can cover all scenarios including two sources are mismatched(The mismatching of  two sources will reflect on their OffsetMultipliers, and finally change the Mask number). But I can make a change for code style.
> OK, I don't think I was clear enough. Let's look at a concrete example. Take LHS = v4i16, RHS = v8i8, result = v2i32 and imagine something like
> 
>     (BUILD_VECTOR (i32 extract_element LHS, 2),
>                   (i32 extract_element RHS, 0))
> 
> I think this should lead to v8i8 as the basic shuffle type (we agree there), with indices <4, 5, u, u, 8, u, u, u>. I think the current loop discards the 5, though.
If LHS = v4i16, RHS = v8i8, result = v2i32, it must be LHS = v4i8, RHS = v8i8, result = v2i8 before legalization. Because the actual lane size is decided by the smallest element size among result and 2 operands, and all other larger element sizes must come from legalization. Please correct me if I'm wrong. Based on this logic, the mask <4, u, u, u, 8, u, u, u> is correct.

http://reviews.llvm.org/D4385