[EXT] [PATCH] D75388: Add a pass to identify certain shuffle_vector and transform it into target specific intrinsics.

David Green via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 6 12:00:15 PST 2020


On 05/03/2020 00:35, Wei Zhao wrote:
> 1) Why not do this work in ISel?
> 
> In LLVM IR, we have shufflevector
> In LLVM DAG, we have vector_shuffle
> 
> In ISel, the normal process for translating an LLVM IR shufflevector into a TBLn instruction is to translate it first to the DAG node vector_shuffle, and then lower that during DAG legalization to machine-specific instructions like TBLn.
> 
> There is a very strict and deeply rooted convention or requirement on how to translate LLVM IR shufflevector -> DAG node vector_shuffle.
> 
> In the following example:
> %v3 = shufflevector <8 x i8> %v1, <8 x i8> %v2, <8 x i32>
>                                   <i32 1, i32 0, i32 3, i32 2, i32 5, i32 4, i32 7, i32 6> ; yields <8 x i8>
> 
> %v3 and %v1, %v2 have to be the same type.
> 
> If not, ISel won't translate it to the DAG node vector_shuffle, and so it won't later be lowered into a TBLn instruction.
> Instead, ISel (which has at most 9 steps in total) will lower it into a sequence of extract/insert instructions to form a new vector; in the above example, %v3 would be built element by element with extracts and inserts.
> 
> Because ISel is coded like this, it is very hard to break this rule and handle irregular (mismatched-type) shufflevectors in the ISel stage.
> 
> To my understanding, the type-match requirement on shufflevector dates back to the early days of LLVM, when it was used to generate code for Motorola AltiVec.
> 
> Because of all this, we decided to follow the example of LDN/STN generation in InterleavedAccessPass() and directly generate the TBL1 instruction from LLVM IR, before entering ISel.
> 

You may be right that IR is the simpler place for this to be. If we could add it to ISel, though, that would be the more natural place for it, and it would save us from having to look at uses like this. ISel knows about all the types and lanes that are in play at the machine level. A lot of the times these pre-ISel passes have been added feels like a mistake to me. When masking is involved they are sometimes really required, but DAGCombine is good at taking many things into account and optimising them all at once.

There are ReconstructShuffle and isShuffleMaskLegal, but they might not be applicable given the way this example goes via a smaller legal vector. At one point it looks like we have:
              t5: v8i16,ch = load<(load 16 from %ir.scevgep238, align 2)> t0, t2, undef:i64
            t16: i32 = extract_vector_elt t5, Constant:i64<0>
            t17: i32 = extract_vector_elt t5, Constant:i64<4>
          t20: v2i32 = BUILD_VECTOR t16, t17
          t22: v2i32 = BUILD_VECTOR Constant:i32<65535>, Constant:i32<65535>
        t24: v2i32 = and t20, t22
      t26: v2i64 = zero_extend t24
Going into and out of a v2i32/v4i32 like that is not going to be efficient. It may make sense to take sequences like this and flatten them to simpler code, potentially using the VTBL like you propose. (It might turn out this example is nothing like your actual code though, or that some cases are not worth optimising like that. The VTBL is a powerful instruction, but it comes at a cost; we need to be careful not to overuse it where simpler code would be better.)
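To make that concrete, here is a small Python model (illustrative only; little-endian byte order assumed) of how the whole extract/and/zero_extend chain above could collapse into a single byte-table lookup, using the AArch64 TBL convention that out-of-range indices produce zero:

```python
import struct

def tbl1(src, idx):
    """Model of AArch64 TBL (single-register form): a byte-level
    table lookup where out-of-range indices yield 0."""
    return bytes(src[i] if i < len(src) else 0 for i in idx)

# 8 x i16 loaded into a 128-bit register (arbitrary lane values).
lanes = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]
v = struct.pack('<8H', *lanes)

# One mask places u16 lanes 0 and 4, zero-extended, into two i64 lanes:
# bytes 0-1 of the source go to bytes 0-1 of the result, bytes 8-9 to
# bytes 8-9, and index 255 zero-fills everything else.
mask = [0, 1] + [255] * 6 + [8, 9] + [255] * 6
res = struct.unpack('<2Q', tbl1(v, mask))
assert res == (lanes[0], lanes[4])
```

This reproduces the v2i64 result of the extract/and/zext sequence in one lookup, at least in this byte-level model.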

I'm not convinced either way, but needing to check the uses like that sounds off to me.

> 2) Why looking at shufflevector's user instruction?
> 
> Let's look at another example:
>         %wide.vec = load <8 x i16>, <8 x i16>* %scevgep238, align 2, !tbaa !8
>         %strided.vec = shufflevector <8 x i16> %wide.vec, <8 x i16> undef, <2 x i32> <i32 0, i32 4>
>         %2 = uitofp <2 x i16> %strided.vec to <2 x double>
> 
> Basically it loads 128 bits of data into a V register, and extracts the 0th and the 4th 16-bit elements to form a v2i16 vector. Note that %strided.vec has type v2i16 while the input vector has type v8i16; they are not the same. So, per the analysis above, the current ISel will lower this into a sequence of extract/insert + shifting + masking instructions, whose cost is very high.
> 
> The type-match requirement is not baseless: when it is met, it helps form a "complete" (128-bit) vector with every byte defined, so that the user instruction can work on it directly.
> 
> In the above example, the result of the shufflevector is a v2i16 vector; it cannot be used directly by its user instruction uitofp without shifting and alignment according to the <2 x double> type.
> 
> So by looking at the user instruction, we can figure out the corresponding TBL1 instruction's mask. In this case, the first element of the <2 x i16> %strided.vec should sit at byte offset 0, while the 2nd element should sit at byte offset (2-1)*bitwidth(double). On AArch64 the TBLn instructions have no notion of type; they only know bytes, and the mask is byte-based. So in the above example we can use a single TBL1 instruction to place the v2i16 vector into a 128-bit register, avoiding the high cost of the extract/insert (and shifting/masking) instructions. This brings a large performance gain.

I see. You are essentially looking for the <2 x double> user, so that you know which lanes the result will need to end up in.
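If I'm reading the scheme right, the mask derivation could be sketched like this (hypothetical helper, not the patch's actual code; byte indices follow the AArch64 TBL convention, with 255 as an out-of-range index that zero-fills):

```python
def tbl_mask(src_elt_bytes, dst_elt_bytes, src_indices):
    """Build a TBL byte mask that places each selected source element
    at the lane position the user type expects, zero-filling the rest."""
    mask = [255] * (dst_elt_bytes * len(src_indices))
    for lane, src in enumerate(src_indices):
        for b in range(src_elt_bytes):
            mask[lane * dst_elt_bytes + b] = src * src_elt_bytes + b
    return mask

# <2 x i16> taken from lanes 0 and 4 of a v8i16, laid out for a
# <2 x double> user: i16 is 2 bytes, double is 8 bytes.
assert tbl_mask(2, 8, [0, 4]) == [0, 1] + [255] * 6 + [8, 9] + [255] * 6
```

The destination element width comes from the user (<2 x double>), which is exactly why the pass needs to look at the shuffle's users.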

Am I right in saying that in the example above this would be the same as (and (load v8i16), 0x000000000000ffff000000000000ffff)? Because the lanes are already in the correct place, they just need to be masked off. In other cases they would need to be shifted too (so a single vtbl may do better).
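That equivalence is easy to sanity-check with 128-bit integer arithmetic (a little-endian Python model; the lane values are arbitrary):

```python
import struct

# A v8i16 register viewed as a 128-bit little-endian integer.
lanes = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555, 0x6666, 0x7777, 0x8888]
v128 = int.from_bytes(struct.pack('<8H', *lanes), 'little')

# Lane 0 occupies bits 0-15 and lane 4 occupies bits 64-79, so a
# single AND keeps exactly the two u16 values, already zero-extended
# into the two i64 lanes.
masked = v128 & 0x000000000000ffff000000000000ffff
lo, hi = masked & (2**64 - 1), masked >> 64
assert (lo, hi) == (lanes[0], lanes[4])
```

So for this particular shuffle the lanes really are in place already; only shuffles whose elements move would need the table lookup.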

Another point in ISel's favour is that it already knows these types. Not sure if that would make it better or just as awkward.

> 3) Cost model
> Because a shufflevector on mismatched-type vectors is usually lowered into a sequence of extract/insert instructions, the LLVM loop vectorizer gives a very high cost to interleaved memory accesses based on this extract/insert approach to forming a vector. For example, in the above case, forming a v2i16 vector from a v8i16 vector at VF=2, UF=4 is costed at 34 for the interleaved access. This high cost kills the loop's vectorization, as most other instructions usually cost 0 or 1. Without TBL1 to form the new vector at low cost, many loops will not be vectorized.
> 
> On Marvell's thunderx2t99, TBL1 takes 5 cycles to finish, almost the same as an fadd/fmul, while TBL2/3/4 take more cycles.
> 
> TBL1 definitely costs far fewer cycles than a sequence of extract/insert instructions to form a vector, and that is the motivation for our work.

Yep. I was thinking that the VTBL would cost 1, but that doesn't account for the cost of the load either. (The VTBL is also probably 3 instructions if I'm understanding correctly: an adr, a load and the vtbl, plus a constant-pool block. But the adr and load should be pulled up out of the loop, making them free for the vectorizer's cost model; they just end up taking an extra register.)
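As a rough back-of-the-envelope model (the per-operation costs here are illustrative guesses, not the real TTI numbers), the gap between the two lowerings looks something like this:

```python
def extract_insert_cost(uf, vf, extract=2, insert=2, load=1):
    # One wide load per unrolled copy, then every result element is
    # built with a scalar extract + vector insert pair.
    # Illustrative costs only; the real numbers come from TTI.
    return uf * (load + vf * (extract + insert))

def tbl_cost(uf, tbl=1, load=1):
    # One wide load + one TBL1 per unrolled copy; the mask adr/load
    # is hoisted out of the loop, so it is not charged per iteration.
    return uf * (load + tbl)

# VF=2, UF=4, as in the example: in the same ballpark as the 34
# quoted above for the extract/insert lowering, versus a handful
# of instructions for the TBL-based one.
assert extract_insert_cost(4, 2) == 36
assert tbl_cost(4) == 8
```

Under any reasonable choice of per-op costs the extract/insert form scales with UF*VF while the TBL form scales only with UF, which is presumably what the cost-model half of the patch wants to express.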

I think it's worth considering this as at least 2 separate parts, one that gets the backend to lower to the better sequences of instructions (through this or through ISel if it would work), and the other that gets the vectorizer to produce the code we want (by adjusting the cost model).

Both patches need tests to show that they are behaving correctly. Those tests might also shed more light on the best way forward here.

Looks very interesting,
Dave


> Wei Zhao
>       __o   Hurry ...
>   _  \<,_
> (_)/ (_)
> ~~~~~~~~~~~~
> 
> -----Original Message-----
> From: Dave Green via Phabricator <reviews at reviews.llvm.org>
> Sent: Sunday, March 1, 2020 7:46 AM
> To: Wei Zhao <wxz at marvell.com>; mail at justinbogner.com; dorit.nuzman at intel.com
> Cc: david.green at arm.com; joelkevinjones at gmail.com; kristof.beyls at arm.com; hiraditya at msn.com; llvm-commits at lists.llvm.org; t.p.northover at gmail.com; mcrosier at codeaurora.org; florian_hahn at apple.com; simon.moll at emea.nec.com; daniel.kiss at arm.com
> Subject: [EXT] [PATCH] D75388: Add a pass to identify certain shuffle_vector and transform it into target specific intrinsics.
> 
> External Email
> 
> ----------------------------------------------------------------------
> dmgreen added a comment.
> 
> Hello. Looks interesting. We ended up doing something similar with lowering of interleaved access groups to VMOVN instructions in MVE. It went straight through ISel though, not needing to go via the InterleavedAccessPass.  I don't immediately see why this case would need to be done differently. It looks like ISel can already generate at least some TBL1 instructions.
> 
> Can you add some testcases for this? Both for producing this in the vectorizer/costmodel tests and for the backend codegen of the load+shuffle patterns you expect to see.
> 
> Some other initial thoughts:
> 
> - Can you run clang-format over the patch? That would reduce the amount of noise from the lint bot.
> - VF can be calculated from VecTy and Factor.
> - Checking the instruction users are a certain kind sounds odd. Can you explain why it's checking that and only generating in those cases?
> - I was half expecting BaseT::getInterleavedMemoryOpCost to return something like getMemoryOpCost + getShuffleCost, but it seems to use the cost of inserts + extracts.
> 
> 
> Repository:
>    rG LLVM Github Monorepo
> 
> CHANGES SINCE LAST ACTION
>    https://reviews.llvm.org/D75388/new/
> 
> https://reviews.llvm.org/D75388
> 
> 
> 
