[LLVMdev] SelectionDAG scalarizes vector operations.

Wed Feb 8 00:36:19 PST 2012

Hi Nadav,

> I had a few thoughts regarding our short discussion yesterday.
>
>   I am not sure how we can lower SEXT into the vpmovsx family of instructions. I propose the following strategy for the ZEXT and ANYEXT family of functions.

what I would like to understand first is why there are any vector xEXT nodes
at all!  As I tried to explain on IRC, I don't think you ever get these from
the GCC autovectorizer except as part of a shuffle-extend pair.  Where do you
get these nodes from?  Does the Intel auto-vectorizer produce them more often
than the GCC one?

Ciao, Duncan.

>   At first, we let the Type Legalizer/VectorOpLegalizer scalarize the code. Next, we allow the dag-combiner to convert the BUILD_VECTOR node into a shuffle. This is possible because all of the inputs of the build vector come from two values (src and (undef or zero)).  Finally, the shuffle lowering code lowers the new shuffle node into UNPCKLPS. This sequence should be optimal for all of the sane types.
>
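
To make the strategy quoted above concrete, here is a minimal sketch in portable C++ (not LLVM code) of why a zero-extension of the low two lanes of <4 x i32> can be written as a single two-input shuffle with a zero vector. The type aliases and the functions `shuffle`, `bitcast_to_v2i64`, `zext_low_via_unpack` and `zext_low_scalar` are hypothetical names introduced for illustration, and the bitcast assumes x86-style little-endian lane order. The mask <0,4,1,5> is the interleave-low pattern that UNPCKLPS implements.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4i32 = std::array<std::uint32_t, 4>;
using V2i64 = std::array<std::uint64_t, 2>;

// Two-input shuffle: mask indices 0..3 select from a, 4..7 select from b.
V4i32 shuffle(const V4i32& a, const V4i32& b, const std::array<int, 4>& mask) {
    V4i32 r{};
    for (int i = 0; i < 4; ++i) {
        assert(mask[i] >= 0 && mask[i] < 8);
        r[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
    }
    return r;
}

// Reinterpret pairs of 32-bit lanes as 64-bit lanes (little-endian lane
// order assumed: the even lane is the low half of each 64-bit element).
V2i64 bitcast_to_v2i64(const V4i32& v) {
    return {v[0] | (std::uint64_t(v[1]) << 32),
            v[2] | (std::uint64_t(v[3]) << 32)};
}

// Zero-extend lanes 0 and 1 by interleaving them with zeros.
V2i64 zext_low_via_unpack(const V4i32& src) {
    const V4i32 zero{0, 0, 0, 0};
    return bitcast_to_v2i64(shuffle(src, zero, {0, 4, 1, 5}));
}

// Reference: the scalarized form, one extension per element.
V2i64 zext_low_scalar(const V4i32& src) {
    return {std::uint64_t(src[0]), std::uint64_t(src[1])};
}
```

Both functions return the same value for any input; the shuffle form is the one a shuffle-lowering pass could map onto a single unpack instruction.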
> Once we implement ZEXT and ANYEXT we could issue an INREG_SEXT instruction to support SEXT.  Unfortunately, v2i64 SRA is not supported by the hardware and the code will be scalarized ...
>
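
As a rough illustration of why the missing v2i64 SRA matters here: an in-register sign extension is commonly expressed as a left shift followed by an arithmetic right shift, so each 64-bit lane needs an SRA. The sketch below shows that expansion for one lane in plain C++; `sext_inreg` is a hypothetical helper name, not an LLVM API.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of x within a 64-bit lane.
// Step 1: shift left so the sign bit of the narrow value becomes bit 63.
// Step 2: arithmetic shift right (SRA) to replicate it into the high bits.
// Note: before C++20 the unsigned-to-signed conversion and the right shift
// of a negative value are implementation-defined; mainstream compilers use
// two's complement and an arithmetic shift, which this sketch assumes.
std::int64_t sext_inreg(std::uint64_t x, unsigned bits) {
    assert(bits >= 1 && bits <= 64);
    const unsigned sh = 64 - bits;
    return static_cast<std::int64_t>(x << sh) >> sh;
}
```

For example, `sext_inreg(v, 32)` ignores whatever an ANYEXT left in the upper 32 bits of `v` and fills them with copies of bit 31.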
> Currently we promote vector elements to the widest possible type, until we hit the _first_ legal register type.  For AVX, where YMM registers extend XMM registers, it is not clear to me why we stop at XMM sized registers. In some cases, masks of type <4 x i1> are legalized to <4 x i32> in XMM registers even if they are the result of a vector-compare of <4 x i64> types.  I also had a second observation, which contradicts the first one. In many cases we 'over promote'. Consider the <2 x i32> type. Promoting the elements to <2 x i64> forces us to use types which are not supported by the instruction set. For example, not all of the shift operations are implemented for vector i64 types.  Maybe a different strategy would be to promote vector elements up to i32, which is the common element type for most processors, and widen the vector from this point onwards.  I am not sure how we can implement vector compare/select with this approach.
>
> Thanks,
> Nadav
>
>> nadav: in my experience a lot of trouble comes from this kind of thing: there is an x86 instruction that takes the first two elements of <4 x i32>,
>> extends them from i32 to i64, and returns <2 x i64>
>> ^ all one instruction
>> how to represent that in LLVM IR? in LLVM IR it ends up as two IR instructions
>> first a shuffle that extracts <2 x i32> from <4 x i32> then some kind of extension from <2 x i32> to <2 x i64>
>> currently codegen doesn't do anything sensible with either of these two, let alone realize that together they correspond to a single processor instruction
>> nadav: anyway, what I'm saying is that a bunch of extensions seen in the IR/SDag may be due to this kind of thing
>> it certainly happens all the time with IR coming from the gcc vectorizers
>> we need to somehow turn the multiple nodes into one processor instruction
>> in fact this is pretty much the only way you can get extending casts of vectors with IR coming from the GCC vectorizer
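
The shuffle-extend pair described in the IRC log can be pictured with a toy combine step. The following is only a sketch over a made-up node type, not the SelectionDAG API: `Op`, `Node`, `make` and `combine` are all hypothetical names. It recognizes a sign extension whose operand is an extract-low shuffle and replaces the pair with one fused node, which is the shape a single extending instruction would match.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Toy opcodes: ExtractLow stands for the shuffle taking <2 x i32> out of
// <4 x i32>, SignExtend for the <2 x i32> to <2 x i64> extension, and
// FusedLowSext for the single instruction doing both.
enum class Op { Input, ExtractLow, SignExtend, FusedLowSext };

struct Node {
    Op op;
    std::shared_ptr<Node> operand;  // null for Input
};

using NodePtr = std::shared_ptr<Node>;

NodePtr make(Op op, NodePtr operand = nullptr) {
    return std::make_shared<Node>(Node{op, std::move(operand)});
}

// If n is SignExtend(ExtractLow(x)), return FusedLowSext(x);
// otherwise return n unchanged.
NodePtr combine(const NodePtr& n) {
    assert(n);
    if (n->op == Op::SignExtend && n->operand &&
        n->operand->op == Op::ExtractLow) {
        return make(Op::FusedLowSext, n->operand->operand);
    }
    return n;
}
```

A real combine would also have to check the shuffle mask and the element types; the sketch leaves those checks out to keep the pattern itself visible.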