[LLVMdev] Vector select/compare support in LLVM

Rotem, Nadav nadav.rotem at intel.com
Tue Mar 8 11:46:45 PST 2011


Hello, 

I started working on adding vector support for the SELECT and CMP instructions in the codegen (bugs: 3384, 1784, 2314). 

Currently, the codegen scalarizes vector CMPs into multiple scalar CMPs.  It is easy to add similar scalarization support to the SELECT instruction.  However, using multiple scalar operations is slower than using vector operations. 
In LLVM, vector-compare operations generate a vector of i1s, and the vector-select instruction uses these vectors. In between, these values (masks) can be manipulated (xor-ed, and-ed, etc). 
For x86, I would like the codegen to generate the ‘pcmpeq’ and ‘blend’ family of instructions. SSE masks are implemented using a 32-bit word per item, where the MSB is used as the predicate and the rest of the bits are ignored. I believe that PPC Altivec and ARM Neon masks are also implemented this way.

I can think of two ways to represent masks in x86: sparse and packed. In the sparse method, the masks are kept in <4 x 32bit> registers, which are mapped to xmm registers. This is the ‘native’ way of using masks. 
In the second representation, the packed method, the MSB bits are collected from the xmm register into a packed general purpose register.  Luckily, SSE has the MOVMSKPS instruction, which converts sparse masks to packed masks. I am not sure which representation is better, but both are reasonable. The former may cause register pressure in some cases, while the latter may add the packing-unpacking overhead. 
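To make the two representations concrete, here is a minimal scalar sketch in C++. It is not LLVM code and uses no intrinsics; the type `Lanes` and the helper names are made up for illustration, and each loop only models what the corresponding SSE instruction does per lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 4 x 32-bit xmm register (illustrative name).
using Lanes = std::array<std::uint32_t, 4>;

// Sparse mask: a lane is all-ones where the inputs are equal and zero
// otherwise, which is the lane-wise behaviour of the pcmpeq family.
Lanes compareEqual(const Lanes &a, const Lanes &b) {
  Lanes mask{};
  for (std::size_t i = 0; i < 4; ++i)
    mask[i] = (a[i] == b[i]) ? 0xFFFFFFFFu : 0u;
  return mask;
}

// Packed mask: collect the MSB of each lane into a scalar, with lane i
// landing in bit i. This is the bit order MOVMSKPS produces.
unsigned packMask(const Lanes &mask) {
  unsigned packed = 0;
  for (std::size_t i = 0; i < 4; ++i)
    packed |= (mask[i] >> 31) << i;
  return packed;
}

// Unpacking: expand bit i of the scalar back into a full-width lane.
Lanes unpackMask(unsigned packed) {
  assert(packed < 16 && "only four lanes are modelled");
  Lanes mask{};
  for (std::size_t i = 0; i < 4; ++i)
    mask[i] = ((packed >> i) & 1u) ? 0xFFFFFFFFu : 0u;
  return mask;
}

// Select driven by a sparse mask: only the MSB of each mask lane is
// consulted, as with the variable blend instructions.
Lanes selectLanes(const Lanes &mask, const Lanes &ifTrue,
                  const Lanes &ifFalse) {
  Lanes result{};
  for (std::size_t i = 0; i < 4; ++i)
    result[i] = (mask[i] >> 31) ? ifTrue[i] : ifFalse[i];
  return result;
}
```

In this model the sparse form feeds `selectLanes` directly, while the packed form needs a `packMask` after the compare and an `unpackMask` before the select. That round trip is the packing-unpacking overhead mentioned above.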

_Sparse_
After my discussion with Duncan last week, I started working on promoting the type <4 x i1> to <4 x i32>, and I ran into a problem. It looks like the codegen term ‘promote’ is overloaded. For scalars, the ‘promote’ operation converts a scalar to a scalar of larger bit-width. For vectors, the ‘promote’ operation widens the vector to the next power-of-two number of elements, which is reasonable for types such as ‘<3 x float>’. Maybe we need to add a separate legalization operation that means widening the vector, so that ‘promote’ can mean per-element promotion? In any case, I estimated that implementing this per-element promotion would require major changes and decided that this is not the way to go.
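The two meanings of ‘promote’ can be contrasted with a toy model in C++. This is only a sketch: `VecType`, `promoteElements` and `widenToPowerOfTwo` are hypothetical names, not part of the LLVM legalizer, and real legalization picks the next *legal* type rather than blindly rounding up.

```cpp
#include <cassert>

// Toy description of a vector type: lane count and bits per element.
struct VecType {
  unsigned numElts;
  unsigned eltBits;
};

// Per-element promotion: the lane count stays the same and each element
// grows, e.g. <4 x i1> becomes <4 x i32>.
VecType promoteElements(VecType t, unsigned newBits) {
  assert(newBits >= t.eltBits && "promotion never narrows an element");
  return {t.numElts, newBits};
}

// Widening: the element width stays the same and the lane count is rounded
// up to a power of two, e.g. <3 x float> becomes <4 x float>.
VecType widenToPowerOfTwo(VecType t) {
  unsigned n = 1;
  while (n < t.numElts)
    n <<= 1;
  return {n, t.eltBits};
}
```

For example, `promoteElements` applied to a 4-lane, 1-bit type with `newBits` of 32 gives 4 lanes of 32 bits, while `widenToPowerOfTwo` applied to a 3-lane, 32-bit type gives 4 lanes of 32 bits.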

_Packed_
I followed Duncan’s original suggestion which was packing vectors of i1s into general purpose registers.
I started by adding several new types to ValueTypes (td and h): ‘v4i1, v8i1, v16i1 … v64i1’. For x86, I mapped v8i1 .. v64i1 to general purpose x86 registers. I started playing with a small program, which performed a vector CMP on 4 elements. The legalizer promoted the v4i1 to the next legal power-of-two type, which was v8i1. I changed WidenVecRes_SETCC and added a new method WidenVecOp_Select to handle the legalization of these types. Widening the Select and SETCC ops was simple, since I only widened the operands which needed widening. I am not sure if this is correct, but I ran into more problems before I could test it.
Another problem that I had was that i1 types are still promoted to i8 types. So if I have a vector such as ‘4 x i1: <0, 0, 1, 1>’, it will be mapped to a ‘BUILD_VECTOR’ DAG node which accepts 4 i8s and returns a single v4i1. This fails somewhere because the cast is illegal. The desired result is that the above vector is translated to the (packed) scalar value ‘3’. I hacked TargetLowering::ReplaceNodeResults and added minimal support for BUILD_VECTOR.
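The constant folding described here can be sketched in plain C++. The helper `packConstantMask` is hypothetical and is not the code added to ReplaceNodeResults. The bit order is also an assumption: the value ‘3’ for <0, 0, 1, 1> corresponds to the first listed element being the most significant bit, whereas a MOVMSKPS-style order (element i in bit i) would give 12 for the same vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: fold a constant vector of i1 values into one scalar.
// firstElementIsMsb chooses between the two plausible bit orders.
unsigned packConstantMask(const std::vector<bool> &elts,
                          bool firstElementIsMsb) {
  assert(elts.size() <= 32 && "mask must fit in a 32-bit scalar");
  const std::size_t n = elts.size();
  unsigned packed = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (!elts[i])
      continue;
    // Element i maps to bit (n - 1 - i) or to bit i, depending on the order.
    const std::size_t bit = firstElementIsMsb ? (n - 1 - i) : i;
    packed |= 1u << bit;
  }
  return packed;
}
```

Whichever order is chosen, it has to agree with the order in which the target packs masks at run time, otherwise constant masks and computed masks would disagree.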

I’d be interested in hearing your suggestions on which direction(s) to proceed in.

Thank you, 
Nadav