[llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1

Wed Jun 29 14:43:49 PDT 2016

Rob, Ahmed, and Jingu,

[I'm sorry if my point of view is too x86-centric.]

>>the tricky part about fixing it is the need to settle on a memory layout for these vectors
>> (packed vs byte per i1;  packed would be compatible with AVX512, I think).

I agree with Ahmed here, in principle. There is actually more to it than that, since a vector
compare in AVX2 and below produces the same bit width per element as the compared data.
For example, in code with mixed data types, it isn't rare for the result of an integer vector
compare (0/0xFFFFFFFF, not even 0/1) to be consumed by a double-precision blend (or compute),
and vice versa, so a mask conversion between 32 bits per element and 64 bits per element has
to happen. We need to minimize conversions between 0/1 logic and 0/-1 logic, and also
conversions between different element sizes. Doing so for AVX2 and below is challenging enough.
The introduction of AVX512F in Xeon Phi added another challenge for vectorizer developers.
The addition of AVX512BW and VL should make it easier.
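
To make the conversion step concrete, here is a minimal scalar sketch in C of the
mixed-type case described above. It is only an illustration, not LLVM code: the function
names and the fixed lane count are hypothetical, and each loop stands in for what a single
vector instruction would do per lane.

```c
#include <stdint.h>
#include <string.h>

enum { LANES = 4 };  /* hypothetical vectorization factor for this sketch */

/* Integer compare as AVX2-and-below does it: each 32-bit lane becomes
   0 or 0xFFFFFFFF (that is, 0 or -1), not 0 or 1. */
static void cmpgt_i32(const int32_t *a, const int32_t *b, int32_t *mask32) {
    for (int i = 0; i < LANES; ++i)
        mask32[i] = -(int32_t)(a[i] > b[i]);
}

/* The conversion step: sign-extend each 32-bit lane mask to 64 bits so
   that it lines up with double-precision lanes. */
static void widen_mask(const int32_t *mask32, int64_t *mask64) {
    for (int i = 0; i < LANES; ++i)
        mask64[i] = (int64_t)mask32[i];
}

/* Bitwise blend of doubles under a 64-bit lane mask: (x & m) | (y & ~m). */
static void blend_f64(const int64_t *mask64, const double *x,
                      const double *y, double *out) {
    for (int i = 0; i < LANES; ++i) {
        uint64_t xb, yb, m = (uint64_t)mask64[i];
        memcpy(&xb, &x[i], sizeof xb);   /* reinterpret double bits safely */
        memcpy(&yb, &y[i], sizeof yb);
        uint64_t r = (xb & m) | (yb & ~m);
        memcpy(&out[i], &r, sizeof r);
    }
}
```

The widen_mask step is pure overhead from the program's point of view; it exists only
because the compare and its consumer disagree on element size.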

Without AVX512BW and VL (i.e., on all of today's x86 targets), the optimal representation of
the result of a compare is determined by how it is consumed, and it is not a good idea
to have such an optimization in multiple different places. If the legalizer has to blindly
legalize v4i1 without knowing how it is consumed, it is best to look at what happens
to v8i1. We can then let the same optimizer do the work of getting optimal ASM code out
in the end, whether the vectorization factor is 4 or 8.

In the end, I may be agreeing with Rob, but not for the reasons Rob mentioned.
One of the headaches is that movmskps/pmovmskb do not have a quick reverse instruction
(MIC-AVX512 and below). I do not know LLVM's X86 CodeGen well enough to say whether it
internally has mask-to/from-vector nodes. If it does, I'd hope X86 CodeGen can cancel out such
conversions very efficiently in a peephole manner, so that blindly going for i1-per-elem (at type
legalization time) is good enough for most (if not all) cases. I also hope that is
good (or good enough) for other (i.e., non-x86) backends.
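
As a rough scalar model of the two directions (a sketch only; pack_mask4 and unpack_mask4
are hypothetical names, not LLVM nodes or intrinsics): packing mirrors what movmskps does
in one instruction, while unpacking is the reverse direction that has no single-instruction
equivalent on the targets mentioned above.

```c
#include <stdint.h>

/* Scalar model of movmskps: gather the sign bit of each of four 32-bit
   lanes into the low four bits of an integer (lane 0 goes to bit 0). */
static unsigned pack_mask4(const int32_t lanes[4]) {
    unsigned bits = 0;
    for (int i = 0; i < 4; ++i)
        bits |= ((uint32_t)lanes[i] >> 31) << i;
    return bits;
}

/* The reverse direction: expand bit i into lane i as 0 or -1. */
static void unpack_mask4(unsigned bits, int32_t lanes[4]) {
    for (int i = 0; i < 4; ++i)
        lanes[i] = -(int32_t)((bits >> i) & 1u);
}
```

A peephole of the kind hoped for above would recognize unpack_mask4(pack_mask4(v)) on a
0/-1 vector v and replace it with v itself.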

Thanks,
Hideki Saito
Vectorizer Technical Lead
Intel Compiler and Languages

-----------------------------------------------------
Message: 8
Date: Tue, 28 Jun 2016 10:57:09 -0700 (PDT)
From: Rob Cameron via llvm-dev <llvm-dev at lists.llvm.org>
To: Ahmed Bougacha <ahmed.bougacha at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Question about VectorLegalizer::ExpandStore() with v4i1
Message-ID: <1150997581.449524.1467136629022.JavaMail.zimbra at sfu.ca>

Hi, Ahmed.

A packed representation, one bit per i1, is natural and best for our
work, for sure.   In the Parabix project, we produced very fast text
and byte stream processing applications using packed bit streams,
stored 128 bits at a time for SSE/Neon/Altivec registers, 256 bits at
a time for AVX, 512 bits at a time for AVX512.

I also think that the one bit per i1 approach is best and most consistent
overall.   Vectors are not arrays.   Vectors are intended to be treated
as single values.  Whereas an array of i1 could reasonably be viewed as
an array of bytes, a vector of i1 should be packed. 
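
For illustration only, the two candidate memory layouts for a vector of i1 can be sketched
in C as follows. The function names are hypothetical, and placing element 0 in the least
significant bit of the packed form is an assumption of this sketch, not something the
thread has settled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-per-i1 layout: element i occupies byte i; n elements take n bytes.
   Returns the number of bytes written. */
static size_t store_bytewise(const bool *v, size_t n, uint8_t *dst) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = v[i] ? 1 : 0;
    return n;
}

/* Packed layout: element i occupies bit (i % 8) of byte (i / 8).
   Element 0 in the least significant bit is an assumption here.
   Returns the number of bytes written. */
static size_t store_packed(const bool *v, size_t n, uint8_t *dst) {
    size_t nbytes = (n + 7) / 8;
    for (size_t i = 0; i < nbytes; ++i)
        dst[i] = 0;
    for (size_t i = 0; i < n; ++i)
        if (v[i])
            dst[i / 8] |= (uint8_t)(1u << (i % 8));
    return nbytes;
}
```

For a v4i1 value the byte-per-i1 form writes four bytes, while the packed form writes a
single byte with four meaningful bits.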

The use of vector types in general should signify that efficient loading,
storing and manipulating of vectors is more important than manipulation of
individual elements.   The entire point is to provide a natural model for
SIMD instruction sets, it seems to me.

As you say, the packed representation makes a lot of sense for AVX512.
But even the existing SSE and AVX instruction sets use a packed representation
in many cases.   For example, the SSE operation movmskps produces a 4xi1
and pmovmskb produces 16xi1, both in packed form.   In addition, any
icmp or fcmp operation can be easily implemented using two instructions
to produce packed i1 values.   Our software relies on this packed
representation extensively.

> 
> JinGu,
> 
> Your analysis is correct, vectors of i1 are incorrectly legalized.
> This is a known issue (http://llvm.org/PR22603); the tricky part about
> fixing it is the need to settle on a memory layout for these vectors
> (packed vs byte per i1;  packed would be compatible with AVX512, I
> think).
> 
> -Ahmed
>