[LLVMdev] Vector instructions
Stefanus Du Toit
stefanus.dutoit at rapidmind.com
Fri Jun 27 08:02:19 PDT 2008
Hi Dan,
Thanks for your comments. I've responded inline below.
On 26-Jun-08, at 6:49 PM, Dan Gohman wrote:
> On Jun 26, 2008, at 1:56 PM, Stefanus Du Toit wrote:
>>
>> ===
>> 1. Shufflevector only accepts vectors of the same type
>>
>> I would propose to change the syntax from:
>>
>>> <result> = shufflevector <n x <ty>> <v1>, <n x <ty>> <v2>, <n x i32>
>>> <mask> ; yields <n x <ty>>
>>
>> to:
>>
>>> <result> = shufflevector <a x <ty>> <v1>, <b x <ty>> <v2>, <d x i32>
>>> <mask> ; yields <d x <ty>>
>>
>> With the requirement that the entries in the (still constant) mask
>> are
>> within the range of [0, a + b - 1].
> The alternative is to have the frontend synthesize the needed
> operations with extracts, inserts, and possibly shuffles if needed.
> LLVM is actually fairly well prepared to optimize code like this.
> I recommend giving this a try, and reporting any problems you
> encounter.
That certainly appears to be the only option at the moment, and we'll
have a look to see how that works out. However, note that a
sufficiently generalized shufflevector would remove the need for
insertelement and extractelement to exist completely.
>> 2. vector select
>> 3. vector trunc, sext, zext, fptrunc, fpext
>> 4. vector shl, lshr, ashr
> [...]
>
> We agree that these would be useful. There are intentions to add them
> to LLVM; others can say more.
OK. I'd love to hear more, especially if someone is planning to do
this in the short term.
>> 4. vfcmp, vicmp return types
>>
>> This topic came up recently on llvm-dev (thread "VFCmp failing when
>> unordered or UnsafeFPMath on x86"). Having vector compares return a
>> vector of integer elements with bit width equal to the element types
>> being compared seems unnecessarily inconsistent with the icmp and
>> fcmp
>> instructions. Since only one bit is well-defined anyways, why not
>> just
>> return a vector of i1 elements? If after codegen this is expanded
>> into
>> some other width, that's fine -- the backend can do this since only
>> one bit of information is exposed at the IR level anyways.
>>
>> Having these instructions return vector of i1 would also be nicely
>> consistent with a vector select operation. Vector bit shifts and
>> trunc
>> would make this easier, but it seems to me that the entire IR would
>> be
>> far more consistent if these operations simply returned i1 vectors.
>
> It turns out that having them return vectors of i1 would be somewhat
> complicated. For example, a <4 x i1> on an SSE2 target could expand
> either to 1 <4 x i32> or 2 <2 x i64>s, and the optimal thing would
> be to make the decision based on the comparison that produced them,
> but LLVM isn't yet equipped for that.
Can you expand on this a bit? I'm guessing you're referring to
specific mechanics in LLVM's code generation framework making this
difficult to do?
From a theoretical point of view (keeping in mind current
architectures) there can definitely be a need to change the
representation at various points in the program. For example, a <2 x
i1> generated from comparing some <2 x float> types could later be
used to select amongst two <2 x double> types, and on most current
architectures this would require a mask to be expanded. There are
three cases of instructions at which representations for i1 vectors
must be chosen: instructions which generate vectors of i1s (e.g.
vfcmp), instructions which consume them (e.g. a vector select) and
instructions which do as both (e.g. shufflevector, phi statements). I
would expect generating statements to generally have only one fixed
choice (since common ISAs generally create masks of the size of what's
being compared), and consumers to generally dictate a representation,
possibly necessitating conversion code to be inserted. For
instructions that transform vectors of i1 in some way, there may be
some additional optimization opportunity to avoid conversion code. I
would also presume that standard optimizations such as a form of CSE
or VN and code motion would help reduce the number of conversions.
In fact, this issue of representing i1 vectors isn't even constrained
to vector types. Scalar code which stores i1 values in regular
registers needs to have the same considerations (e.g. on Cell SPU
where there is no condition register).
Perhaps all that was way off topic, and you're just referring to a
simple mechanical issue preventing this right now. My point is just
that representation of booleans in registers is an important
architecturally specific issue to handle, and the frontend is (in my
humble opinion) probably not the best place for this.
> vicmp and vfcmp are very much aimed at solving practical problems on
> popular architectures without needing significant new infrastructure
> They're relatively new, and as you say, they'll be more useful when
> combined with vector shifts and friends and we start teaching LLVM
> to recognize popular idioms with them.
Can you give me examples of how they are used today at all? I'm having
a really hard time figuring out a good use for them (that doesn't
involve effectively scalarizing them immediately) without these other
instructions.
> One interesting point here is that if we ever do want to have vector
> comparisons that return vectors of i1, one way to do that would be to
> overload plain icmp and fcmp, which would be neatly consistent with
> the way plain mul and add are overloaded for vector types instead of
> having a separate vadd and vmul.
Agreed, this would be nice.
--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
More information about the llvm-dev
mailing list