[LLVMdev] Extending vector operations
Stefanus Du Toit
stefanus.dutoit at rapidmind.com
Tue Jul 22 08:04:14 PDT 2008
Hi Nate,
On 21-Jul-08, at 7:46 PM, Nate Begeman wrote:
> On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:
>> 1) Vector shl, lshr, ashr
>>
> That seems reasonable.
Thanks.
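For concreteness, here is a sketch of what the elementwise shifts might look like in LLVM IR. The vector forms of shl/lshr/ashr are exactly what is being proposed, so this syntax is an assumption rather than anything that works today:

```llvm
; Proposed: shifts applied elementwise, with a per-element shift amount.
%sl = shl  <4 x i32> %a, <i32 1, i32 1, i32 1, i32 1>
%sr = ashr <4 x i32> %a, %amounts    ; arithmetic shift, per-element amounts
```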
>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me
>> know
>> if you expect any issues with vector operations that change element
>> sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the proposed semantics here that the number of elements stays the
> same size, and the overall vector width changes?
Yes.
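In other words, something like the following (the vector forms of the casts are assumed here, by analogy with the scalar instructions):

```llvm
; Element count is preserved; the overall vector width changes.
%w = sext  <4 x i16> %v to <4 x i32>   ; 64-bit vector -> 128-bit vector
%n = trunc <4 x i32> %w to <4 x i16>   ; and back down
```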
>> 3) Vector intrinsics for floor, ceil, round, frac/modf
>>
>> These are operations that are not trivially specified in terms of
>> simpler operations. It would be nice to have these as overloaded,
>> target-independent intrinsics, in the same way as llvm.cos etc. are
>> supported now.
>
> It seems like these could be handled through intrinsics in the LLVM
> IR, and could use general improvement in the selection dag.
Right, that's what we were thinking too. Glad to hear that makes sense!
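Following the llvm.cos/llvm.sqrt convention, the declarations might look roughly like this (these intrinsic names are hypothetical, not existing ones):

```llvm
; Hypothetical overloaded, target-independent intrinsics:
declare <4 x float> @llvm.floor.v4f32(<4 x float>)
declare <4 x float> @llvm.ceil.v4f32(<4 x float>)
declare <4 x float> @llvm.round.v4f32(<4 x float>)

%f = call <4 x float> @llvm.floor.v4f32(<4 x float> %x)
```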
>
>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of
>> operations. This would be an extension of select to support an <N x
>> i1> vector mask to select between elements of <N x T> vectors for
>> some
>> basic type T. Vector min, max, sign, etc. can be built on top of this
>> operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector
> type? I don't see what adding this to the IR gets you for vectors,
> since "vector of i1" doesn't mean "vector of bits" necessarily.
Note that I don't mean a "select bits", but rather a "select
components" operation. In other words, a straightforward vectorization
of the existing "select" IR operation.
You can implement the existing LLVM select instruction with bitwise
operations, but it's still convenient to have. For one, as I
mentioned, it provides an obvious idiomatic way to express operations
like min and max.
Vector selection is a common operation amongst languages that support
vectors directly, and vector code will often avoid branches by
performing some form of predication instead.
I'm really not that concerned about how it's expressed, but there
needs to be a well-understood way to lower something that looks like
"vector ?:", vector max, etc. in a frontend to something that will
actually generate good code in the backend. If the idiom for vector
float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and
that generates a single maxps from the x86 backend, great. If the
idiom is "vfcmp, select", even better.
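Under the proposed semantics, that second idiom for vector float max would look something like the following (assuming a vector compare that yields <4 x i1> and a vectorized select, which is the extension being discussed):

```llvm
; Hypothetical: the compare yields a per-element mask,
; and select picks whole elements from one operand or the other.
%m   = vfcmp ogt <4 x float> %a, %b    ; proposed: <4 x i1> result
%max = select <4 x i1> %m, <4 x float> %a, <4 x float> %b
```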
>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of
>> preference. I understand the current vfcmp/vicmp semantics, returning
>> a vector of iK where K matches the bitwidth of the operands being
>> compared with the high bit set or not, are there for pragmatic
>> reasons, and that these functions exist to aid with code emitted that
>> uses machine-specific intrinsics.
>
> I totally disagree with this approach; A vector of i1 doesn't
> actually match what you want to do with the hardware, unless you had
> say, 128 x i1 for SSE, and it's strange when you have to spill and
> reload it.
I definitely am not thinking of <128 x i1>. The intent is really just
to express "here is a vector of values where only one bit matters in
each element" and have codegen map that to a representation
appropriate to the machine being targeted.
Of course these eventually need to be widened to an appropriately
sized integer vector, and that vector may be a mask or a 0/1 value, or
whatever. The responsibility for doing so can be placed at pretty much
any level of the stack, all the way up to making the user worry about
it.
For us, making the user worry about it isn't an option; we have first
class bools in our frontend, support vectors of these, and naturally
define comparisons, selection, etc to be consistent with this. So that
leaves it to either the part generating LLVM IR, or the LLVM
backend/mid-end. My inclination is that this job is best done in an
SSA representation, and with a specific machine in mind. If this is
completely infeasible, we'll have to worry
about it during generation of LLVM IR, which is fine. I'd just rather
see it done in LLVM where it can hopefully benefit others with the
same issues.
To me this isn't much different whether we're talking about vectors or
scalars; it's about the utility of an "i1" type generally, and about
whose responsibility it is to map such a type to hardware.
> The current VICMP and VFCMP instructions do not exist for use with
> machine intrinsics; they exist to allow code written using C-style
> comparison operators to generate efficient code on a wide range of
> both scalar and vector hardware.
OK. My understanding of this was based on an email from Chris to
llvm-dev. I had asked how these were used today, especially given the
lack of vector shifts. Here's his response:
>> They can be used with target-specific intrinsics. For example, SSE
>> provides a broad range of intrinsics to support instructions that
>> LLVM
>> IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for
>> more details.
If you have examples on how these are expected to be used today
without machine intrinsics, that would really help - e.g. for
expressing something like a vector max.
>> For code that does not use machine intrinsics, I believe it would be
>> cleaner, simpler, and potentially more efficient, to have a vector
>> compare that returns <N x i1> instead. For example, in conjunction
>> with the above-mentioned vector select, this would allow a max to be
>> expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.
OK. Hopefully my comments above will make my reasoning clearer. If not
please let me know. You have a lot more experience with this in LLVM
than me so pardon my ignorance :). I'm not suggesting that these
semantics are absolutely the way to go, but we do need some way to
address these issues.
>> In addition to the above suggestions, I'd also like to hear what
>> others think about handling vector operations that aren't powers of
>> two in size, e.g. <3 x float> operations. I gather the status quo is
>> that only POT sizes are expected to work (although we've found some
>> bugs for things like <2 x float> that we're submitting). Ideally
>> things like <3 x float> operands would usually be rounded up to the
>> size supported by the machine directly. We can try to do this in the
>> frontend, but it would of course be ideal if these just worked. I'm
>> curious if anyone else out there has dealt with this already and has
>> some suggestions.
>
> Handling NPOT vectors in the code generator ideally would be great;
> I know some people are working on widening the operations to a wider
> legal vector type, and scalarizing is always a possibility as well.
> The main problem here is what to do with address calculations, and
> alignment.
Right, we've dealt with this in the past by effectively scalarizing
loads and stores (since simply extending a load might go beyond a page
boundary, etc.), but extending all register operations. I think this
addresses the address calculation and alignment issues?
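As a sketch, a <3 x float> load handled this way scalarizes the memory access but widens the register value (the details here are illustrative, not a worked-out lowering):

```llvm
; Scalarize the load so we never read past the third float
; (e.g. across a page boundary)...
%p1 = getelementptr float* %p, i32 1
%p2 = getelementptr float* %p, i32 2
%e0 = load float* %p
%e1 = load float* %p1
%e2 = load float* %p2
; ...but widen to <4 x float> for register operations.
%v0 = insertelement <4 x float> undef, float %e0, i32 0
%v1 = insertelement <4 x float> %v0, float %e1, i32 1
%v2 = insertelement <4 x float> %v1, float %e2, i32 2
```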
Thanks for the feedback, I hope to hear more.
Stefanus
--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463