[LLVMdev] Extending vector operations
Stefanus Du Toit
stefanus.dutoit at rapidmind.com
Tue Jul 22 08:04:14 PDT 2008
Hi Nate,
On 21-Jul-08, at 7:46 PM, Nate Begeman wrote:
> On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:
>> 1) Vector shl, lshr, ashr
>>
> That seems reasonable.
Thanks.
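For concreteness, here is a sketch of what the elementwise shifts might look like in LLVM IR. The vector forms of shl/lshr/ashr are exactly what is being proposed, so this syntax is an assumption rather than anything that works today:

```llvm
; Proposed: shifts applied elementwise, with a per-element shift amount.
%sl = shl  <4 x i32> %a, <i32 1, i32 1, i32 1, i32 1>
%sr = ashr <4 x i32> %a, %amounts    ; arithmetic shift, per-element amounts
```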
>> 2) Vector trunc, sext, zext, fptrunc and fpext
>>
>> Again, I think these are hopefully straightforward. Please let me
>> know
>> if you expect any issues with vector operations that change element
>> sizes from the RHS to the LHS, e.g. around legalization.
>
> Is the proposed semantics here that the number of elements stays the
> same size, and the overall vector width changes?
Yes.
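In other words, something like the following (the vector forms of the casts are assumed here, by analogy with the scalar instructions):

```llvm
; Element count is preserved; the overall vector width changes.
%w = sext  <4 x i16> %v to <4 x i32>   ; 64-bit vector -> 128-bit vector
%n = trunc <4 x i32> %w to <4 x i16>   ; and back down
```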
>> 3) Vector intrinsics for floor, ceil, round, frac/modf
>>
>> These are operations that are not trivially specified in terms of
>> simpler operations. It would be nice to have these as overloaded,
>> target-independent intrinsics, in the same way as llvm.cos etc. are
>> supported now.
>
> It seems like these could be handled through intrinsics in the LLVM
> IR, and could use general improvement in the selection dag.
Right, that's what we were thinking too. Glad to hear that makes sense!
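Following the llvm.cos/llvm.sqrt convention, the declarations might look roughly like this (these intrinsic names are hypothetical, not existing ones):

```llvm
; Hypothetical overloaded, target-independent intrinsics:
declare <4 x float> @llvm.floor.v4f32(<4 x float>)
declare <4 x float> @llvm.ceil.v4f32(<4 x float>)
declare <4 x float> @llvm.round.v4f32(<4 x float>)

%f = call <4 x float> @llvm.floor.v4f32(<4 x float> %x)
```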
>
>> 4) Vector select
>>
>> We consider a vector select extremely important for a number of
>> operations. This would be an extension of select to support an <N x
>> i1> vector mask to select between elements of <N x T> vectors for
>> some
>> basic type T. Vector min, max, sign, etc. can be built on top of this
>> operation.
>
> How is this anything other than AND/ANDN/OR on any integer vector
> type? I don't see what adding this to the IR gets you for vectors,
> since "vector of i1" doesn't mean "vector of bits" necessarily.
Note that I don't mean a "select bits", but rather a "select
components" operation. In other words, a straightforward vectorization
of the existing "select" IR operation.
You can implement the existing LLVM select instruction with bitwise
operations, but it's still convenient to have. For one, as I
mentioned, it provides an obvious idiomatic way to express operations
like min and max.
Vector selection is a common operation amongst languages that support
vectors directly, and vector code will often avoid branches by
performing some form of predication instead.
I'm really not that concerned about how it's expressed, but there
needs to be a well-understood way to lower something that looks like
"vector ?:", vector max, etc. in a frontend to something that will
actually generate good code in the backend. If the idiom for vector
float max is "vfcmp, ashr, bitcast, and, xor, and, or, bitcast" and
that generates a single maxps from the x86 backend, great. If the
idiom is "vfcmp, select", even better.
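Under the proposed semantics, that second idiom for vector float max would look something like the following (assuming a vector compare that yields <4 x i1> and a vectorized select, which is the extension being discussed):

```llvm
; Hypothetical: the compare yields a per-element mask,
; and select picks whole elements from one operand or the other.
%m   = vfcmp ogt <4 x float> %a, %b    ; proposed: <4 x i1> result
%max = select <4 x i1> %m, <4 x float> %a, <4 x float> %b
```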
>> 5) Vector comparisons that return <N x i1>
>>
>> This is maybe not a must-have, and perhaps more a question of
>> preference. I understand the current vfcmp/vicmp semantics, returning
>> a vector of iK where K matches the bitwidth of the operands being
>> compared with the high bit set or not, are there for pragmatic
>> reasons, and that these functions exist to aid with code emitted that
>> uses machine-specific intrinsics.
>
> I totally disagree with this approach; A vector of i1 doesn't
> actually match what you want to do with the hardware, unless you had
> say, 128 x i1 for SSE, and it's strange when you have to spill and
> reload it.
I definitely am not thinking of <128 x i1>. The intent is really just
to express "here is a vector of values where only one bit matters in
each element" and have codegen map that to a representation
appropriate to the machine being targeted.
Of course these eventually need to be widened to an appropriately
sized integer vector, and that vector may be a mask or a 0/1 value, or
whatever. The responsibility for doing so can be placed at pretty much
any level of the stack, all the way up to making the user worry about
it.
For us, making the user worry about it isn't an option; we have first
class bools in our frontend, support vectors of these, and naturally
define comparisons, selection, etc to be consistent with this. So that
leaves it to either the part generating LLVM IR, or the LLVM
backend/mid-end. My inclination is that this job is best done in an
SSA representation, and with a specific machine in mind. If this is
completely infeasible, we'll have to worry
about it during generation of LLVM IR, which is fine. I'd just rather
see it done in LLVM where it can hopefully benefit others with the
same issues.
To me this isn't much different whether we're talking about vectors or
scalars; it's about the utility of an "i1" type generally, and about
whose responsibility it is to map such a type to hardware.
> The current VICMP and VFCMP instructions do not exist for use with
> machine intrinsics; they exist to allow code written using C-style
> comparison operators to generate efficient code on a wide range of
> both scalar and vector hardware.
OK. My understanding of this was based on an email from Chris to
llvm-dev. I had asked how these were used today, especially given the
lack of vector shifts. Here's his response:
>> They can be used with target-specific intrinsics. For example, SSE
>> provides a broad range of intrinsics to support instructions that
>> LLVM
>> IR can't express well. See llvm/include/llvm/IntrinsicsX86.td for
>> more details.
If you have examples on how these are expected to be used today
without machine intrinsics, that would really help - e.g. for
expressing something like a vector max.
>> For code that does not use machine intrinsics, I believe it would be
>> cleaner, simpler, and potentially more efficient, to have a vector
>> compare that returns <N x i1> instead. For example, in conjunction
>> with the above-mentioned vector select, this would allow a max to be
>> expressed simply as a sequence of compare and select.
>
> Having gone down this path, I'd have to disagree with you.
OK. Hopefully my comments above will make my reasoning clearer. If not
please let me know. You have a lot more experience with this in LLVM
than me so pardon my ignorance :). I'm not suggesting that these
semantics are absolutely the way to go, but we do need some way to
address these issues.
>> In addition to the above suggestions, I'd also like to hear what
>> others think about handling vector operations that aren't powers of
>> two in size, e.g. <3 x float> operations. I gather the status quo is
>> that only POT sizes are expected to work (although we've found some
>> bugs for things like <2 x float> that we're submitting). Ideally
>> things like <3 x float> operands would usually be rounded up to the
>> size supported by the machine directly. We can try to do this in the
>> frontend, but it would of course be ideal if these just worked. I'm
>> curious if anyone else out there has dealt with this already and has
>> some suggestions.
>
> Handling NPOT vectors in the code generator ideally would be great;
> I know some people are working on widening the operations to a wider
> legal vector type, and scalarizing is always a possibility as well.
> The main problem here is what to do with address calculations, and
> alignment.
Right, we've dealt with this in the past by effectively scalarizing
loads and stores (since simply extending a load might go beyond a page
boundary, etc.), but extending all register operations. I think this
addresses the address calculation and alignment issues?
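As a sketch, a <3 x float> load handled this way scalarizes the memory access but widens the register value (the details here are illustrative, not a worked-out lowering):

```llvm
; Scalarize the load so we never read past the third float
; (e.g. across a page boundary)...
%p1 = getelementptr float* %p, i32 1
%p2 = getelementptr float* %p, i32 2
%e0 = load float* %p
%e1 = load float* %p1
%e2 = load float* %p2
; ...but widen to <4 x float> for register operations.
%v0 = insertelement <4 x float> undef, float %e0, i32 0
%v1 = insertelement <4 x float> %v0, float %e1, i32 1
%v2 = insertelement <4 x float> %v1, float %e2, i32 2
```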
Thanks for the feedback, I hope to hear more.
Stefanus
--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463