<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Jul 21, 2008, at 1:21 PM, Stefanus Du Toit wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Hi,<br><br>We would like to extend the vector operations in llvm a bit. We're <br>hoping to get some feedback on the right way to go, or some starting <br>points. I had previously had some discussion on this list about a <br>subset of the changes we have in mind.<br><br>All of these changes are intended to make target-independent IR (i.e. <br>IR without machine specific intrinsics) generate better code or be <br>easier to generate from a frontend with vector support (whether from <br>manual or autovectorization).<br><br>If you have any insight into how to best get started with any of these <br>changes, and whether they are feasible and sensible, please let me <br>know. We're mostly interested in x86 as a target in the short term, <br>but obviously want these to apply to other LLVM targets as well. We're <br>prepared to implement these changes, but would like to hear any <br>suggestions and objections you might have.<br><br>Below are the specific additions we have in mind.<br><br>===<br>1) Vector shl, lshr, ashr<br><br>I think these are no-brainers. We would like to extend the semantics <br>of the shifting instructions to naturally apply to vectors as well. <br>One issue is that these operations often only support a single shift <br>amount for an entire vector. I assume it should be fairly <br>straightforward to select on this pattern, and scalarize the general <br>case as necessary.</div></blockquote><div><br></div><div>That seems reasonable.</div><br><blockquote type="cite"><div>2) Vector strunc, sext, zext, fptrunc and fpext<br><br>Again, I think these are hopefully straightforward. Please let me know <br>if you expect any issues with vector operations that change element <br>sizes from the RHS to the LHS, e.g. 
around legalization.</div></blockquote><div><br></div><div>Are the proposed semantics here that the number of elements stays the same and the overall vector width changes?</div><div><br></div><blockquote type="cite"><div>3) Vector intrinsics for floor, ceil, round, frac/modf<br><br>These are operations that are not trivially specified in terms of <br>simpler operations. It would be nice to have these as overloaded, <br>target-independent intrinsics, in the same way as llvm.cos etc. are <br>supported now.</div></blockquote><div><br></div><div>It seems like these could be handled through intrinsics in the LLVM IR, and could use general improvement in the selection DAG.</div><br><blockquote type="cite"><div>4) Vector select<br><br>We consider a vector select extremely important for a number of <br>operations. This would be an extension of select to support an <N x <br>i1> vector mask to select between elements of <N x T> vectors for some <br>basic type T. Vector min, max, sign, etc. can be built on top of this <br>operation.</div></blockquote><div><br></div><div>How is this anything other than AND/ANDN/OR on any integer vector type? I don't see what adding this to the IR gets you for vectors, since "vector of i1" doesn't necessarily mean "vector of bits".</div><br><blockquote type="cite"><div>5) Vector comparisons that return <N x i1><br><br>This is maybe not a must-have, and perhaps more a question of <br>preference. I understand the current vfcmp/vicmp semantics, returning <br>a vector of iK where K matches the bitwidth of the operands being <br>compared with the high bit set or not, are there for pragmatic <br>reasons, and that these functions exist to aid with code emitted that <br>uses machine-specific intrinsics.</div></blockquote><div><br></div><div>I totally disagree with this approach. A vector of i1 doesn't actually match what you want to do with the hardware, unless you had, say, 128 x i1 for SSE, and it's strange when you have to spill and reload it. 
The current VICMP and VFCMP instructions do not exist for use with machine intrinsics; they exist to allow code written using C-style comparison operators to generate efficient code on a wide range of both scalar and vector hardware.</div><br><blockquote type="cite"><div>For code that does not use machine intrinsics, I believe it would be <br>cleaner, simpler, and potentially more efficient, to have a vector <br>compare that returns <N x i1> instead. For example, in conjunction <br>with the above-mentioned vector select, this would allow a max to be <br>expressed simply as a sequence of compare and select.</div></blockquote><div><br></div><div>Having gone down this path, I'd have to disagree with you.</div><blockquote type="cite"><div><font class="Apple-style-span" color="#000000"><br></font></div><div>In addition to the above suggestions, I'd also like to hear what <br>others think about handling vector operations that aren't powers of <br>two in size, e.g. <3 x float> operations. I gather the status quo is <br>that only POT sizes are expected to work (although we've found some <br>bugs for things like <2 x float> that we're submitting). Ideally <br>things like <3 x float> operands would usually be rounded up to the <br>size supported by the machine directly. We can try to do this in the <br>frontend, but it would of course be ideal if these just worked. I'm <br>curious if anyone else out there has dealt with this already and has <br>some suggestions.</div></blockquote><div><br></div><div>Ideally, handling NPOT vectors in the code generator would be great; I know some people are working on widening the operations to a wider legal vector type, and scalarizing is always a possibility as well. The main problem here is what to do with address calculations and alignment.</div><div><br></div><div>Nate</div></div></body></html>