[llvm-commits] shufflevector on ARM (clumsy x-post from llvmdev)

Tim Northover Tim.Northover at arm.com
Fri Jan 7 07:16:51 PST 2011


On 07/01/11 07:28, Bob Wilson wrote:

 > The extract_subvector patch looks good, except for the testsuite
 > changes. Those tests are supposed to test spill code, and your patch
 > causes them to stop spilling. I'll commit the patch after I fix the
 > tests to continue spilling in spite of your change.

Ah thanks. I'd convinced myself it was spilling, just slightly
differently. Glad you picked that up.

 > The build_vector patch looks good, too. Can you also provide some tests
 > that exercise this? (The test/CodeGen/ARM/vext.ll file would be a good
 > place to put them.)

Yep, I've uploaded the replacement to

http://www.maths.ed.ac.uk/~s0677366/build_vector_r2.patch

It incorporates your comments and some tests that I believe exercise
most of the code (some is just there in case weird lowering creates
something unexpected and I can't actually produce an example).

 > The loop below would be a good candidate for splitting into a separate
 > function, except that it would need quite a few arguments. Maybe this
 > entire chunk of new code could be a separate function (if that would
 > be cleaner)? I would prefer that, if it's not too awkward. The
 > LowerBUILD_VECTOR function is already getting pretty long and it makes
 > it hard to follow the logic.

Good suggestion, I've moved all the code into a separate function and I
think the control flow is rather simpler as a result.

 > This new code will apply to <4 x i32> vectors. The following code to
 > implement the BUILD_VECTOR by directly assigning subregisters will also
 > handle that case. Have you looked at which is better? It might be better
 > to swap the order of these. I suppose accessing S subregisters can be
 > slow since the move instructions will run in the VFP pipeline and cause
 > stalls on some processors.

I hadn't thought of anything so cunning. If both pieces of code apply
then the result is <4 x i32> and both source vectors are <4 x i32> as
well (if they are <2 x i32> I bail). I think this means that the result of my
code would be identical to a perfect shuffle with no added overhead (no
VEXTs).

So, assuming that perfect shuffles are indeed handled optimally, the order
I gave happens to be correct, more by luck than judgement.
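
To make the argument above concrete, here is a minimal Python sketch (not
the patch itself, and not LLVM code) of the equal-width case: a build_vector
whose lanes are all extracted from two sources of the same width can be
described by a plain two-operand shuffle mask, so no extra extraction step
is needed. The names build_vector_to_shuffle_mask and apply_shuffle, and the
(source, lane) representation, are hypothetical and exist only for this
illustration. The only LLVM fact assumed is the shufflevector convention
that lanes of the second operand are numbered after those of the first.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical model: each result lane is (source_index, lane_in_source).
Lane = Tuple[int, int]


def build_vector_to_shuffle_mask(lanes: Sequence[Lane],
                                 source_widths: Sequence[int]) -> Optional[List[int]]:
    """Return a two-operand shuffle mask, or None if this sketch bails."""
    width = len(lanes)
    # Assumption of this sketch: at most two sources, each exactly as wide
    # as the result. Anything else is treated as "bail" here.
    if len(source_widths) > 2 or any(w != width for w in source_widths):
        return None
    mask = []
    for src, lane in lanes:
        if not (0 <= src < len(source_widths)) or not (0 <= lane < width):
            return None
        # shufflevector convention: lanes of the second operand are
        # numbered from `width` upwards.
        mask.append(src * width + lane)
    return mask


def apply_shuffle(mask: Sequence[int], a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Evaluate a shuffle mask over the concatenation of a and b."""
    combined = list(a) + list(b)
    return [combined[i] for i in mask]


# Two <4 x i32>-style sources and a result that interleaves some lanes.
a = [10, 11, 12, 13]
b = [20, 21, 22, 23]
lanes = [(0, 1), (1, 0), (0, 3), (1, 2)]
mask = build_vector_to_shuffle_mask(lanes, [4, 4])
result = apply_shuffle(mask, a, b)
```

With equal-width sources the mask indexes straight into the two operands,
which is why I'd expect the result to match an ordinary shuffle in that case.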

Let me know if you disagree or have more suggestions.

Tim.

P.S. Sorry if this gets through twice; I'm still rather uncertain about
attachments on these lists and suspect my first attempt was black-holed.
