[LLVMdev] shufflevector on ARM

Thu Jan 6 03:55:55 PST 2011

Hi,

I've been taking a look at http://llvm.org/bugs/show_bug.cgi?id=8411,
which is essentially about improving how shufflevector instructions are
handled for ARM.

It looks like the main complexity comes from the fact that the DAG which
emerges from SelectionDAGBuilder::visitShuffleVector is often rather
low-level. A case could probably be made for keeping more information
around in the DAG, but that would require cooperation among all the
backends.

For now, both EXTRACT_SUBVECTORs and BUILD_VECTORs seem to be handled on
ARM mainly by resorting to the stack, which often leads to rather bad
code. EXTRACT_SUBVECTOR in particular should just involve ignoring one
of the registers.
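
To illustrate the "ignoring one of the registers" point: on NEON each
128-bit Q register aliases a pair of 64-bit D registers (Qn overlaps D2n
and D2n+1), so taking the low or high half of a Q-sized vector only needs
a register renaming. A minimal C++ sketch of that mapping follows; the
name dRegForHalf is made up for illustration and is not taken from the
patch or from LLVM.

```cpp
#include <cassert>

// Hypothetical helper (not LLVM code): return the number of the D register
// that holds one half of Q register `qReg`.
// NEON aliases Qn onto the pair D(2n) (low half) and D(2n+1) (high half).
unsigned dRegForHalf(unsigned qReg, bool highHalf) {
  assert(qReg < 16 && "NEON has Q0..Q15");
  return 2 * qReg + (highHalf ? 1u : 0u);
}
```

For example, the high half of Q3 is simply D7, so no load, store or move
is needed in principle.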

I've put together a couple of patches to improve matters:

http://www.maths.ed.ac.uk/~s0677366/build_vector.patch
http://www.maths.ed.ac.uk/~s0677366/extract_subvector.patch

(Both were originally created a couple of weeks ago, so the offsets have
changed slightly, but they're still valid on today's trunk.)

extract_subvector.patch adds "Pat"s to the relevant .td file so that
EXTRACT_SUBVECTOR works naturally. This changed the code generated in a
couple of tests and exposed a slight bug in visitShuffleVector itself, so
they needed modifying.

build_vector.patch is a more complex C++ modification, which attempts to
reconstruct shuffles from the BUILD_VECTOR nodes where possible.
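
As a rough illustration of the idea (a toy model, not the code in the
patch): if every BUILD_VECTOR operand is a lane extracted from one of at
most two source vectors, the node can be rewritten as a shuffle of those
two vectors. The sketch below assumes C++17; Operand and
reconstructShuffleMask are hypothetical names, and real SelectionDAG
nodes carry far more detail than this.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model of one BUILD_VECTOR operand: either "lane `lane` of source
// vector `source`" or, when source < 0, something that is not an extract.
struct Operand {
  int source;
  int lane;
};

// Hypothetical helper (not the patch's code): try to express a BUILD_VECTOR
// as a two-input shuffle. `srcLanes` is the lane count of each source.
// On success the mask follows the shufflevector convention: indices
// 0..srcLanes-1 select from the first source, and indices
// srcLanes..2*srcLanes-1 select from the second.
std::optional<std::vector<int>>
reconstructShuffleMask(const std::vector<Operand> &ops, int srcLanes) {
  assert(srcLanes > 0);
  int first = -1, second = -1; // the (at most two) distinct sources seen
  std::vector<int> mask;
  mask.reserve(ops.size());
  for (const Operand &op : ops) {
    if (op.source < 0 || op.lane < 0 || op.lane >= srcLanes)
      return std::nullopt; // not a valid lane extract: give up
    if (first < 0 || op.source == first) {
      first = op.source;
      mask.push_back(op.lane);
    } else if (second < 0 || op.source == second) {
      second = op.source;
      mask.push_back(srcLanes + op.lane);
    } else {
      return std::nullopt; // a third source cannot be a single shuffle
    }
  }
  return mask;
}
```

With two 4-lane sources, operands (0,1), (1,0), (0,3), (1,2) give the
mask <1, 4, 3, 6>. Whether that mask then has a good encoding is a
separate question for the existing shuffle lowering.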

Its primary effect is on <8 x *> -> <4 x *> shuffles, where on average
it saves 4.6 instructions, with a degradation (of 1 instruction) in only
5/83827 shuffles. Runtime benchmarks were more difficult; however, a
random sample suggests it improves about 75% of shuffles. I suspect
"natural" shuffles will fare better.

On <16 x *> -> <8 x *> shuffles, it rarely performs any optimization:
the output is probably only 0.03 instructions shorter on average. The
problem is that not many random shuffles actually have known good
encodings, so the existing VECTOR_SHUFFLE handling usually refuses to
deal with them. Again, I'd expect natural shuffles to do better.

The code is disabled on shorter vectors because my tests suggested the
transformations weren't improving matters. (The extract_subvector code
still applies -- it will always be better).

I would welcome any comments or suggestions.

Tim
