[PATCH] Added tests for shufflevector lowering to blend instrs. These tests ensure that a change I will propose in clang works as expected.

Sat May 3 00:00:54 PDT 2014

Hi Nadav,

I don't know if I'm understanding what you're asking, so I'm just going
to dump a bunch of information to make sure we're on the same page. It
might get a bit long.

The _mm{256,}_blend* intrinsics in clang emit llvm x86 intrinsics
directly. That might keep us from optimizing a bunch of cases, especially
when functions get inlined.

I looked at the vector shuffle lowering in the x86 backend and we do, in
fact, lower the appropriate shufflevectors to blend instructions:
LowerVECTOR_SHUFFLEtoBlend
@ X86ISelLowering.cpp:6307

Since we know we lower the shufflevectors to an appropriate blend
instruction (we explicitly check for this exact pattern), I figured the
patch is safe to apply to clang. But to be absolutely sure we don't
regress in the future (in either llvm or clang), I decided to also write
tests to verify that we actually emit the blend instructions. These tests
should already have been in place, since a blend is a special case of a
shufflevector and we want to be sure we emit blends when appropriate, and
not a bunch of mov + pshuf, but they aren't there yet.

It gets even worse when you realize that the lowering of a select-based
blend operation is worse than that of the equivalent shufflevector. Take
the following code:

define <4 x float> @aaa(<4 x float> %a, <4 x float> %b) {
  %1 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  ret <4 x float> %1
}

define <4 x float> @bbb(<4 x float> %a, <4 x float> %b) {
  %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 false>, <4 x float> %a, <4 x float> %b
  ret <4 x float> %1
}

;; Compile with: llc -O3 -mattr=avx

The @aaa function gets compiled to
_aaa:                                   ## @aaa
vblendps $6, %xmm1, %xmm0, %xmm0
retq

While the @bbb function generates the following:

LCPI1_0:
.long 0                       ## 0x0
.long 4294967295              ## 0xffffffff
.long 4294967295              ## 0xffffffff
.long 0                       ## 0x0
...
_bbb:                                   ## @bbb
vmovaps LCPI1_0(%rip), %xmm2
vblendvps %xmm2, %xmm0, %xmm1, %xmm0
retq

This happens because the vselect DAG node is set to Expand, which ends up
generating the 128-bit constant and using the VBLENDVPSrr instruction. The
shufflevector code, on the other hand, goes through
LowerVECTOR_SHUFFLEtoBlend and generates the mask for the immediate,
picking the VBLENDPSrri version and not touching any memory.
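
To make the immediate in the @aaa output concrete, here is a minimal Python
model of how a shuffle mask can map to a blend immediate. It is only a
sketch of the idea and not the code in LowerVECTOR_SHUFFLEtoBlend; the
function name blend_immediate is made up for illustration, and undef lanes
are ignored.

```python
def blend_immediate(mask, num_lanes):
    """Model of turning a shuffle mask into a blend immediate.

    Hypothetical helper, not LLVM code. Lane i must read either element i
    of the first operand (index i) or element i of the second operand
    (index i + num_lanes). Returns None when the mask is not a blend.
    """
    imm = 0
    for lane, idx in enumerate(mask):
        if idx == lane:
            continue                  # lane comes from the first operand
        if idx == lane + num_lanes:
            imm |= 1 << lane          # lane comes from the second operand
        else:
            return None               # lane crosses positions: not a blend
    return imm


# Mask from @aaa: <i32 0, i32 5, i32 6, i32 3>
print(blend_immediate([0, 5, 6, 3], 4))  # 6, matching "vblendps $6"
```

A mask that moves an element to a different lane, such as <1, 0, 2, 3>,
returns None in this model, since no single blend can express it.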

The <8 x float> case is similar. The non-avx, non-sse4 version is also
much worse in the select case.

As for adding the builtin to clang, I have no idea how receptive they
will be to it. I think we should discuss that possibility on the clang
part of the patch.

But adding the __builtin_select to clang seems to me like the wrong way
to go. As far as optimizations go, it seems like it would be much easier
to turn a select with a ConstantInt vector as a mask into a shufflevector
than the other way around.
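
As a rough illustration of why that direction is the easy one, the
following Python sketch rewrites a constant select condition as a shuffle
mask. It assumes the usual semantics (a true lane picks the first operand,
a false lane picks the second); select_to_shuffle_mask is a hypothetical
name, not an existing LLVM or clang function.

```python
def select_to_shuffle_mask(cond):
    """Model of rewriting select(cond, a, b) as shufflevector(a, b, mask).

    Hypothetical helper, not LLVM code. A true lane i reads a[i], which is
    shuffle index i; a false lane i reads b[i], which is index i + n.
    """
    n = len(cond)
    return [i if c else i + n for i, c in enumerate(cond)]


# Condition from @bbb: <i1 false, i1 true, i1 true, i1 false>
print(select_to_shuffle_mask([False, True, True, False]))  # [4, 1, 2, 7]
```

Going the other way needs a check that every lane stays in its own
position, which only holds for the blend-shaped subset of shuffle masks.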

If you'd still prefer to make clang emit select instructions and make
__builtin_select (or similar) available to programs, please reply to the
clang part of the patch too: http://reviews.llvm.org/D3601

Sorry about the long text,

  Filipe

On Fri, May 2, 2014 at 9:52 PM, Nadav Rotem <nrotem at apple.com> wrote:

>
> On May 2, 2014, at 6:35 PM, Filipe Cabecinhas <
> filcab+llvm.phabricator at gmail.com> wrote:
>
> I can't find any __builtin_select that I can use in clang's intrinsics
> headers.
>
>
> Can you add a new builtin?
>