[PATCH] Added tests for shufflevector lowering to blend instrs.These tests ensure that a change I will propose in clang works asexpected.

Sun May 4 00:25:26 PDT 2014

Hi Nadav,

Thanks for the comments. I think I finally understood what you meant. You'd
prefer to have the less flexible select instruction instead of the shuffles.

I see us as having some options to still implement the optimization I
wanted (there might be others, of course):
 - Only change llvm (+clang headers):
    - Why? Makes this optimization available for anything that uses llvm
    - How? Maybe a pass in instcombine could transform the vectorshuffles
that can be mapped into selects.
               We could also keep the clang headers as are, and do an
instcombine transform for the x86 blend intrinsic and turn it into a select
when possible.

 - Change clang:
    - Why? Should clang be very concerned with emitting the proper IR?
    - How? Change the intrinsics in the headers and change the places where
clang emits vectorshuffles to, instead, emit select instructions when
appropriate.

Would you be ok with the first one? Or do you think clang should be “more
careful” with the IR it emits and that we should change clang?

Thanks,

  Filipe

P. S: About the lowering of the vselect SelectionDAG node, is the proper
way to address it to set its lowering as Custom and implement the blend+imm
optimization when we can?

On Sat, May 3, 2014 at 11:29 PM, Nadav Rotem <nrotem at apple.com> wrote:

>
> On May 3, 2014, at 12:00 AM, Filipe Cabecinhas <
> filcab+llvm.phabricator at gmail.com> wrote:
>
> Hi Nadav,
>
> I don't know if I'm understanding what you're asking, so I'm just going
> to dump a bunch of information to make sure we're on the same page. It
> might get a bit long.
>
> The _mm{256,}_blend* intrinsics in clang are emitting llvm x86 intrinsics
> directly. That might make us not optimize a bunch of cases, especially
> when functions get inlined.
>
>
> I looked at lowervector_shuffle on the x86 backend and we, in fact, lower
> the appropriate vectorshuffles to blend instructions:
> LowerVECTOR_SHUFFLEtoBlend
> @ X86ISelLowering.cpp:6307
>
> Since we know we lower the shufflevectors to an appropriate blend
> instruction (we explicitly check for this exact pattern), I figured the
> patch
> is safe to be applied to clang. But to be absolutely sure we don't regress
> in the future (in llvm nor clang), I decided to also write tests to verify
> that
> we actually emit the blend instructions. These tests should have already
> been in place, since it's a special case of a shuffle vector and we want to
> be sure we emit blends when appropriate, and not a bunch of mov + pshuf,
> but they aren't there yet.
>
>
> Shuffle vector is the wrong representation for blends.  Vector selects is
> the correct representation.  Blend instructions select (per element MUX)
> between two values based on a mask, exactly like the select instruction.
>  Shufflevector instructions are very flexible and they can represent many
> kind of operations, including selects, but this is also what makes them so
> difficult to lower. We use over a thousand lines of c++ code to lower x86
> shuffles, and we still miss many patterns. You wrote that you want to use
> shuffles because you want the optimizer to optimize them but throughout the
> compiler we don’t touch shuffles (with a few exceptions) because we don’t
> want to generate new shuffle patterns that we can’t lower.
>
>
> It gets even worse when you realize that the lowering of a select-based
> blend operation is worse than the equivalent vectorshuffle. Take the
> following code:
>
> define <4 x float> @aaa(<4 x float> %a, <4 x float> %b) {
>   %1 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32
> 5, i32 6, i32 3>
>   ret <4 x float> %1
> }
>
> define <4 x float> @bbb(<4 x float> %a, <4 x float> %b) {
>   %1 = select <4 x i1> <i1 false, i1 true, i1 true, i1 false>, <4 x float>
> %a, <4 x float> %b
>   ret <4 x float> %1
> }
>
> ;; Compile with: llc -O3 -mattr=avx
>
> The @aaa function gets compiled to
> _aaa:                                   ## @aaa
>  vblendps $6, %xmm1, %xmm0, %xmm0
> retq
>
> While the @bbb function generates the following:
>
> LCPI1_0:
> .long 0                       ## 0x0
> .long 4294967295              ## 0xffffffff
>  .long 4294967295              ## 0xffffffff
> .long 0                       ## 0x0
> ...
> _bbb:                                   ## @bbb
> vmovaps LCPI1_0(%rip), %xmm2
> vblendvps %xmm2, %xmm0, %xmm1, %xmm0
>  retq
>
>
> This example only tells me that we are missing an optimization in iSel. It
> does convince me that the correct way to represent blends in the IR is
> using vector selects.
>
> Would you recommend other targets to implement their blend intrinsics as
> shuffles? No, because lowering shuffles is really complicated. On X86 we
> lower shuffles using thousands of lines of c++ code and we still miss some
> shuffles.
>
> This happens because the vselect DAG node is set to Expand, which will end
> up making it generate the 128bit constant, and ends up using the
> VBLENDPSrr instruction. While the shufflevector code will go through
> LowerVECTOR_SHUFFLEtoBlend and generate the mask for the immediate,
> picking the VBLENDPSrri version, and not touching any memory.
>
> The <8 x float> is similar. The non-avx, non-sse4 version is also much
> worse on the select case.
>
> As for adding the builtin to clang, I have no idea about how receptive
> they will be to it, I think we should discuss that possibility on the
> clang part of the patch.
>
>
> I also don’t know how to do it but I am sure it is easy to find out. One
> way to find out is to figure out how the shuffle instructions are created.
> Please a breakpoint in the IR builder and look up the call stack. The file
> that implements the lowering of the shuffle is probably where you need to
> add the code for lowering blends to selects.
>
>
> But adding the __builtin_select to clang seems to me like it's the wrong
> way to go. As far as optimizations go, it seems like it would be much
> easier to turn that a select with a ConstantInt vector as a mask into a
> shufflevector than the other way around.
>
> If you'd still prefer to make clang emit select instructions and make
> __builtin_select (or similar) available to programs, please reply to the
> clang part of the patch too: http://reviews.llvm.org/D3601
>
>
> Please implement the select builtin and lower blends to selects.
>
> Sorry about the long text,
>
>   Filipe
>
>
>
> On Fri, May 2, 2014 at 9:52 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>>
>> On May 2, 2014, at 6:35 PM, Filipe Cabecinhas <
>> filcab+llvm.phabricator at gmail.com> wrote:
>>
>> I can't find any __builtin_select that I can use in clang's intrinsics
>> headers.
>>
>>
>> Can you add a new builtin?
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140504/791a27f0/attachment.html>