[PATCH][x86] Teach how to combine a vselect into a movss/movsd.

Mon Jan 20 11:41:52 PST 2014

Thanks!

Committed revision 199683.

On Mon, Jan 20, 2014 at 7:25 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Andrea,
>
> Thanks for working on this. The patch looks excellent!
>
> -Nadav
>
> On Jan 20, 2014, at 11:22 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
>> Hi Nadav,
>>
>> I attached a new version of the patch.
>>
>> Rules are:
>> 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
>> 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
>> 3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
>> 4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
>>
>> The differences with respect to the previous version are:
>> - the target specific combine on VSELECT nodes is now run after types
>> are legalized (i.e. !DCI.isBeforeLegalize()).
>> - I slightly simplified the algorithms (no if-stmt in the loops).
>> - I used std::swap as suggested by Juergen.
>>
>> I also investigated whether it was possible to enable this new
>> transformation after DAG legalization.
>> However, The custom lowering of build_vector dag nodes changed the dag
>> sequence in a way that made it really hard to recognize my original
>> patterns.
>> In general, build_vector dag nodes used for the vselect Mask are
>> firstly expanded into a 'vector_shuffle' of constants and eventually
>> combined into either a X86ISD::VZEXT_MOVL or a 'bitcast of a load from
>> target constant pool'.
>> For simplicity, I eventually decided to enable the combine only after
>> types are legalized (i.e. `!DCI.isBeforeLegalize()`).
>>
>> Please let me know what you think about this new version of the patch
>> and if ok to submit.
>>
>> Thanks!
>> Andrea
>>
>> On Fri, Jan 17, 2014 at 9:19 PM, Andrea Di Biagio
>> <andrea.dibiagio at gmail.com> wrote:
>>> Hi Nadav and Juergen,
>>>
>>> On Fri, Jan 17, 2014 at 8:16 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>> Thanks for working on this Andrea. The transformation itself is okay, but I am worried about problems that may show up if this optimization were to fire up too early before other optimizations have a chance to optimize this select. This is really a lowering transformation I mention this because very few optimizations can (or should have to) optimize x86 specific nodes. For example, maybe A and B could be optimized into constants at some point but this optimization would prevent us from doing anything about it.  I suggest that you make sure that this optimization only runs after the operations are legalized.
>>>
>>> True, it is safer to run this after nodes are legalized.
>>>
>>> I'll change the patch so that the optimization runs after legalization.
>>> (I will also introduce the std::swap as suggested by Juergen).
>>>
>>> Thanks for the reviews!
>>> Andrea
>>>
>>>>
>>>> On Jan 16, 2014, at 5:42 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> this patch teaches the x86 backend how to combine vselect dag nodes
>>>>> into movss/movsd when possible.
>>>>>
>>>>> If the vector type of the operands of the vselect is either
>>>>> MVT::v4i32 or MVT::v4f32, then we can fold according to the following rules:
>>>>>
>>>>> 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
>>>>> 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
>>>>>
>>>>> If the vector type of the operands of the vselect is either
>>>>> MVT::v2i64 or MVT::v2f64 (and we have SSE2) , then we can fold
>>>>> according to the following rules:
>>>>>
>>>>> 3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
>>>>> 4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
>>>>>
>>>>> I added extra test cases to file 'test/CodeGen/X86/vselect.ll' in
>>>>> order to verify that we correctly select movss/movsd instructions.
>>>>>
>>>>> Before this change, the backend only knew how to lower a shufflevector
>>>>> into a X86Movss/X86Movsd, but not how to do the same with vselect dag
>>>>> nodes.
>>>>> For that reason, all the ISel patterns introduced at r197145
>>>>> http://llvm.org/viewvc/llvm-project?view=revision&revision=197145
>>>>> were only matched if the X86Movss/X86Movsd were obtained from the
>>>>> custom lowering of a shufflevector.
>>>>>
>>>>> With this change, the backend is now able to combine vselect into
>>>>> X86Movss and therefore it can reuse the patterns from revision 197145
>>>>> to further simplify packed vector arithmetic operations.
>>>>>
>>>>> I added new test-cases in 'test/CodeGen/X86/sse-scalar-fp-arith-2.ll'
>>>>> to verify that now we correctly select SSE/AVX scalar fp instructions
>>>>> from a packed arithmetic instruction followed by a vselect.
>>>>>
>>>>> After this change, the following tests started failing because they
>>>>> always expected blendvps/blendvpd instructions in the output assembly:
>>>>> test/CodeGen/X86/sse2-blend.ll
>>>>> test/CodeGen/X86/avx-blend.ll
>>>>> test/CodeGen/X86/blend-msb.ll
>>>>> test/CodeGen/X86/sse41-blend.ll
>>>>>
>>>>> Now the backend knows how to efficiently emit movss/movsd and
>>>>> therefore all the failing cases are expected failures (that is because
>>>>> the backend knows how to select movss/movsd and not only
>>>>> blendvps/blendvpd).
>>>>>
>>>>> I modified those failing tests so that - when possible - the generated
>>>>> assembly still contains the expected blendvps/blendvpd(see for example
>>>>> how I changed avx-blend.ll).
>>>>> In all other cases I just changed the CHECK lines to verify that we
>>>>> produce a movss/movsd.
>>>>>
>>>>> Please let me know if ok to submit.
>>>>>
>>>>> Thanks,
>>>>> Andrea Di Biagio
>>>>> SN Systems - Sony Computer Entertainment Group.
>>>>> <patch-vselect.diff>
>>>>
>> <patch-v2.diff>
>