[PATCH][x86] Teach how to combine a vselect into a movss/movsd.

Nadav Rotem nrotem at apple.com
Mon Jan 20 11:25:28 PST 2014


Andrea, 

Thanks for working on this. The patch looks excellent!

-Nadav

On Jan 20, 2014, at 11:22 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi Nadav,
> 
> I attached a new version of the patch.
> 
> Rules are:
> 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
> 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
> 3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
> 4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
> 
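> Roughly, the combine boils down to something like the following sketch
> (the function name and overall structure here are illustrative only and
> are not taken from the actual patch):
> 
>   // Hypothetical sketch: given (vselect Mask, A, B) where Mask is a
>   // constant build_vector of one of the forms above, rewrite the node
>   // as X86ISD::MOVSS / X86ISD::MOVSD.
>   static SDValue combineVSelectToMovl(SDNode *N, SelectionDAG &DAG) {
>     SDValue Cond = N->getOperand(0);
>     SDValue A = N->getOperand(1);
>     SDValue B = N->getOperand(2);
>     EVT VT = N->getValueType(0);
> 
>     if (Cond.getOpcode() != ISD::BUILD_VECTOR)
>       return SDValue();
>     if (VT != MVT::v4i32 && VT != MVT::v4f32 &&
>         VT != MVT::v2i64 && VT != MVT::v2f64)
>       return SDValue();
> 
>     // Lane 0 of the mask decides which operand supplies the scalar lane:
>     //   (0, -1, ..., -1) -> (movss/movsd A, B)
>     //   (-1, 0, ...,  0) -> (movss/movsd B, A)
>     ConstantSDNode *Lane0 = dyn_cast<ConstantSDNode>(Cond.getOperand(0));
>     if (!Lane0 || (!Lane0->isNullValue() && !Lane0->isAllOnesValue()))
>       return SDValue();
>     bool LowFromB = Lane0->isNullValue();
>     if (!LowFromB)
>       std::swap(A, B);
> 
>     // Every remaining mask lane must be the complement of lane 0.
>     unsigned NumElts = VT.getVectorNumElements();
>     for (unsigned i = 1; i != NumElts; ++i) {
>       ConstantSDNode *C = dyn_cast<ConstantSDNode>(Cond.getOperand(i));
>       if (!C || (LowFromB ? !C->isAllOnesValue() : !C->isNullValue()))
>         return SDValue();
>     }
> 
>     unsigned Opcode = (NumElts == 4) ? X86ISD::MOVSS : X86ISD::MOVSD;
>     return DAG.getNode(Opcode, SDLoc(N), VT, A, B);
>   }
> 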
> The differences with respect to the previous version are:
> - the target-specific combine on VSELECT nodes is now run after types
> are legalized (i.e. !DCI.isBeforeLegalize()).
> - I slightly simplified the algorithms (no if-stmt in the loops).
> - I used std::swap as suggested by Juergen.
> 
> I also investigated whether it was possible to enable this new
> transformation after DAG legalization.
> However, the custom lowering of build_vector dag nodes changed the dag
> sequence in a way that made it really hard to recognize my original
> patterns.
> In general, build_vector dag nodes used for the vselect Mask are
> first expanded into a 'vector_shuffle' of constants and eventually
> combined into either an X86ISD::VZEXT_MOVL or a 'bitcast of a load from
> the target constant pool'.
> For simplicity, I eventually decided to enable the combine only after
> types are legalized (i.e. `!DCI.isBeforeLegalize()`).
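> 
> For reference, the timing check amounts to something like this (again a
> hypothetical sketch, reusing the illustrative combineVSelectToMovl from
> the earlier sketch; the real patch may be structured differently):
> 
>   static SDValue PerformVSELECTCombine(SDNode *N, SelectionDAG &DAG,
>                                        TargetLowering::DAGCombinerInfo &DCI) {
>     // Only attempt the movss/movsd folds once types have been legalized,
>     // so that earlier target-independent combines still see a plain
>     // ISD::VSELECT node.
>     if (DCI.isBeforeLegalize())
>       return SDValue();
>     return combineVSelectToMovl(N, DAG);
>   }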
> 
> Please let me know what you think about this new version of the patch
> and whether it is OK to submit.
> 
> Thanks!
> Andrea
> 
> On Fri, Jan 17, 2014 at 9:19 PM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>> Hi Nadav and Juergen,
>> 
>> On Fri, Jan 17, 2014 at 8:16 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>> Thanks for working on this, Andrea. The transformation itself is okay, but I am worried about problems that may show up if this optimization were to fire too early, before other optimizations have a chance to optimize this select. This is really a lowering transformation. I mention this because very few optimizations can (or should have to) optimize x86-specific nodes. For example, maybe A and B could be optimized into constants at some point, but this optimization would prevent us from doing anything about it. I suggest that you make sure that this optimization only runs after the operations are legalized.
>> 
>> True, it is safer to run this after nodes are legalized.
>> 
>> I'll change the patch so that the optimization runs after legalization.
>> (I will also introduce the std::swap as suggested by Juergen).
>> 
>> Thanks for the reviews!
>> Andrea
>> 
>>> 
>>> On Jan 16, 2014, at 5:42 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> this patch teaches the x86 backend how to combine vselect dag nodes
>>>> into movss/movsd when possible.
>>>> 
>>>> If the vector type of the operands of the vselect is either
>>>> MVT::v4i32 or MVT::v4f32, then we can fold according to the following rules:
>>>> 
>>>> 1.  fold (vselect (build_vector (0, -1, -1, -1)), A, B) -> (movss A, B);
>>>> 2.  fold (vselect (build_vector (-1, 0, 0, 0)), A, B) -> (movss B, A)
>>>> 
>>>> If the vector type of the operands of the vselect is either
>>>> MVT::v2i64 or MVT::v2f64 (and we have SSE2), then we can fold
>>>> according to the following rules:
>>>> 
>>>> 3.  fold (vselect (build_vector (0, -1)), A, B) -> (movsd A, B)
>>>> 4.  fold (vselect (build_vector (-1, 0)), A, B) -> (movsd B, A)
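>>>> 
>>>> A minimal sketch of how that SSE2 requirement could be guarded, assuming
>>>> VT is the vselect value type and Subtarget is the usual X86Subtarget
>>>> pointer (hypothetical code, not taken from the patch):
>>>> 
>>>>   // movsd requires SSE2, whereas movss only needs SSE.
>>>>   if ((VT == MVT::v2i64 || VT == MVT::v2f64) && !Subtarget->hasSSE2())
>>>>     return SDValue();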
>>>> 
>>>> I added extra test cases to file 'test/CodeGen/X86/vselect.ll' in
>>>> order to verify that we correctly select movss/movsd instructions.
>>>> 
>>>> Before this change, the backend only knew how to lower a shufflevector
>>>> into an X86Movss/X86Movsd, but not how to do the same with vselect dag
>>>> nodes.
>>>> For that reason, all the ISel patterns introduced at r197145
>>>> http://llvm.org/viewvc/llvm-project?view=revision&revision=197145
>>>> were only matched if the X86Movss/X86Movsd were obtained from the
>>>> custom lowering of a shufflevector.
>>>> 
>>>> With this change, the backend is now able to combine a vselect into
>>>> X86Movss/X86Movsd and can therefore reuse the patterns from revision
>>>> 197145 to further simplify packed vector arithmetic operations.
>>>> 
>>>> I added new test-cases in 'test/CodeGen/X86/sse-scalar-fp-arith-2.ll'
>>>> to verify that now we correctly select SSE/AVX scalar fp instructions
>>>> from a packed arithmetic instruction followed by a vselect.
>>>> 
>>>> After this change, the following tests started failing because they
>>>> always expected blendvps/blendvpd instructions in the output assembly:
>>>> test/CodeGen/X86/sse2-blend.ll
>>>> test/CodeGen/X86/avx-blend.ll
>>>> test/CodeGen/X86/blend-msb.ll
>>>> test/CodeGen/X86/sse41-blend.ll
>>>> 
>>>> Now that the backend knows how to efficiently emit movss/movsd, all the
>>>> failing cases are expected failures: the backend selects movss/movsd
>>>> where it previously could only use blendvps/blendvpd.
>>>> 
>>>> I modified those failing tests so that, when possible, the generated
>>>> assembly still contains the expected blendvps/blendvpd (see, for example,
>>>> how I changed avx-blend.ll).
>>>> In all other cases I just changed the CHECK lines to verify that we
>>>> produce a movss/movsd.
>>>> 
>>>> Please let me know if ok to submit.
>>>> 
>>>> Thanks,
>>>> Andrea Di Biagio
>>>> SN Systems - Sony Computer Entertainment Group.
>>>> <patch-vselect.diff>
>>> 
> <patch-v2.diff>



