[PATCH][X86] Improve the lowering of BITCAST dag nodes from type f64 to type v2i32 (and vice versa).

Tue May 6 10:16:53 PDT 2014

Thanks Nadav :)
Committed revision 208107.

On Tue, May 6, 2014 at 5:28 PM, Nadav Rotem <nrotem at apple.com> wrote:
> LGTM.
>
> Thanks Andrea!
>
> On May 6, 2014, at 7:59 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
>> Hi,
>>
>> The goal of this patch is to simplify the bitconvert from type
>> MVT::f64 to type MVT::v2i32 (and vice versa).
>>
>> When legalizing an ISD::BITCAST dag node from MVT::f64 to MVT::v2i32,
>> we now produce a cheaper SCALAR_TO_VECTOR (to a vector of type v2f64)
>> followed by a 'free' bitcast to v4i32. The elements of the resulting
>> v4i32 are then extracted to eventually build the resulting v2i32
>> vector. This is cheaper than introducing a store+load sequence to
>> convert the input operand from type f64 to i64.
>>
>> Without this patch, type legalization first converts the f64 operand
>> of an ISD::BITCAST dag node from MVT::f64 to MVT::v2i32 into an i64,
>> and then uses that i64 to build a vector of type v2i64.
>> The backend introduces a new v2i64 vector because value type
>> MVT::v2i32 is illegal and requires promotion to the next legal vector
>> type with the same number of elements (in this case, MVT::v2i64).
>> The conversion from f64 to i64 is done by storing the value to a stack
>> location and then loading it back from that same location as an i64.
>>
>> This patch is beneficial for example in the following case:
>>
>> define double @test(double %A) {
>>  %1 = bitcast double %A to <2 x i32>
>>  %add = add <2 x i32> %1, <i32 3, i32 5>
>>  %2 = bitcast <2 x i32> %add to double
>>  ret double %2
>> }
>>
>> Before we produced:
>>   movsd %xmm0, -8(%rsp)
>>   movq -8(%rsp), %xmm0
>>   pshufd $16, %xmm0, %xmm0
>>   paddq .LCPI0_0(%rip), %xmm0
>>   pshufd $8, %xmm0, %xmm0
>>   movq %xmm0, -16(%rsp)
>>   movsd -16(%rsp), %xmm0
>>   retq
>>
>> With this patch we produce a much cleaner sequence:
>>   pshufd $16, %xmm0, %xmm0
>>   paddq .LCPI0_0(%rip), %xmm0
>>   pshufd $8, %xmm0, %xmm0
>>
>>
>> Function @t4 from test 'ret-mmx.ll' is another example of a function
>> that is greatly simplified by this transformation. Before, we produced
>> a sequence of 8 instructions for @t4. Now the entire function
>> is optimized into a single 'movsd' instruction.
>>
>> Back to function @test from the example,
>> with this patch we would produce a sequence of pshufd+paddq+pshufd.
>> Ideally we should be able to fold that entire sequence into a single paddd.
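
To illustrate why that fold looks safe, here is a small Python simulation of the three-instruction sequence on four 32-bit lanes, compared against a single paddd. This is a sketch under assumptions: the contents of .LCPI0_0 are not shown above, so the constant is assumed to be <2 x i64> <3, 5>, and the function names are made up for the example.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def pshufd(imm8, src):
    # Each destination lane i selects source lane (imm8 >> 2*i) & 3.
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]

def paddq(a, b):
    # Treat the four 32-bit lanes as two 64-bit lanes and add with wraparound.
    out = []
    for i in (0, 2):
        x = a[i] | (a[i + 1] << 32)
        y = b[i] | (b[i + 1] << 32)
        s = (x + y) & MASK64
        out += [s & MASK32, s >> 32]
    return out

def paddd(a, b):
    # Lane-wise 32-bit add with wraparound.
    return [(x + y) & MASK32 for x, y in zip(a, b)]

# ASSUMPTION: constant pool entry is <2 x i64> <3, 5>, shown as 32-bit lanes.
LCPI0_0 = [3, 0, 5, 0]

def shuffled_sequence(v):
    v = pshufd(16, v)       # [v0, v0, v1, v0]: each i32 widened into an i64 slot
    v = paddq(v, LCPI0_0)   # 64-bit adds; low halves hold v0+3 and v1+5
    return pshufd(8, v)     # lanes 0 and 2 moved back into lanes 0 and 1

def single_paddd(v):
    return paddd(v, [3, 5, 0, 0])
```

Under these assumptions the two versions agree on lanes 0 and 1, which are the only lanes that make up the v2i32 result. The upper lanes differ, so a real combine would have to show that they are never observed.
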
>>
>> Another patch will follow that improves the dagcombiner to spot
>> sequences of shuffle+binop+shuffle which can be safely folded into a
>> single binop.
>>
>> Please let me know if ok to submit.
>>
>> Thanks,
>> Andrea Di Biagio
>> SN Systems - Sony Computer Entertainment Group.
>> <patch-lower-bitcast.diff>
>