[PATCH] [X86, AVX] adjust tablegen patterns to generate better code for scalar insertion into zero vector (PR23073)

Andrea Di Biagio andrea.dibiagio at gmail.com
Thu Apr 2 12:10:12 PDT 2015


On Thu, Apr 2, 2015 at 7:17 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> Hi Andrea,
>
> I don't doubt that blends have better throughput on SB/Haswell, but the
> change should not have been applied to all subtargets universally because of
> the size disadvantage of blend instructions.

Right, I agree.
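
To make the size argument concrete, this is roughly what the two
lowerings look like for zero-extending the low f32 lane of an xmm
register (register numbers are illustrative only, not taken from the
tests in this patch):

  # Blend form produced by the current UseAVX/UseSSE41 patterns:
  vxorps   %xmm1, %xmm1, %xmm1
  vblendps $1, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0],xmm1[1,2,3]

  # movss form: same result, but a shorter encoding (no imm8 and a
  # shorter opcode escape), which is where the 1-2 bytes go:
  vxorps   %xmm1, %xmm1, %xmm1
  vmovss   %xmm0, %xmm1, %xmm0       # xmm0 = xmm0[0],xmm1[1,2,3]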

>
> If fixing these up after regalloc is the preferred solution, I'll look into
> that. But does limiting these patterns to match specific chips in tablegen
> change anything in the shuffle lowering logic?

Changing the tablegen patterns to match specific chips makes sense to me.
In retrospect, the idea of having a combine could still make sense to
improve the matching of the movss/d memory-register variants. However,
that is a completely different problem.
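
Something like this rough, untested sketch is what I have in mind (the
FastBlend predicate is made up and would need a matching
SubtargetFeature; the pattern itself just mirrors the existing UseAVX
block in X86InstrSSE.td):

  // Keep the blend form only where it has the throughput advantage
  // (SandyBridge/IvyBridge/Haswell); FastBlend is hypothetical.
  let Predicates = [UseAVX, FastBlend] in {
    def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
              (VBLENDPSrri (v4f32 (V_SET0)), VR128:$src, (i8 1))>;
  }
  // The movss/movsd forms would then sit under the complementary
  // predicate (and/or OptForSize), so other subtargets get the
  // shorter encoding back.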

>
>
> On Thu, Apr 2, 2015 at 11:54 AM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>>
>> Hi Sanjay,
>>
>> On Thu, Apr 2, 2015 at 6:01 PM, Sanjay Patel <spatel at rotateright.com>
>> wrote:
>> > Patch updated again:
>> >
>> > I removed all of the changes related to blend vs. movs, so this patch is
>> > now purely about adjusting the AddedComplexity to fix PR23073.
>> >
>> > I did some svn blaming and see the reasoning for the blend patterns.
>> > These were added in r219022 by Chandler. But I think that change
>> > overstepped, so I've put some FIXMEs in here. I think the procedure is to
>> > follow up on the commit mail for that checkin, so I'll do that next.
>>
>> Hi Sanjay,
>>
>> I don't think those patterns are a mistake. Blend instructions always
>> have better reciprocal throughput than movss on
>> SandyBridge/IvyBridge/Haswell. On Haswell, a blend instruction has
>> 0.33 reciprocal throughput because it can be scheduled for execution
>> on three different ports. On Jaguar and other AMD chips, blendps
>> doesn't have better throughput than movss, so movss may be the
>> better choice.
>>
>> I remember this problem was raised a while ago during the evaluation
>> of the new shuffle lowering, and Chandler suggested adding some logic
>> to simplify the machine code (after regalloc), matching all the
>> complex variants of movs[s|d] (and potentially converting blends to
>> movs if necessary).
>> From a 'shuffle lowering' (and ISel) point of view, it was easier to
>> reason in terms of blends rather than movss; movss/d have a
>> memory-register form that is quite complicated to match (sketch
>> below).
>>
>> -Andrea
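
To spell out the "complicated to match" point: the register form is a
single pattern, but the zero-extending scalar load shows up in several
different DAG shapes that all have to funnel into the same VMOVSSrm.
Roughly (paraphrased from memory, not copied verbatim from
X86InstrSSE.td):

  // The load can reach ISel as a vzmovl of a scalar_to_vector load,
  // as a plain scalar_to_vector load, or as a vzmovl of a full vector
  // load; each shape needs its own pattern.
  def : Pat<(v4f32 (X86vzmovl (v4f32 (scalar_to_vector (loadf32 addr:$src))))),
            (COPY_TO_REGCLASS (VMOVSSrm addr:$src), VR128)>;
  def : Pat<(v4f32 (scalar_to_vector (loadf32 addr:$src))),
            (COPY_TO_REGCLASS (VMOVSSrm addr:$src), VR128)>;
  def : Pat<(v4f32 (X86vzmovl (loadv4f32 addr:$src))),
            (COPY_TO_REGCLASS (VMOVSSrm addr:$src), VR128)>;

And the f64/movsd variants roughly double that, on top of the 256-bit
insert_subvector cases like the one touched by this patch.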
>>
>> >
>> >
>> > http://reviews.llvm.org/D8794
>> >
>> > Files:
>> >   lib/Target/X86/X86InstrSSE.td
>> >   test/CodeGen/X86/vector-shuffle-256-v4.ll
>> >   test/CodeGen/X86/vector-shuffle-256-v8.ll
>> >
>> > Index: lib/Target/X86/X86InstrSSE.td
>> > ===================================================================
>> > --- lib/Target/X86/X86InstrSSE.td
>> > +++ lib/Target/X86/X86InstrSSE.td
>> > @@ -7168,6 +7168,10 @@
>> >  }
>> >
>> >  // Patterns
>> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
>> > +// on targets where they have equal performance. These were changed to use
>> > +// blends because blends have better throughput on SandyBridge and Haswell, but
>> > +// movs[s/d] are 1-2 byte shorter instructions.
>> >  let Predicates = [UseAVX] in {
>> >    let AddedComplexity = 15 in {
>> >    // Move scalar to XMM zero-extended, zeroing a VR128 then do a
>> > @@ -7184,8 +7188,10 @@
>> >    // Move low f32 and clear high bits.
>> >    def : Pat<(v8f32 (X86vzmovl (v8f32 VR256:$src))),
>> >              (VBLENDPSYrri (v8f32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > -  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
>> > -            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > +
>> > +  // Move low f64 and clear high bits.
>> > +  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
>> > +            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> >    }
>> >
>> >    def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
>> > @@ -7199,14 +7205,19 @@
>> >                             (v2f64 (VMOVSDrr (v2f64 (V_SET0)), FR64:$src)),
>> >                             sub_xmm)>;
>> >
>> > -  // Move low f64 and clear high bits.
>> > -  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
>> > -            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > -
>> > +  // These will incur an FP/int domain crossing penalty, but it may be the only
>> > +  // way without AVX2. Do not add any complexity because we may be able to match
>> > +  // more optimal patterns defined earlier in this file.
>> > +  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
>> > +            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> >    def : Pat<(v4i64 (X86vzmovl (v4i64 VR256:$src))),
>> >              (VBLENDPDYrri (v4i64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> >  }
>> >
>> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
>> > +// on targets where they have equal performance. These were changed to use
>> > +// blends because blends have better throughput on SandyBridge and Haswell, but
>> > +// movs[s/d] are 1-2 byte shorter instructions.
>> >  let Predicates = [UseSSE41] in {
>> >    // With SSE41 we can use blends for these patterns.
>> >    def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
>> > Index: test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > ===================================================================
>> > --- test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > +++ test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > @@ -843,8 +843,9 @@
>> >  define <4 x double> @insert_reg_and_zero_v4f64(double %a) {
>> >  ; ALL-LABEL: insert_reg_and_zero_v4f64:
>> >  ; ALL:       # BB#0:
>> > -; ALL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
>> > -; ALL-NEXT:    vmovsd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
>> > +; ALL-NEXT:    # kill: XMM0<def> XMM0<kill> YMM0<def>
>> > +; ALL-NEXT:    vxorpd %ymm1, %ymm1, %ymm1
>> > +; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
>> >  ; ALL-NEXT:    retq
>> >    %v = insertelement <4 x double> undef, double %a, i32 0
>> >    %shuffle = shufflevector <4 x double> %v, <4 x double> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
>> > Index: test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > ===================================================================
>> > --- test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > +++ test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > @@ -133,8 +133,6 @@
>> >  ; AVX2:       # BB#0:
>> >  ; AVX2-NEXT:    movl $7, %eax
>> >  ; AVX2-NEXT:    vmovd %eax, %xmm1
>> > -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
>> > -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>> >  ; AVX2-NEXT:    vpermps %ymm0, %ymm1, %ymm0
>> >  ; AVX2-NEXT:    retq
>> >    %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>> > @@ -962,8 +960,6 @@
>> >  ; AVX2:       # BB#0:
>> >  ; AVX2-NEXT:    movl $7, %eax
>> >  ; AVX2-NEXT:    vmovd %eax, %xmm1
>> > -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
>> > -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>> >  ; AVX2-NEXT:    vpermd %ymm0, %ymm1, %ymm0
>> >  ; AVX2-NEXT:    retq
>> >    %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>> >
>> > EMAIL PREFERENCES
>> >   http://reviews.llvm.org/settings/panel/emailpreferences/
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> >
>
>


