[PATCH] [X86, AVX] adjust tablegen patterns to generate better code for scalar insertion into zero vector (PR23073)

Sanjay Patel spatel at rotateright.com
Thu Apr 2 11:17:27 PDT 2015


Hi Andrea,

I don't doubt that blends have better throughput on SB/Haswell, but the
change should not have been applied to all subtargets, given the code-size
disadvantage of blend instructions.

If fixing these up after regalloc is the preferred solution, I'll look into
that. But does limiting these patterns to match specific chips in tablegen
change anything in the shuffle lowering logic?
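
To make the question concrete, here is a rough sketch of what I have in
mind (this is NOT part of D8794; "FastBlend" is a hypothetical predicate,
not an existing subtarget feature, and the exact patterns are only
illustrative):

  // Keep the blend form only on subtargets where blends actually win.
  let Predicates = [UseAVX, FastBlend] in {
    def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
              (VBLENDPSrri (v4f32 (V_SET0)), VR128:$src, (i8 1))>;
  }
  // ...and a complementary [UseAVX, NotFastBlend] block would keep the
  // 1-2 byte shorter VMOVSSrr form for all other subtargets.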


On Thu, Apr 2, 2015 at 11:54 AM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:

> Hi Sanjay,
>
> On Thu, Apr 2, 2015 at 6:01 PM, Sanjay Patel
> <spatel at rotateright.com> wrote:
> > Patch updated again:
> >
> > I removed all of the changes related to blend vs. movs, so this patch
> > is now purely about adjusting the AddedComplexity to fix PR23073.
> >
> > I did some svn blaming and see the reasoning for the blend patterns.
> > These were added in r219022 by Chandler. But I think that change
> > overstepped, so I've put some FIXMEs in here. I think the procedure is
> > to follow up on the commit mail for that checkin, so I'll do that next.
>
> Hi Sanjay,
>
> I don't think those patterns are a mistake. Blend instructions always
> have better reciprocal throughput than movss on
> SandyBridge/IvyBridge/Haswell. On Haswell, a blend instruction has
> 0.33 reciprocal throughput because it can be scheduled for execution
> on three different ports. On Jaguar and other AMD chips, blendps
> doesn't have better throughput than movss, so movss may be a better
> choice.
>
> I remember this problem was raised a while ago during the evaluation
> of the new shuffle lowering, and Chandler suggested adding some logic
> to simplify the machine code (after regalloc), matching all the complex
> variants of movs[s|d] (and potentially converting blends to movs if
> necessary).
> From a 'shuffle lowering' (and ISel) point of view, it was easier to
> reason in terms of blends rather than movss; movss/d have a
> memory-register form that is quite complicated to match.
>
> -Andrea
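
(For anyone following along: the load-folding forms Andrea refers to look
roughly like the patterns below -- I'm paraphrasing X86InstrSSE.td from
memory, so treat the details as approximate.)

  // MOVSSrm already zeroes the upper lanes, so ISel has to recognize
  // several equivalent dags for the memory form, e.g.:
  def : Pat<(v4f32 (X86vzmovl (v4f32 (scalar_to_vector (loadf32 addr:$src))))),
            (COPY_TO_REGCLASS (VMOVSSrm addr:$src), VR128)>;
  def : Pat<(v4f32 (X86vzmovl (loadv4f32 addr:$src))),
            (COPY_TO_REGCLASS (VMOVSSrm addr:$src), VR128)>;
  // ...versus the single register-register form covered by the blend
  // patterns in the patch below:
  //   (v4f32 (X86vzmovl (v4f32 VR128:$src)))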
>
> >
> >
> > http://reviews.llvm.org/D8794
> >
> > Files:
> >   lib/Target/X86/X86InstrSSE.td
> >   test/CodeGen/X86/vector-shuffle-256-v4.ll
> >   test/CodeGen/X86/vector-shuffle-256-v8.ll
> >
> > Index: lib/Target/X86/X86InstrSSE.td
> > ===================================================================
> > --- lib/Target/X86/X86InstrSSE.td
> > +++ lib/Target/X86/X86InstrSSE.td
> > @@ -7168,6 +7168,10 @@
> >  }
> >
> >  // Patterns
> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
> > +// on targets where they have equal performance. These were changed to use
> > +// blends because blends have better throughput on SandyBridge and Haswell, but
> > +// movs[s/d] are 1-2 byte shorter instructions.
> >  let Predicates = [UseAVX] in {
> >    let AddedComplexity = 15 in {
> >    // Move scalar to XMM zero-extended, zeroing a VR128 then do a
> > @@ -7184,8 +7188,10 @@
> >    // Move low f32 and clear high bits.
> >    def : Pat<(v8f32 (X86vzmovl (v8f32 VR256:$src))),
> >              (VBLENDPSYrri (v8f32 (AVX_SET0)), VR256:$src, (i8 1))>;
> > -  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
> > -            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
> > +
> > +  // Move low f64 and clear high bits.
> > +  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
> > +            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
> >    }
> >
> >    def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
> > @@ -7199,14 +7205,19 @@
> >                             (v2f64 (VMOVSDrr (v2f64 (V_SET0)), FR64:$src)),
> >                             sub_xmm)>;
> >
> > -  // Move low f64 and clear high bits.
> > -  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
> > -            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
> > -
> > +  // These will incur an FP/int domain crossing penalty, but it may be the only
> > +  // way without AVX2. Do not add any complexity because we may be able to match
> > +  // more optimal patterns defined earlier in this file.
> > +  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
> > +            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
> >    def : Pat<(v4i64 (X86vzmovl (v4i64 VR256:$src))),
> >              (VBLENDPDYrri (v4i64 (AVX_SET0)), VR256:$src, (i8 1))>;
> >  }
> >
> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
> > +// on targets where they have equal performance. These were changed to use
> > +// blends because blends have better throughput on SandyBridge and Haswell, but
> > +// movs[s/d] are 1-2 byte shorter instructions.
> >  let Predicates = [UseSSE41] in {
> >    // With SSE41 we can use blends for these patterns.
> >    def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
> > Index: test/CodeGen/X86/vector-shuffle-256-v4.ll
> > ===================================================================
> > --- test/CodeGen/X86/vector-shuffle-256-v4.ll
> > +++ test/CodeGen/X86/vector-shuffle-256-v4.ll
> > @@ -843,8 +843,9 @@
> >  define <4 x double> @insert_reg_and_zero_v4f64(double %a) {
> >  ; ALL-LABEL: insert_reg_and_zero_v4f64:
> >  ; ALL:       # BB#0:
> > -; ALL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> > -; ALL-NEXT:    vmovsd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
> > +; ALL-NEXT:    # kill: XMM0<def> XMM0<kill> YMM0<def>
> > +; ALL-NEXT:    vxorpd %ymm1, %ymm1, %ymm1
> > +; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
> >  ; ALL-NEXT:    retq
> >    %v = insertelement <4 x double> undef, double %a, i32 0
> >    %shuffle = shufflevector <4 x double> %v, <4 x double> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
> > Index: test/CodeGen/X86/vector-shuffle-256-v8.ll
> > ===================================================================
> > --- test/CodeGen/X86/vector-shuffle-256-v8.ll
> > +++ test/CodeGen/X86/vector-shuffle-256-v8.ll
> > @@ -133,8 +133,6 @@
> >  ; AVX2:       # BB#0:
> >  ; AVX2-NEXT:    movl $7, %eax
> >  ; AVX2-NEXT:    vmovd %eax, %xmm1
> > -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
> > -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
> >  ; AVX2-NEXT:    vpermps %ymm0, %ymm1, %ymm0
> >  ; AVX2-NEXT:    retq
> >    %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
> > @@ -962,8 +960,6 @@
> >  ; AVX2:       # BB#0:
> >  ; AVX2-NEXT:    movl $7, %eax
> >  ; AVX2-NEXT:    vmovd %eax, %xmm1
> > -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
> > -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
> >  ; AVX2-NEXT:    vpermd %ymm0, %ymm1, %ymm0
> >  ; AVX2-NEXT:    retq
> >    %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
> >