[PATCH] [X86, AVX] adjust tablegen patterns to generate better code for scalar insertion into zero vector (PR23073)

Andrea Di Biagio andrea.dibiagio at gmail.com
Thu Apr 2 10:54:34 PDT 2015


Hi Sanjay,

On Thu, Apr 2, 2015 at 6:01 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> Patch updated again:
>
> I removed all of the changes related to blend vs. movs, so this patch is now purely about adjusting the AddedComplexity to fix PR23073.
>
> I did some svn blaming and see the reasoning for the blend patterns. These were added in r219022 by Chandler. But I think that change overstepped, so I've put some FIXMEs in here. I think the procedure is to follow up on the commit mail for that checkin, so I'll do that next.

I don't think those patterns are a mistake. Blend instructions always
have better reciprocal throughput than movss on
SandyBridge/IvyBridge/Haswell. On Haswell, a blend has a reciprocal
throughput of 0.33 because it can be scheduled for execution on any of
three ports. On Jaguar and other AMD chips, blendps has no throughput
advantage over movss, so movss may be the better choice there.
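
For the register-register case the two lowerings are interchangeable:
both insert the scalar into a zero vector. A quick model of the lane
semantics (plain Python, purely illustrative, not LLVM code):

```python
# Model of the two lowerings for (v4f32 (X86vzmovl VR128:$src)):
# a blendps with immediate 1 against a zeroed register, versus a
# movss (register form) into a zeroed register.

def blend_ps(a, b, imm):
    """blendps a, b, imm: bit i of imm selects b[i], otherwise a[i]."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def move_ss(a, b):
    """movss a, b (register form): b[0] into lane 0, upper lanes from a."""
    return [b[0]] + a[1:]

zero = [0.0, 0.0, 0.0, 0.0]
src = [1.0, 2.0, 3.0, 4.0]

# Both produce the scalar inserted into a zero vector.
assert blend_ps(zero, src, 1) == move_ss(zero, src) == [1.0, 0.0, 0.0, 0.0]
```

So the choice between them is purely about throughput and code size,
which is what the FIXMEs in the patch are getting at.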

I remember this problem was raised a while ago during the evaluation
of the new shuffle lowering, and Chandler suggested adding some logic
to simplify the machine code (after regalloc) by matching all the
complex variants of movs[s|d] (and potentially converting blends to
movs where necessary).
From a 'shuffle lowering' (and ISel) point of view, it was easier to
reason in terms of blends than movss: movss/movsd have a
memory-register form that is quite complicated to match.

-Andrea

>
>
> http://reviews.llvm.org/D8794
>
> Files:
>   lib/Target/X86/X86InstrSSE.td
>   test/CodeGen/X86/vector-shuffle-256-v4.ll
>   test/CodeGen/X86/vector-shuffle-256-v8.ll
>
> Index: lib/Target/X86/X86InstrSSE.td
> ===================================================================
> --- lib/Target/X86/X86InstrSSE.td
> +++ lib/Target/X86/X86InstrSSE.td
> @@ -7168,6 +7168,10 @@
>  }
>
>  // Patterns
> +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
> +// on targets where they have equal performance. These were changed to use
> +// blends because blends have better throughput on SandyBridge and Haswell, but
> +// movs[s/d] are 1-2 byte shorter instructions.
>  let Predicates = [UseAVX] in {
>    let AddedComplexity = 15 in {
>    // Move scalar to XMM zero-extended, zeroing a VR128 then do a
> @@ -7184,8 +7188,10 @@
>    // Move low f32 and clear high bits.
>    def : Pat<(v8f32 (X86vzmovl (v8f32 VR256:$src))),
>              (VBLENDPSYrri (v8f32 (AVX_SET0)), VR256:$src, (i8 1))>;
> -  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
> -            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
> +
> +  // Move low f64 and clear high bits.
> +  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
> +            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
>    }
>
>    def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
> @@ -7199,14 +7205,19 @@
>                             (v2f64 (VMOVSDrr (v2f64 (V_SET0)), FR64:$src)),
>                             sub_xmm)>;
>
> -  // Move low f64 and clear high bits.
> -  def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
> -            (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
> -
> +  // These will incur an FP/int domain crossing penalty, but it may be the only
> +  // way without AVX2. Do not add any complexity because we may be able to match
> +  // more optimal patterns defined earlier in this file.
> +  def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
> +            (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
>    def : Pat<(v4i64 (X86vzmovl (v4i64 VR256:$src))),
>              (VBLENDPDYrri (v4i64 (AVX_SET0)), VR256:$src, (i8 1))>;
>  }
>
> +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
> +// on targets where they have equal performance. These were changed to use
> +// blends because blends have better throughput on SandyBridge and Haswell, but
> +// movs[s/d] are 1-2 byte shorter instructions.
>  let Predicates = [UseSSE41] in {
>    // With SSE41 we can use blends for these patterns.
>    def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
> Index: test/CodeGen/X86/vector-shuffle-256-v4.ll
> ===================================================================
> --- test/CodeGen/X86/vector-shuffle-256-v4.ll
> +++ test/CodeGen/X86/vector-shuffle-256-v4.ll
> @@ -843,8 +843,9 @@
>  define <4 x double> @insert_reg_and_zero_v4f64(double %a) {
>  ; ALL-LABEL: insert_reg_and_zero_v4f64:
>  ; ALL:       # BB#0:
> -; ALL-NEXT:    vxorpd %xmm1, %xmm1, %xmm1
> -; ALL-NEXT:    vmovsd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
> +; ALL-NEXT:    # kill: XMM0<def> XMM0<kill> YMM0<def>
> +; ALL-NEXT:    vxorpd %ymm1, %ymm1, %ymm1
> +; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
>  ; ALL-NEXT:    retq
>    %v = insertelement <4 x double> undef, double %a, i32 0
>    %shuffle = shufflevector <4 x double> %v, <4 x double> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
> Index: test/CodeGen/X86/vector-shuffle-256-v8.ll
> ===================================================================
> --- test/CodeGen/X86/vector-shuffle-256-v8.ll
> +++ test/CodeGen/X86/vector-shuffle-256-v8.ll
> @@ -133,8 +133,6 @@
>  ; AVX2:       # BB#0:
>  ; AVX2-NEXT:    movl $7, %eax
>  ; AVX2-NEXT:    vmovd %eax, %xmm1
> -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
> -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>  ; AVX2-NEXT:    vpermps %ymm0, %ymm1, %ymm0
>  ; AVX2-NEXT:    retq
>    %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
> @@ -962,8 +960,6 @@
>  ; AVX2:       # BB#0:
>  ; AVX2-NEXT:    movl $7, %eax
>  ; AVX2-NEXT:    vmovd %eax, %xmm1
> -; AVX2-NEXT:    vxorps %ymm2, %ymm2, %ymm2
> -; AVX2-NEXT:    vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>  ; AVX2-NEXT:    vpermd %ymm0, %ymm1, %ymm0
>  ; AVX2-NEXT:    retq
>    %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>