[PATCH] [X86, AVX] adjust tablegen patterns to generate better code for scalar insertion into zero vector (PR23073)
Andrea Di Biagio
andrea.dibiagio at gmail.com
Thu Apr 2 12:10:12 PDT 2015
On Thu, Apr 2, 2015 at 7:17 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> Hi Andrea,
>
> I don't doubt that blends have better throughput on SB/Haswell, but the
> change should not have been applied to all subtargets universally because of
> the size disadvantage of blend instructions.
Right, I agree.
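
(For reference, here is a minimal IR sketch of the scalar-insertion-into-zero-vector
pattern from PR23073; it is not taken from the patch and the function name is made
up for illustration. With AVX this can be lowered either as a reg-reg vmovss that
merges the scalar into a zeroed register, or as a vblendps against a zero vector;
the blend has the better reciprocal throughput on SandyBridge/Haswell, while the
movs[s/d] encoding is 1-2 bytes shorter.)

; Sketch (not from the patch): insert a scalar into element 0 of an
; otherwise-zero vector, i.e. the X86vzmovl pattern discussed in PR23073.
define <4 x float> @insert_scalar_into_zero(float %a) {
  ; Put the scalar into lane 0, leaving the other lanes undef.
  %v = insertelement <4 x float> undef, float %a, i32 0
  ; Keep lane 0 and take the remaining lanes from the zero vector.
  %zeroed = shufflevector <4 x float> %v, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
  ret <4 x float> %zeroed
}
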
>
> If fixing these up after regalloc is the preferred solution, I'll look into
> that. But does limiting these patterns to match specific chips in tablegen
> change anything in the shuffle lowering logic?
Changing the tablegen patterns to match specific chips makes sense to me.
In retrospect, the idea of having a combine could still make sense to
improve the matching of the movss/d memory-register variants. However,
that is a completely different problem.
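
(A similarly hypothetical sketch, again not taken from the patch, of the
memory-register case: when the inserted scalar comes from a load, the load form
of vmovss already zeroes the upper lanes, so the whole pattern can in principle
fold into a single vmovss from memory, whereas a blend-based lowering would need
a separate load plus a vblendps against a zeroed register. The function name is
made up.)

; Hypothetical example: the scalar comes from memory, so the memory form of
; vmovss (which zero-extends into the full vector) can cover load + insert.
define <4 x float> @load_scalar_into_zero(float* %p) {
  ; Load the scalar element.
  %s = load float, float* %p
  ; Insert it into lane 0 of a zero vector.
  %v = insertelement <4 x float> zeroinitializer, float %s, i32 0
  ret <4 x float> %v
}
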
>
>
> On Thu, Apr 2, 2015 at 11:54 AM, Andrea Di Biagio
> <andrea.dibiagio at gmail.com> wrote:
>>
>> Hi Sanjay,
>>
>> On Thu, Apr 2, 2015 at 6:01 PM, Sanjay Patel <spatel at rotateright.com>
>> wrote:
>> > Patch updated again:
>> >
>> > I removed all of the changes related to blend vs. movs, so this patch is
>> > now purely about adjusting the AddedComplexity to fix PR23073.
>> >
>> > I did some svn blaming and see the reasoning for the blend patterns:
>> > these were added in r219022 by Chandler. But I think that change
>> > overstepped, so I've put some FIXMEs in here. I think the procedure is to
>> > follow up on the commit mail for that checkin, so I'll do that next.
>>
>> Hi Sanjay,
>>
>> I don't think those patterns are a mistake. Blend instructions always
>> have better reciprocal throughput than movss on
>> SandyBridge/IvyBridge/Haswell. On Haswell, a blend instruction has a
>> reciprocal throughput of 0.33 because it can be scheduled for execution
>> on three different ports. On Jaguar and other AMD chips, blendps
>> doesn't have better throughput than movss, so movss may be the better
>> choice.
>>
>> I remember this problem was raised a while ago during the evaluation
>> of the new shuffle lowering, and Chandler suggested adding some logic
>> (after regalloc) to simplify the machine code by matching all the
>> complex variants of movs[s|d] (and potentially converting blends to
>> movs if necessary).
>> From a 'shuffle lowering' (and ISel) point of view, it was easier to
>> reason in terms of blends rather than movss; movss/d have a
>> memory-register form that is quite complicated to match.
>>
>> -Andrea
>>
>> >
>> >
>> > http://reviews.llvm.org/D8794
>> >
>> > Files:
>> > lib/Target/X86/X86InstrSSE.td
>> > test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > test/CodeGen/X86/vector-shuffle-256-v8.ll
>> >
>> > Index: lib/Target/X86/X86InstrSSE.td
>> > ===================================================================
>> > --- lib/Target/X86/X86InstrSSE.td
>> > +++ lib/Target/X86/X86InstrSSE.td
>> > @@ -7168,6 +7168,10 @@
>> > }
>> >
>> > // Patterns
>> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
>> > +// on targets where they have equal performance. These were changed to use
>> > +// blends because blends have better throughput on SandyBridge and Haswell, but
>> > +// movs[s/d] are 1-2 byte shorter instructions.
>> > let Predicates = [UseAVX] in {
>> > let AddedComplexity = 15 in {
>> > // Move scalar to XMM zero-extended, zeroing a VR128 then do a
>> > @@ -7184,8 +7188,10 @@
>> > // Move low f32 and clear high bits.
>> > def : Pat<(v8f32 (X86vzmovl (v8f32 VR256:$src))),
>> > (VBLENDPSYrri (v8f32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > - def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
>> > - (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > +
>> > + // Move low f64 and clear high bits.
>> > + def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
>> > + (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > }
>> >
>> > def : Pat<(v8f32 (X86vzmovl (insert_subvector undef,
>> > @@ -7199,14 +7205,19 @@
>> >                     (v2f64 (VMOVSDrr (v2f64 (V_SET0)), FR64:$src)),
>> > sub_xmm)>;
>> >
>> > - // Move low f64 and clear high bits.
>> > - def : Pat<(v4f64 (X86vzmovl (v4f64 VR256:$src))),
>> > - (VBLENDPDYrri (v4f64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > -
>> > +  // These will incur an FP/int domain crossing penalty, but it may be the only
>> > +  // way without AVX2. Do not add any complexity because we may be able to match
>> > + // more optimal patterns defined earlier in this file.
>> > + def : Pat<(v8i32 (X86vzmovl (v8i32 VR256:$src))),
>> > + (VBLENDPSYrri (v8i32 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > def : Pat<(v4i64 (X86vzmovl (v4i64 VR256:$src))),
>> > (VBLENDPDYrri (v4i64 (AVX_SET0)), VR256:$src, (i8 1))>;
>> > }
>> >
>> > +// FIXME: Prefer a movss or movsd over a blendps when optimizing for size or
>> > +// on targets where they have equal performance. These were changed to use
>> > +// blends because blends have better throughput on SandyBridge and Haswell, but
>> > +// movs[s/d] are 1-2 byte shorter instructions.
>> > let Predicates = [UseSSE41] in {
>> > // With SSE41 we can use blends for these patterns.
>> > def : Pat<(v4f32 (X86vzmovl (v4f32 VR128:$src))),
>> > Index: test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > ===================================================================
>> > --- test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > +++ test/CodeGen/X86/vector-shuffle-256-v4.ll
>> > @@ -843,8 +843,9 @@
>> > define <4 x double> @insert_reg_and_zero_v4f64(double %a) {
>> > ; ALL-LABEL: insert_reg_and_zero_v4f64:
>> > ; ALL: # BB#0:
>> > -; ALL-NEXT: vxorpd %xmm1, %xmm1, %xmm1
>> > -; ALL-NEXT: vmovsd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
>> > +; ALL-NEXT: # kill: XMM0<def> XMM0<kill> YMM0<def>
>> > +; ALL-NEXT: vxorpd %ymm1, %ymm1, %ymm1
>> > +; ALL-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
>> > ; ALL-NEXT: retq
>> > %v = insertelement <4 x double> undef, double %a, i32 0
>> >   %shuffle = shufflevector <4 x double> %v, <4 x double> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
>> > Index: test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > ===================================================================
>> > --- test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > +++ test/CodeGen/X86/vector-shuffle-256-v8.ll
>> > @@ -133,8 +133,6 @@
>> > ; AVX2: # BB#0:
>> > ; AVX2-NEXT: movl $7, %eax
>> > ; AVX2-NEXT: vmovd %eax, %xmm1
>> > -; AVX2-NEXT: vxorps %ymm2, %ymm2, %ymm2
>> > -; AVX2-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>> > ; AVX2-NEXT: vpermps %ymm0, %ymm1, %ymm0
>> > ; AVX2-NEXT: retq
>> >   %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>> > @@ -962,8 +960,6 @@
>> > ; AVX2: # BB#0:
>> > ; AVX2-NEXT: movl $7, %eax
>> > ; AVX2-NEXT: vmovd %eax, %xmm1
>> > -; AVX2-NEXT: vxorps %ymm2, %ymm2, %ymm2
>> > -; AVX2-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
>> > ; AVX2-NEXT: vpermd %ymm0, %ymm1, %ymm0
>> > ; AVX2-NEXT: retq
>> >   %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
>> >
>
>