[PATCH][X86] Fix missed selection of non-temporal store of zero vector (PR19370).

Fri Oct 17 10:37:44 PDT 2014

Thanks Quentin!
Committed revision 220054.

On Fri, Oct 17, 2014 at 5:50 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Andrea,
>
> Thanks for the quick follow-up.
> LGTM.
>
> Cheers,
> -Quentin
>
> On Oct 17, 2014, at 2:20 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> On Thu, Oct 16, 2014 at 11:17 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>
> On Oct 16, 2014, at 2:50 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi Quentin,
>
> On Thu, Oct 16, 2014 at 7:49 PM, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
> Hi Andrea,
>
> The general direction looks good, just a couple of questions:
> Since we are matching integer types, shouldn’t we use the integer variant:
> MOVNTD?
>
>
> The only integer variant available would be movntdq (SSE2) and
> vmovntdq (AVX) which can be used to store quad-words.
> Since we are storing all zeros in the destination, then I think in
> this case it should be ok to use it.
> In case, then I would probably need to guard the pattern that selects
> 'movntdq' with a 'let Predicates = [HasSSE2] in'.
>
> I noticed however that if we use the integer variant MOVNTDQmr, the
> backend would eventually convert it back to a MOVNTPSmr in the attempt
> to fix-up the execution dependency (I guess that is because the zero
> vector is generated by a xorps).
>
>
> I was expected this :).
> The thing I had in mind was, since we are using integer type I was expecting
> that in the general case the computation leading to this store would have
> the same type. Therefore, I thought it was better to directly generate the
> instruction that is in the right domain than relying on the domain fixer for
> that.
>
>
> I agree that emitting the instruction in the right domain is the best
> thing to do.
> I attached a new patch that uses the integer variant.
>
> To clarify,
> I was just a bit confused here because I didn't expect the backend to
> always emit a xorps for the vector of all zeros.
> In the example below, no matter what variant of non-temporal store we
> use, we always end up generating a (v)xorps:
>
> Example:
> ///
> define void @test2(<4 x i32>* %dst) {
>  store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 16, !nontemporal !1
>  ret void
> }
> ///
>
> Before Post-RA pseudo instruction expansion we have ( with -mcpu=corei7):
>        %XMM0<def> = V_SET0
>        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
>
> During Post-RA pseudo instruction expansion pass that sequence is converted
> to:
>        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
>        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
>
> This eventually triggers the execution domain fix:
>        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
>        MOVNTPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
>
>
>
>
> Moreover, wouldn’t it make sense to have a more general pattern for this?
> I.e., we could use VR128 instead of immAllZerosV, couldn’t we?
>
>
> Originally I tried to fix ths bug using VR128 instead of immAllZerosV.
> However, in all my experiments, the only case I found where the
> backend wrongly selected a normal store instead of a non-temporal
> store was the case where a zero vector is used as value in input. This
> is why eventually I decided to explicitly use immAllZerosV.
>
>
> I see the rational, I just wanted to be sure we caught all the cases.
>
>
> That said, If I replace immAllZerosV with VR128 it would work.
> However, I don't think (correct me if I am wrong) we would be able to
> use the integer variant MOVNTDQmr anymore, since the input vector may
> not be a zero vector.
>
>
> What is the problem with that?
> Unless I am mistaken, we are storing a 128-bits value and the way it is
> interpreted is pointless, isn’t it?
>
>
> Ah yes, you are right. There is no problem with it.
> Sorry for the confusion.
>
>
> That makes me wonder, why the actual patterns for MOVNT do not work here?
> We should be able to do implicit conversions from v4i32 to v2i64 (i.e.,
> anything that stay on VR128).
>
>
> Before this patch, we only had a single pattern per non-temporal store
> definition.
>
> For example, in the case of VMOVNTDQmr we had:
>
> let ExeDomain = SSEPackedInt in
> def VMOVNTDQmr    : VPDI<0xE7, MRMDestMem, (outs),
>                         (ins f128mem:$dst, VR128:$src),
>                         "movntdq\t{$src, $dst|$dst, $src}",
>                         [(alignednontemporalstore (v2i64 VR128:$src),
>                                                   addr:$dst)],
>                                                   IIC_SSE_MOVNT>, VEX;
>
> My understanding is that the explicit usage of 'v2i64' for the
> VMOVNTDQmr is intentional since it allows to pattern match other
> flavors of non-temporal stores when the type is not v2i64.
> This is also the reason why the actual patterns do not work here and
> we need the extra patterns from my patch.
>
> What about this new patch?
> The main differences w.r.t. the previous patch are:
> - For v4i32 non-temporal stores, MOVNTDQmr is now selected instead of
> MOVNTPSmr. This would keep the computation in the integer domain.
> - The new patterns are now more generic as they allow VR128 to be
> used instead of immAllZerosV.
>
> Please let me know if this new version of the patch looks good to you.
>
> Thanks again for your time,
> Andrea
>
>
> -Quentin
>
>
> What do you think?
>
> Thanks,
> Andrea
>
>
> Thanks,
> -Quentin
>
> On Oct 16, 2014, at 10:41 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi,
>
> This is a fix for PR19370.
> Currently the x86 backend wrongly selects a a normal vector store
> instead of a non-temporal store if the value to store is a vector of
> all zeros.
>
> Small reproducible:
>
> ;;
> define void @test(<4 x float>* %dst) {
> store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
> !nontemporal !1
> ret void
> }
>
> !1 = metadata !{i32 1}
> ;;
>
> llc (-mtriple=x86_64-unknown-unknown -mcpu=corei7-avx) generates:
> vxorps  %xmm0, %xmm0, %xmm0
> vmovaps %xmm0, (%rdi)
> retq
>
> Instead, llc should generate:
> vxorps  %xmm0, %xmm0, %xmm0
> vmovntps  %xmm0, (%rdi)
> retq
>
> In this example, the vector of all zeros is legalized to a zero vector
> of type v4i32.
> However, ISel doesn't have a rule to select a MOVNTPSmr when the
> source is not a vector of float elements. So, it eventually (wrongly)
> falls back to selecting a normal store.
>
> This patch fixes the problem adding extra ISel patterns to cover that
> particular corner case.
>
> Please let me know if ok to commit.
>
> Thanks!
> Andrea
> <patch-nontemporal.diff>
>
>
>
> <patch-v2.diff>
>
>