[PATCH][X86] Fix missed selection of non-temporal store of zero vector (PR19370).

Andrea Di Biagio andrea.dibiagio at gmail.com
Fri Oct 17 02:20:48 PDT 2014


On Thu, Oct 16, 2014 at 11:17 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>> On Oct 16, 2014, at 2:50 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>> Hi Quentin,
>>
>> On Thu, Oct 16, 2014 at 7:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> Hi Andrea,
>>>
>>> The general direction looks good, just a couple of questions:
>>> Since we are matching integer types, shouldn’t we use the integer variant:
>>> MOVNTD?
>>
>> The only integer variant available would be movntdq (SSE2) and
>> vmovntdq (AVX), which can be used to store quad-words.
>> Since we are storing all zeros to the destination, I think it
>> should be ok to use it in this case.
>> If so, I would probably need to guard the pattern that selects
>> 'movntdq' with a 'let Predicates = [HasSSE2] in'.
>>
>> I noticed however that if we use the integer variant MOVNTDQmr, the
>> backend would eventually convert it back to a MOVNTPSmr in an attempt
>> to fix up the execution dependency (I guess that is because the zero
>> vector is generated by a xorps).
>
> I was expecting this :).
> What I had in mind was: since we are using an integer type, I was expecting that in the general case the computation leading to this store would have the same type. Therefore, I thought it was better to directly generate the instruction in the right domain than to rely on the domain fixer for that.

I agree that emitting the instruction in the right domain is the best
thing to do.
I attached a new patch that uses the integer variant.

To clarify: I was just a bit confused here because I didn't expect
the backend to always emit a xorps for the vector of all zeros.
In the example below, no matter what variant of non-temporal store we
use, we always end up generating a (v)xorps:

Example:
///
define void @test2(<4 x i32>* %dst) {
  store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 16, !nontemporal !1
  ret void
}
///

Before the Post-RA pseudo instruction expansion pass we have (with -mcpu=corei7):
        %XMM0<def> = V_SET0
        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>; mem:ST16[%dst](nontemporal)
        RETQ

During the Post-RA pseudo instruction expansion pass, that sequence is converted to:
        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>; mem:ST16[%dst](nontemporal)
        RETQ

This eventually triggers the execution domain fix:
        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
        MOVNTPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>; mem:ST16[%dst](nontemporal)
        RETQ
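
For completeness, after encoding, the final assembly for test2 (still
with -mcpu=corei7) should then look like:

        xorps   %xmm0, %xmm0
        movntps %xmm0, (%rdi)
        retq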

>
>>
>>>
>>> Moreover, wouldn’t it make sense to have a more general pattern for this?
>>> I.e., we could use VR128 instead of immAllZerosV, couldn’t we?
>>
>> Originally I tried to fix this bug using VR128 instead of immAllZerosV.
>> However, in all my experiments, the only case I found where the
>> backend wrongly selected a normal store instead of a non-temporal
>> store was when a zero vector is used as the input value. This
>> is why I eventually decided to explicitly use immAllZerosV.
>
> I see the rationale, I just wanted to be sure we caught all the cases.
>
>>
>> That said, if I replace immAllZerosV with VR128 it would work.
>> However, I don't think (correct me if I am wrong) we would be able to
>> use the integer variant MOVNTDQmr anymore, since the input vector may
>> not be a zero vector.
>
> What is the problem with that?
> Unless I am mistaken, we are storing a 128-bit value and the way it is interpreted is irrelevant, isn't it?

Ah yes, you are right. There is no problem with it.
Sorry for the confusion.
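
(In other words, a hypothetical v2i64 version of test2 would store
exactly the same 16 bytes, so selecting MOVNTDQmr for any VR128 value
is purely a question of execution domain, not of correctness:

///
define void @test2b(<2 x i64>* %dst) {
  store <2 x i64> zeroinitializer, <2 x i64>* %dst, align 16, !nontemporal !1
  ret void
}
///

Here @test2b is just an illustrative name, not a test from the patch.)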

>
> That makes me wonder: why do the existing patterns for MOVNT not work here?
> We should be able to do implicit conversions from v4i32 to v2i64 (i.e., anything that stays in VR128).

Before this patch, we only had a single pattern per non-temporal store
definition.

For example, in the case of VMOVNTDQmr we had:

let ExeDomain = SSEPackedInt in
def VMOVNTDQmr    : VPDI<0xE7, MRMDestMem, (outs),
                         (ins f128mem:$dst, VR128:$src),
                         "movntdq\t{$src, $dst|$dst, $src}",
                         [(alignednontemporalstore (v2i64 VR128:$src),
                                                   addr:$dst)],
                         IIC_SSE_MOVNT>, VEX;

My understanding is that the explicit use of 'v2i64' for
VMOVNTDQmr is intentional, since it lets the other flavors of
non-temporal store be pattern matched when the type is not v2i64.
This is also the reason why the existing patterns do not work here and
why we need the extra patterns from my patch.
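
For reference, this is the extra AVX pattern from the attached patch
(the SSE2 case gets an analogous pattern that selects MOVNTDQmr):

let Predicates = [HasAVX, NoVLX] in {
  def : Pat<(alignednontemporalstore (v4i32 VR128:$src), addr:$dst),
            (VMOVNTDQmr addr:$dst, VR128:$src)>;
}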

What about this new patch?
The main differences w.r.t. the previous patch are:
 - For v4i32 non-temporal stores, MOVNTDQmr is now selected instead of
   MOVNTPSmr. This keeps the computation in the integer domain.
 - The new patterns are now more generic, as they use VR128 instead of
   immAllZerosV.

Please let me know if this new version of the patch looks good to you.

Thanks again for your time,
Andrea

>
> -Quentin
>
>>
>> What do you think?
>>
>> Thanks,
>> Andrea
>>
>>>
>>> Thanks,
>>> -Quentin
>>>
>>> On Oct 16, 2014, at 10:41 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> This is a fix for PR19370.
>>> Currently the x86 backend wrongly selects a normal vector store
>>> instead of a non-temporal store if the value to store is a vector of
>>> all zeros.
>>>
>>> Small reproducible:
>>>
>>> ;;
>>> define void @test(<4 x float>* %dst) {
>>>   store <4 x float> zeroinitializer, <4 x float>* %dst, align 16, !nontemporal !1
>>>   ret void
>>> }
>>>
>>> !1 = metadata !{i32 1}
>>> ;;
>>>
>>> llc (-mtriple=x86_64-unknown-unknown -mcpu=corei7-avx) generates:
>>> vxorps  %xmm0, %xmm0, %xmm0
>>> vmovaps %xmm0, (%rdi)
>>> retq
>>>
>>> Instead, llc should generate:
>>> vxorps  %xmm0, %xmm0, %xmm0
>>> vmovntps  %xmm0, (%rdi)
>>> retq
>>>
>>> In this example, the vector of all zeros is legalized to a zero vector
>>> of type v4i32.
>>> However, ISel doesn't have a rule to select a MOVNTPSmr when the
>>> source is not a vector of float elements. So, it eventually (wrongly)
>>> falls back to selecting a normal store.
>>>
>>> This patch fixes the problem by adding extra ISel patterns to cover
>>> that particular corner case.
>>>
>>> Please let me know if ok to commit.
>>>
>>> Thanks!
>>> Andrea
>>> <patch-nontemporal.diff>
>>>
>>>
>
-------------- next part --------------
Index: lib/Target/X86/X86InstrSSE.td
===================================================================
--- lib/Target/X86/X86InstrSSE.td	(revision 220028)
+++ lib/Target/X86/X86InstrSSE.td	(working copy)
@@ -3912,60 +3912,69 @@
 }
 
 def MOVNTPSmr : PSI<0x2B, MRMDestMem, (outs), (ins f128mem:$dst, VR128:$src),
                     "movntps\t{$src, $dst|$dst, $src}",
                     [(alignednontemporalstore (v4f32 VR128:$src), addr:$dst)],
                     IIC_SSE_MOVNT>;
 def MOVNTPDmr : PDI<0x2B, MRMDestMem, (outs), (ins f128mem:$dst, VR128:$src),
                     "movntpd\t{$src, $dst|$dst, $src}",
                     [(alignednontemporalstore(v2f64 VR128:$src), addr:$dst)],
                     IIC_SSE_MOVNT>;
 
 let ExeDomain = SSEPackedInt in
 def MOVNTDQmr : PDI<0xE7, MRMDestMem, (outs), (ins f128mem:$dst, VR128:$src),
                     "movntdq\t{$src, $dst|$dst, $src}",
                     [(alignednontemporalstore (v2i64 VR128:$src), addr:$dst)],
                     IIC_SSE_MOVNT>;
 
 // There is no AVX form for instructions below this point
 def MOVNTImr : I<0xC3, MRMDestMem, (outs), (ins i32mem:$dst, GR32:$src),
                  "movnti{l}\t{$src, $dst|$dst, $src}",
                  [(nontemporalstore (i32 GR32:$src), addr:$dst)],
                  IIC_SSE_MOVNT>,
                PS, Requires<[HasSSE2]>;
 def MOVNTI_64mr : RI<0xC3, MRMDestMem, (outs), (ins i64mem:$dst, GR64:$src),
                      "movnti{q}\t{$src, $dst|$dst, $src}",
                      [(nontemporalstore (i64 GR64:$src), addr:$dst)],
                      IIC_SSE_MOVNT>,
                   PS, Requires<[HasSSE2]>;
 } // SchedRW = [WriteStore]
 
+let Predicates = [HasAVX, NoVLX] in {
+  def : Pat<(alignednontemporalstore (v4i32 VR128:$src), addr:$dst),
+            (VMOVNTDQmr addr:$dst, VR128:$src)>;
+}
+
+def : Pat<(alignednontemporalstore (v4i32 VR128:$src), addr:$dst),
+          (MOVNTDQmr addr:$dst, VR128:$src)>;
+
+
 } // AddedComplexity
 
 //===----------------------------------------------------------------------===//
 // SSE 1 & 2 - Prefetch and memory fence
 //===----------------------------------------------------------------------===//
 
 // Prefetch intrinsic.
 let Predicates = [HasSSE1], SchedRW = [WriteLoad] in {
 def PREFETCHT0   : I<0x18, MRM1m, (outs), (ins i8mem:$src),
     "prefetcht0\t$src", [(prefetch addr:$src, imm, (i32 3), (i32 1))],
     IIC_SSE_PREFETCH>, TB;
 def PREFETCHT1   : I<0x18, MRM2m, (outs), (ins i8mem:$src),
     "prefetcht1\t$src", [(prefetch addr:$src, imm, (i32 2), (i32 1))],
     IIC_SSE_PREFETCH>, TB;
 def PREFETCHT2   : I<0x18, MRM3m, (outs), (ins i8mem:$src),
     "prefetcht2\t$src", [(prefetch addr:$src, imm, (i32 1), (i32 1))],
     IIC_SSE_PREFETCH>, TB;
 def PREFETCHNTA  : I<0x18, MRM0m, (outs), (ins i8mem:$src),
     "prefetchnta\t$src", [(prefetch addr:$src, imm, (i32 0), (i32 1))],
     IIC_SSE_PREFETCH>, TB;
 }
 
 // FIXME: How should flush instruction be modeled?
 let SchedRW = [WriteLoad] in {
 // Flush cache
 def CLFLUSH : I<0xAE, MRM7m, (outs), (ins i8mem:$src),
                "clflush\t$src", [(int_x86_sse2_clflush addr:$src)],
                IIC_SSE_PREFETCH>, TB, Requires<[HasSSE2]>;
 }
 
Index: test/CodeGen/X86/nontemporal-2.ll
===================================================================
--- test/CodeGen/X86/nontemporal-2.ll	(revision 0)
+++ test/CodeGen/X86/nontemporal-2.ll	(working copy)
@@ -0,0 +1,30 @@
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=corei7 | FileCheck %s -check-prefix=CHECK -check-prefix=SSE
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=corei7-avx | FileCheck %s -check-prefix=CHECK -check-prefix=AVX
+
+; Make sure that we generate non-temporal stores for the test cases below.
+
+define void @test1(<4 x float>* %dst) {
+; CHECK-LABEL: test1:
+; SSE: movntps
+; AVX: vmovntps
+  store <4 x float> zeroinitializer, <4 x float>* %dst, align 16, !nontemporal !1
+  ret void
+}
+
+define void @test2(<4 x i32>* %dst) {
+; CHECK-LABEL: test2:
+; SSE: movntps
+; AVX: vmovntps
+  store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 16, !nontemporal !1
+  ret void
+}
+
+define void @test3(<2 x double>* %dst) {
+; CHECK-LABEL: test3:
+; SSE: movntps
+; AVX: vmovntps
+  store <2 x double> zeroinitializer, <2 x double>* %dst, align 16, !nontemporal !1
+  ret void
+}
+
+!1 = metadata !{i32 1}

