[PATCH][X86] Fix missed selection of non-temporal store of zero vector (PR19370).

Fri Oct 17 09:50:37 PDT 2014

Hi Andrea,

Thanks for the quick follow-up.
LGTM.

Cheers,
-Quentin

On Oct 17, 2014, at 2:20 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> On Thu, Oct 16, 2014 at 11:17 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> 
>>> On Oct 16, 2014, at 2:50 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>> 
>>> Hi Quentin,
>>> 
>>> On Thu, Oct 16, 2014 at 7:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> Hi Andrea,
>>>> 
>>>> The general direction looks good, just a couple of questions:
>>>> Since we are matching integer types, shouldn’t we use the integer variant:
>>>> MOVNTD?
>>> 
>>> The only integer variant available would be movntdq (SSE2) and
>>> vmovntdq (AVX) which can be used to store quad-words.
>>> Since we are storing all zeros in the destination, then I think in
>>> this case it should be ok to use it.
>>> In case, then I would probably need to guard the pattern that selects
>>> 'movntdq' with a 'let Predicates = [HasSSE2] in'.
>>> 
>>> I noticed however that if we use the integer variant MOVNTDQmr, the
>>> backend would eventually convert it back to a MOVNTPSmr in the attempt
>>> to fix-up the execution dependency (I guess that is because the zero
>>> vector is generated by a xorps).
>> 
>> I was expected this :).
>> The thing I had in mind was, since we are using integer type I was expecting that in the general case the computation leading to this store would have the same type. Therefore, I thought it was better to directly generate the instruction that is in the right domain than relying on the domain fixer for that.
> 
> I agree that emitting the instruction in the right domain is the best
> thing to do.
> I attached a new patch that uses the integer variant.
> 
> To clarify,
> I was just a bit confused here because I didn't expect the backend to
> always emit a xorps for the vector of all zeros.
> In the example below, no matter what variant of non-temporal store we
> use, we always end up generating a (v)xorps:
> 
> Example:
> ///
> define void @test2(<4 x i32>* %dst) {
>  store <4 x i32> zeroinitializer, <4 x i32>* %dst, align 16, !nontemporal !1
>  ret void
> }
> ///
> 
> Before Post-RA pseudo instruction expansion we have ( with -mcpu=corei7):
>        %XMM0<def> = V_SET0
>        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
> 
> During Post-RA pseudo instruction expansion pass that sequence is converted to:
>        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
>        MOVNTDQmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
> 
> This eventually triggers the execution domain fix:
>        %XMM0<def,tied1> = XORPSrr %XMM0<undef,tied0>, %XMM0<undef>
>        MOVNTPSmr %RDI<kill>, 1, %noreg, 0, %noreg, %XMM0<kill>;
> mem:ST16[%dst](nontemporal)
>        RETQ
> 
>> 
>>> 
>>>> 
>>>> Moreover, wouldn’t it make sense to have a more general pattern for this?
>>>> I.e., we could use VR128 instead of immAllZerosV, couldn’t we?
>>> 
>>> Originally I tried to fix ths bug using VR128 instead of immAllZerosV.
>>> However, in all my experiments, the only case I found where the
>>> backend wrongly selected a normal store instead of a non-temporal
>>> store was the case where a zero vector is used as value in input. This
>>> is why eventually I decided to explicitly use immAllZerosV.
>> 
>> I see the rational, I just wanted to be sure we caught all the cases.
>> 
>>> 
>>> That said, If I replace immAllZerosV with VR128 it would work.
>>> However, I don't think (correct me if I am wrong) we would be able to
>>> use the integer variant MOVNTDQmr anymore, since the input vector may
>>> not be a zero vector.
>> 
>> What is the problem with that?
>> Unless I am mistaken, we are storing a 128-bits value and the way it is interpreted is pointless, isn’t it?
> 
> Ah yes, you are right. There is no problem with it.
> Sorry for the confusion.
> 
>> 
>> That makes me wonder, why the actual patterns for MOVNT do not work here?
>> We should be able to do implicit conversions from v4i32 to v2i64 (i.e., anything that stay on VR128).
> 
> Before this patch, we only had a single pattern per non-temporal store
> definition.
> 
> For example, in the case of VMOVNTDQmr we had:
> 
> let ExeDomain = SSEPackedInt in
> def VMOVNTDQmr    : VPDI<0xE7, MRMDestMem, (outs),
>                         (ins f128mem:$dst, VR128:$src),
>                         "movntdq\t{$src, $dst|$dst, $src}",
>                         [(alignednontemporalstore (v2i64 VR128:$src),
>                                                   addr:$dst)],
>                                                   IIC_SSE_MOVNT>, VEX;
> 
> My understanding is that the explicit usage of 'v2i64' for the
> VMOVNTDQmr is intentional since it allows to pattern match other
> flavors of non-temporal stores when the type is not v2i64.
> This is also the reason why the actual patterns do not work here and
> we need the extra patterns from my patch.
> 
> What about this new patch?
> The main differences w.r.t. the previous patch are:
> - For v4i32 non-temporal stores, MOVNTDQmr is now selected instead of
> MOVNTPSmr. This would keep the computation in the integer domain.
> - The new patterns are now more generic as they allow VR128 to be
> used instead of immAllZerosV.
> 
> Please let me know if this new version of the patch looks good to you.
> 
> Thanks again for your time,
> Andrea
> 
>> 
>> -Quentin
>> 
>>> 
>>> What do you think?
>>> 
>>> Thanks,
>>> Andrea
>>> 
>>>> 
>>>> Thanks,
>>>> -Quentin
>>>> 
>>>> On Oct 16, 2014, at 10:41 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
>>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> This is a fix for PR19370.
>>>> Currently the x86 backend wrongly selects a a normal vector store
>>>> instead of a non-temporal store if the value to store is a vector of
>>>> all zeros.
>>>> 
>>>> Small reproducible:
>>>> 
>>>> ;;
>>>> define void @test(<4 x float>* %dst) {
>>>> store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
>>>> !nontemporal !1
>>>> ret void
>>>> }
>>>> 
>>>> !1 = metadata !{i32 1}
>>>> ;;
>>>> 
>>>> llc (-mtriple=x86_64-unknown-unknown -mcpu=corei7-avx) generates:
>>>> vxorps  %xmm0, %xmm0, %xmm0
>>>> vmovaps %xmm0, (%rdi)
>>>> retq
>>>> 
>>>> Instead, llc should generate:
>>>> vxorps  %xmm0, %xmm0, %xmm0
>>>> vmovntps  %xmm0, (%rdi)
>>>> retq
>>>> 
>>>> In this example, the vector of all zeros is legalized to a zero vector
>>>> of type v4i32.
>>>> However, ISel doesn't have a rule to select a MOVNTPSmr when the
>>>> source is not a vector of float elements. So, it eventually (wrongly)
>>>> falls back to selecting a normal store.
>>>> 
>>>> This patch fixes the problem adding extra ISel patterns to cover that
>>>> particular corner case.
>>>> 
>>>> Please let me know if ok to commit.
>>>> 
>>>> Thanks!
>>>> Andrea
>>>> <patch-nontemporal.diff>
>>>> 
>>>> 
>> 
> <patch-v2.diff>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20141017/bebc94b1/attachment.html>