[llvm] r249820 - [MemCpyOpt] Fix wrong merging adjacent nontemporal stores into memset calls.

Andrea Di Biagio via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 9 15:00:09 PDT 2015


On Fri, Oct 9, 2015 at 10:40 PM, Quentin Colombet <qcolombet at apple.com>
wrote:

>
> On Oct 9, 2015, at 2:24 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
> Hi Philip,
>
> I have attached the prototype memset patch that I mentioned before.
>
> The patch is not complicated and would allows to merge nontemporal stores
> together.
> There are two important implications:
> 1) It will _not_ try to mix temporal stores with nontemporal stores. If
> people really really want the other behavior then we can think of adding a
> flag as a future extension.
> 2) Large memsets would still generate library calls. On X86, we would
> follow the usual rules (i.e.: memset is unrolled up to 16 stores; otherwise
> we generate a libcall).
>
> Point 2. is sub-optimal. However, I don't think it would occur often in
> practice to have to expand very big nontemporal memsets. Also, for very
> large memsets, the library implementation should be (hopefully!) smart
> enough to use a specialized loop of nontemporal stores.
>
>
> How do you envision something like that to be possible?
> AFAICT you cannot pass the non temporal information to the library.
>

You cannot pass that information to the library.
What I meant is that library memcpy/memset functions tend to be heavily
optimized for the target. For example, you could have a smart specialized
variant of memset that uses nontemporal stores for very large memsets.
That said, I don't like when the compiler speculates too much on what a
library call would do. I prefer a more conservative approach like my
original patch.


>
> As a side note: the patch would require a change in the nontemporal.ll
> test.
> I think it will also require extra x86 tests to make sure that the
> nontemporal memset is expanded to a sequence of movnt instructions.
>
> In conclusion:
> do you (Philip and Quentin) think this approach would be a good compromise?
>
>
> That would work for me, though I was already (and am still :)) happy!
>

Don't get me wrong, I am very happy as well!
Since I mentioned that I had an alternative patch, I thought it would have
been bad not to share it.

-Andrea


>
> Cheers,
> -Quentin
>
> For what is worth, this approach would still address most of our customer
> concerns (i.e. it is probably as good as the previous patch).
> If you prefer this approach more then I am okay with sending it for review
> (especially if this approach makes everybody happy :-) ).
>
> Cheers,
> Andrea
>
> On Fri, Oct 9, 2015 at 8:51 PM, Philip Reames <listmail at philipreames.com>
> wrote:
>
>> I don't have a strong enough opinion here to push the issue, but I think
>> I still disagree with the chosen approach.  However, I will defer to you
>> two since you seem to have discussed the tradeoffs and are better
>> positioned to make the decision.
>>
>> p.s. If the patch to support non-temporal memsets was simple, it might be
>> interesting in it's own right.  Even if we don't canonicalize to it, having
>> the ability to chose it in the frontend could be interesting.  Note the
>> "could"; I have no concrete use for this in the near future.
>>
>> Philip
>>
>>
>>
>>
>> On 10/09/2015 11:15 AM, Quentin Colombet wrote:
>>
>> Hi Andrea,
>>
>> On Oct 9, 2015, at 10:58 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
>> wrote:
>>
>>
>>
>> On Fri, Oct 9, 2015 at 6:09 PM, Quentin Colombet via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> Hi Philip,
>>>
>>> >
>>> > One of my worries is about how disabling this will effect other
>>> optimizations around non-temporal stores.  In particular, if we had a
>>> memset following the stores, we'd no longer have the chance to combine them
>>> into a single memset.  Replacing two non-temporal stores and a large memset
>>> with a larger memset seems highly likely to be profitable.
>>>
>>> I am not sure about that. I would expect any memset implementation to
>>> touch the cache and thus, to have an impact on the performance of the
>>> surrounding code.
>>>
>>
>> I agree with Quentin here. I don't think it would be better since you
>> would end up polluting the caches.
>>
>>
>>> >
>>> > Analogously, replacing 8 adjacent non-temporal byte stores with a
>>> single memset (and then i64 store) also seems profitable and possibly
>>> disabled by this transform.
>>> >
>>> > Also, if we keep this change, the comment needs to be extended to
>>> highlight this is a profitability decision, not a legality one.
>>>
>>> Agree with that part.
>>>
>>
>> Adjacent stores like the ones you described should be easily handled by
>> 'MergeConsecutiveStores' in the DAGCombiner. So, I don't think this
>> transformation is pessimizing the codegen for that case.
>> MergeConsecutiveStores should be able to merge those small stores and still
>> preserve the nontemporal hint.
>>
>> Overall I am not completely against the idea of propagating the
>> !nontemporal metadata to memset/memcpy (I actually have a patch that
>> implements that for memset - if people really prefer that approach I can
>> upload it for review). However, at the moment I don't see the advantage in
>> doing it (I'd like to see an example where that transformation would be
>> profitable).
>>
>>
>> The thing I was agreeing with was that the comment should say that we
>> block the transformation not because we couldn’t fix memset, but because it
>> is not worth it given we would need to split the memset later. That was how
>> I read the profitability part, though I guess Philip was more into what you
>> understood :).
>>
>> Cheers,
>> Q.
>>
>>
>> -Andrea
>>
>> Thanks,
>>
>>> -Quentin
>>>
>>> >
>>> > Philip
>>> >
>>> > On 10/09/2015 03:53 AM, Andrea Di Biagio via llvm-commits wrote:
>>> >> Author: adibiagio
>>> >> Date: Fri Oct  9 05:53:41 2015
>>> >> New Revision: 249820
>>> >>
>>> >> URL:  <http://llvm.org/viewvc/llvm-project?rev=249820&view=rev>
>>> http://llvm.org/viewvc/llvm-project?rev=249820&view=rev
>>> >> Log:
>>> >> [MemCpyOpt] Fix wrong merging adjacent nontemporal stores into memset
>>> calls.
>>> >>
>>> >> Pass MemCpyOpt doesn't check if a store instruction is nontemporal.
>>> >> As a consequence, adjacent nontemporal stores are always merged into a
>>> >> memset call.
>>> >>
>>> >> Example:
>>> >>
>>> >> ;;;
>>> >> define void @foo(<4 x float>* nocapture %p) {
>>> >> entry:
>>> >>   store <4 x float> zeroinitializer, <4 x float>* %p, align 16,
>>> !nontemporal !0
>>> >>   %p1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
>>> >>   store <4 x float> zeroinitializer, <4 x float>* %p1, align 16,
>>> !nontemporal !0
>>> >>   ret void
>>> >> }
>>> >>
>>> >> !0 = !{i32 1}
>>> >> ;;;
>>> >>
>>> >> In this example, the two nontemporal stores are combined to a memset
>>> of zero
>>> >> which does not preserve the nontemporal hint. Later on the backend
>>> (tested on a
>>> >> x86-64 corei7) expands that memset call into a sequence of two normal
>>> 16-byte
>>> >> aligned vector stores.
>>> >>
>>> >> opt -memcpyopt example.ll -S -o - | llc -mcpu=corei7 -o -
>>> >>
>>> >> Before:
>>> >>   xorps  %xmm0, %xmm0
>>> >>   movaps  %xmm0, 16(%rdi)
>>> >>   movaps  %xmm0, (%rdi)
>>> >>
>>> >> With this patch, we no longer merge nontemporal stores into calls to
>>> memset.
>>> >> In this example, llc correctly expands the two stores into two
>>> movntps:
>>> >>   xorps  %xmm0, %xmm0
>>> >>   movntps %xmm0, 16(%rdi)
>>> >>   movntps  %xmm0, (%rdi)
>>> >>
>>> >> In theory, we could extend the usage of !nontemporal metadata to
>>> memcpy/memset
>>> >> calls. However a change like that would only have the effect of
>>> forcing the
>>> >> backend to expand !nontemporal memsets back to sequences of store
>>> instructions.
>>> >> A memset library call would not have exactly the same semantic of a
>>> builtin
>>> >> !nontemporal memset call. So, SelectionDAG will have to
>>> conservatively expand
>>> >> it back to a sequence of !nontemporal stores (effectively undoing the
>>> merging).
>>> >>
>>> >> Differential Revision:  <http://reviews.llvm.org/D13519>
>>> http://reviews.llvm.org/D13519
>>> >>
>>> >> Added:
>>> >>     llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>> >> Modified:
>>> >>     llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>> >>
>>> >> Modified: llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>> >> URL:
>>> <http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff>
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff
>>> >>
>>> ==============================================================================
>>> >> --- llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp (original)
>>> >> +++ llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp Fri Oct  9
>>> 05:53:41 2015
>>> >> @@ -488,6 +488,16 @@ Instruction *MemCpyOpt::tryMergingIntoMe
>>> >>    bool MemCpyOpt::processStore(StoreInst *SI, BasicBlock::iterator
>>> &BBI) {
>>> >>    if (!SI->isSimple()) return false;
>>> >> +
>>> >> +  // Avoid merging nontemporal stores since the resulting
>>> >> +  // memcpy/memset would not be able to preserve the nontemporal
>>> hint.
>>> >> +  // In theory we could teach how to propagate the !nontemporal
>>> metadata to
>>> >> +  // memset calls. However, that change would force the backend to
>>> >> +  // conservatively expand !nontemporal memset calls back to
>>> sequences of
>>> >> +  // store instructions (effectively undoing the merging).
>>> >> +  if (SI->getMetadata(LLVMContext::MD_nontemporal))
>>> >> +    return false;
>>> >> +
>>> >>    const DataLayout &DL = SI->getModule()->getDataLayout();
>>> >>      // Detect cases where we're performing call slot forwarding, but
>>> >>
>>> >> Added: llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>> >> URL:
>>> <http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto>
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto
>>> >>
>>> ==============================================================================
>>> >> --- llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll (added)
>>> >> +++ llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll Fri Oct  9
>>> 05:53:41 2015
>>> >> @@ -0,0 +1,49 @@
>>> >> +; RUN: opt < %s -memcpyopt -S | FileCheck %s
>>> >> +
>>> >> +target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
>>> >> +
>>> >> +; Verify that we don't combine nontemporal stores into memset calls.
>>> >> +
>>> >> +define void @nontemporal_stores_1(<4 x float>* nocapture %dst) {
>>> >> +; CHECK-LABEL: @nontemporal_stores_1
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %dst, align
>>> 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr1,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr2,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr3,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr4,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr5,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr6,
>>> align 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr7,
>>> align 16, !nontemporal !0
>>> >> +; CHECK-NEXT: ret void
>>> >> +entry:
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
>>> !nontemporal !0
>>> >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 1
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1, align 16,
>>> !nontemporal !0
>>> >> +  %ptr2 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 2
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr2, align 16,
>>> !nontemporal !0
>>> >> +  %ptr3 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 3
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr3, align 16,
>>> !nontemporal !0
>>> >> +  %ptr4 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 4
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr4, align 16,
>>> !nontemporal !0
>>> >> +  %ptr5 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 5
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr5, align 16,
>>> !nontemporal !0
>>> >> +  %ptr6 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 6
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr6, align 16,
>>> !nontemporal !0
>>> >> +  %ptr7 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 7
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr7, align 16,
>>> !nontemporal !0
>>> >> +  ret void
>>> >> +}
>>> >> +
>>> >> +define void @nontemporal_stores_2(<4 x float>* nocapture %dst) {
>>> >> +; CHECK-LABEL: @nontemporal_stores_2
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %dst, align
>>> 16, !nontemporal !0
>>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr1,
>>> align 16, !nontemporal !0
>>> >> +; CHECK-NEXT: ret void
>>> >> +entry:
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
>>> !nontemporal !0
>>> >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64
>>> 1
>>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1, align 16,
>>> !nontemporal !0
>>> >> +  ret void
>>> >> +}
>>> >> +
>>> >> +!0 = !{i32 1}
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> llvm-commits mailing list
>>> >>  <llvm-commits at lists.llvm.org>llvm-commits at lists.llvm.org
>>> >>  <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>> >
>>> > _______________________________________________
>>> > llvm-commits mailing list
>>> >  <llvm-commits at lists.llvm.org>llvm-commits at lists.llvm.org
>>> >  <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>
>>
>>
> <patch-alternative-solution-nontemporal.diff>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20151009/8dc14420/attachment.html>


More information about the llvm-commits mailing list