[llvm] r249820 - [MemCpyOpt] Fix wrong merging adjacent nontemporal stores into memset calls.

Fri Oct 9 14:43:22 PDT 2015

I definitely like this approach better, but again, I'll leave the actual 
choice to you two.  I think you're better positioned to understand all 
the implications and use cases.

If you do chose to go with this approach, separating it into two patches 
would be helpful.  Patch one could add the backend support for a memset 
with the !nontemporal metadata.  Patch two would create them.  I can't 
really help with patch 1, but I can help review patch 2 if desired.  
(Figure I should volunteer some help if I'm asking you to do extra work.)

Philip

On 10/09/2015 02:24 PM, Andrea Di Biagio wrote:
> Hi Philip,
>
> I have attached the prototype memset patch that I mentioned before.
>
> The patch is not complicated and would allows to merge nontemporal 
> stores together.
> There are two important implications:
> 1) It will _not_ try to mix temporal stores with nontemporal stores. 
> If people really really want the other behavior then we can think of 
> adding a flag as a future extension.
> 2) Large memsets would still generate library calls. On X86, we would 
> follow the usual rules (i.e.: memset is unrolled up to 16 stores; 
> otherwise we generate a libcall).
>
> Point 2. is sub-optimal. However, I don't think it would occur often 
> in practice to have to expand very big nontemporal memsets. Also, for 
> very large memsets, the library implementation should be (hopefully!) 
> smart enough to use a specialized loop of nontemporal stores.
>
> As a side note: the patch would require a change in the nontemporal.ll 
> test.
> I think it will also require extra x86 tests to make sure that the 
> nontemporal memset is expanded to a sequence of movnt instructions.
>
> In conclusion:
> do you (Philip and Quentin) think this approach would be a good 
> compromise?
> For what is worth, this approach would still address most of our 
> customer concerns (i.e. it is probably as good as the previous patch).
> If you prefer this approach more then I am okay with sending it for 
> review (especially if this approach makes everybody happy :-) ).
>
> Cheers,
> Andrea
>
> On Fri, Oct 9, 2015 at 8:51 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
>     I don't have a strong enough opinion here to push the issue, but I
>     think I still disagree with the chosen approach.  However, I will
>     defer to you two since you seem to have discussed the tradeoffs
>     and are better positioned to make the decision.
>
>     p.s. If the patch to support non-temporal memsets was simple, it
>     might be interesting in it's own right.  Even if we don't
>     canonicalize to it, having the ability to chose it in the frontend
>     could be interesting.  Note the "could"; I have no concrete use
>     for this in the near future.
>
>     Philip
>
>
>
>
>     On 10/09/2015 11:15 AM, Quentin Colombet wrote:
>>     Hi Andrea,
>>
>>>     On Oct 9, 2015, at 10:58 AM, Andrea Di Biagio
>>>     <andrea.dibiagio at gmail.com <mailto:andrea.dibiagio at gmail.com>>
>>>     wrote:
>>>
>>>
>>>
>>>     On Fri, Oct 9, 2015 at 6:09 PM, Quentin Colombet via
>>>     llvm-commits<llvm-commits at lists.llvm.org
>>>     <mailto:llvm-commits at lists.llvm.org>>wrote:
>>>
>>>         Hi Philip,
>>>
>>>         >
>>>         > One of my worries is about how disabling this will effect
>>>         other optimizations around non-temporal stores.  In
>>>         particular, if we had a memset following the stores, we'd no
>>>         longer have the chance to combine them into a single
>>>         memset.  Replacing two non-temporal stores and a large
>>>         memset with a larger memset seems highly likely to be
>>>         profitable.
>>>
>>>         I am not sure about that. I would expect any memset
>>>         implementation to touch the cache and thus, to have an
>>>         impact on the performance of the surrounding code.
>>>
>>>
>>>     I agree with Quentin here. I don't think it would be better
>>>     since you would end up polluting the caches.
>>>
>>>
>>>         >
>>>         > Analogously, replacing 8 adjacent non-temporal byte stores
>>>         with a single memset (and then i64 store) also seems
>>>         profitable and possibly disabled by this transform.
>>>         >
>>>         > Also, if we keep this change, the comment needs to be
>>>         extended to highlight this is a profitability decision, not
>>>         a legality one.
>>>
>>>         Agree with that part.
>>>
>>>
>>>     Adjacent stores like the ones you described should be easily
>>>     handled by 'MergeConsecutiveStores' in the DAGCombiner. So, I
>>>     don't think this transformation is pessimizing the codegen for
>>>     that case. MergeConsecutiveStores should be able to merge those
>>>     small stores and still preserve the nontemporal hint.
>>>
>>>     Overall I am not completely against the idea of propagating the
>>>     !nontemporal metadata to memset/memcpy (I actually have a patch
>>>     that implements that for memset - if people really prefer that
>>>     approach I can upload it for review). However, at the moment I
>>>     don't see the advantage in doing it (I'd like to see an example
>>>     where that transformation would be profitable).
>>
>>     The thing I was agreeing with was that the comment should say
>>     that we block the transformation not because we couldn’t fix
>>     memset, but because it is not worth it given we would need to
>>     split the memset later. That was how I read the profitability
>>     part, though I guess Philip was more into what you understood :).
>>
>>     Cheers,
>>     Q.
>>
>>>
>>>     -Andrea
>>>
>>>     Thanks,
>>>
>>>         -Quentin
>>>
>>>         >
>>>         > Philip
>>>         >
>>>         > On 10/09/2015 03:53 AM, Andrea Di Biagio via llvm-commits
>>>         wrote:
>>>         >> Author: adibiagio
>>>         >> Date: Fri Oct  9 05:53:41 2015
>>>         >> New Revision: 249820
>>>         >>
>>>         >> URL:http://llvm.org/viewvc/llvm-project?rev=249820&view=rev
>>>         >> Log:
>>>         >> [MemCpyOpt] Fix wrong merging adjacent nontemporal stores
>>>         into memset calls.
>>>         >>
>>>         >> Pass MemCpyOpt doesn't check if a store instruction is
>>>         nontemporal.
>>>         >> As a consequence, adjacent nontemporal stores are always
>>>         merged into a
>>>         >> memset call.
>>>         >>
>>>         >> Example:
>>>         >>
>>>         >> ;;;
>>>         >> define void @foo(<4 x float>* nocapture %p) {
>>>         >> entry:
>>>         >>   store <4 x float> zeroinitializer, <4 x float>* %p,
>>>         align 16, !nontemporal !0
>>>         >>   %p1 = getelementptr inbounds <4 x float>, <4 x float>*
>>>         %dst, i64 1
>>>         >>   store <4 x float> zeroinitializer, <4 x float>* %p1,
>>>         align 16, !nontemporal !0
>>>         >>   ret void
>>>         >> }
>>>         >>
>>>         >> !0 = !{i32 1}
>>>         >> ;;;
>>>         >>
>>>         >> In this example, the two nontemporal stores are combined
>>>         to a memset of zero
>>>         >> which does not preserve the nontemporal hint. Later on
>>>         the backend (tested on a
>>>         >> x86-64 corei7) expands that memset call into a sequence
>>>         of two normal 16-byte
>>>         >> aligned vector stores.
>>>         >>
>>>         >> opt -memcpyopt example.ll -S -o - | llc -mcpu=corei7 -o -
>>>         >>
>>>         >> Before:
>>>         >>   xorps  %xmm0, %xmm0
>>>         >>   movaps  %xmm0, 16(%rdi)
>>>         >>   movaps  %xmm0, (%rdi)
>>>         >>
>>>         >> With this patch, we no longer merge nontemporal stores
>>>         into calls to memset.
>>>         >> In this example, llc correctly expands the two stores
>>>         into two movntps:
>>>         >>   xorps  %xmm0, %xmm0
>>>         >>   movntps %xmm0, 16(%rdi)
>>>         >>   movntps  %xmm0, (%rdi)
>>>         >>
>>>         >> In theory, we could extend the usage of !nontemporal
>>>         metadata to memcpy/memset
>>>         >> calls. However a change like that would only have the
>>>         effect of forcing the
>>>         >> backend to expand !nontemporal memsets back to sequences
>>>         of store instructions.
>>>         >> A memset library call would not have exactly the same
>>>         semantic of a builtin
>>>         >> !nontemporal memset call. So, SelectionDAG will have to
>>>         conservatively expand
>>>         >> it back to a sequence of !nontemporal stores (effectively
>>>         undoing the merging).
>>>         >>
>>>         >> Differential Revision:http://reviews.llvm.org/D13519
>>>         >>
>>>         >> Added:
>>>         >>  llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>>         >> Modified:
>>>         >>  llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>>         >>
>>>         >> Modified:
>>>         llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>>         >>
>>>         URL:http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff
>>>         >>
>>>         ==============================================================================
>>>         >> --- llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>>         (original)
>>>         >> +++ llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>>>         Fri Oct  9 05:53:41 2015
>>>         >> @@ -488,6 +488,16 @@ Instruction *MemCpyOpt::tryMergingIntoMe
>>>         >>    bool MemCpyOpt::processStore(StoreInst *SI,
>>>         BasicBlock::iterator &BBI) {
>>>         >>    if (!SI->isSimple()) return false;
>>>         >> +
>>>         >> +  // Avoid merging nontemporal stores since the resulting
>>>         >> +  // memcpy/memset would not be able to preserve the
>>>         nontemporal hint.
>>>         >> +  // In theory we could teach how to propagate the
>>>         !nontemporal metadata to
>>>         >> +  // memset calls. However, that change would force the
>>>         backend to
>>>         >> +  // conservatively expand !nontemporal memset calls
>>>         back to sequences of
>>>         >> +  // store instructions (effectively undoing the merging).
>>>         >> +  if (SI->getMetadata(LLVMContext::MD_nontemporal))
>>>         >> +    return false;
>>>         >> +
>>>         >>    const DataLayout &DL = SI->getModule()->getDataLayout();
>>>         >>      // Detect cases where we're performing call slot
>>>         forwarding, but
>>>         >>
>>>         >> Added: llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>>         >>
>>>         URL:http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto
>>>         >>
>>>         ==============================================================================
>>>         >> --- llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>>         (added)
>>>         >> +++ llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>>>         Fri Oct  9 05:53:41 2015
>>>         >> @@ -0,0 +1,49 @@
>>>         >> +; RUN: opt < %s -memcpyopt -S | FileCheck %s
>>>         >> +
>>>         >> +target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
>>>         >> +
>>>         >> +; Verify that we don't combine nontemporal stores into
>>>         memset calls.
>>>         >> +
>>>         >> +define void @nontemporal_stores_1(<4 x float>* nocapture
>>>         %dst) {
>>>         >> +; CHECK-LABEL: @nontemporal_stores_1
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %dst, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr1, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr2, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr3, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr4, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr5, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr6, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr7, align 16, !nontemporal !0
>>>         >> +; CHECK-NEXT: ret void
>>>         >> +entry:
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %dst,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 1
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr2 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 2
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr2,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr3 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 3
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr3,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr4 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 4
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr4,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr5 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 5
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr5,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr6 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 6
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr6,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr7 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 7
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr7,
>>>         align 16, !nontemporal !0
>>>         >> +  ret void
>>>         >> +}
>>>         >> +
>>>         >> +define void @nontemporal_stores_2(<4 x float>* nocapture
>>>         %dst) {
>>>         >> +; CHECK-LABEL: @nontemporal_stores_2
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %dst, align 16, !nontemporal !0
>>>         >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>*
>>>         %ptr1, align 16, !nontemporal !0
>>>         >> +; CHECK-NEXT: ret void
>>>         >> +entry:
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %dst,
>>>         align 16, !nontemporal !0
>>>         >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x
>>>         float>* %dst, i64 1
>>>         >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1,
>>>         align 16, !nontemporal !0
>>>         >> +  ret void
>>>         >> +}
>>>         >> +
>>>         >> +!0 = !{i32 1}
>>>         >>
>>>         >>
>>>         >> _______________________________________________
>>>         >> llvm-commits mailing list
>>>         >>llvm-commits at lists.llvm.org
>>>         <mailto:llvm-commits at lists.llvm.org>
>>>         >>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>         >
>>>         > _______________________________________________
>>>         > llvm-commits mailing list
>>>         >llvm-commits at lists.llvm.org
>>>         <mailto:llvm-commits at lists.llvm.org>
>>>         >http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>>         _______________________________________________
>>>         llvm-commits mailing list
>>>         llvm-commits at lists.llvm.org <mailto:llvm-commits at lists.llvm.org>
>>>         http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20151009/6c4bdb29/attachment.html>