[llvm] r249820 - [MemCpyOpt] Fix wrong merging adjacent nontemporal stores into memset calls.

Fri Oct 9 14:24:44 PDT 2015

Hi Philip,

I have attached the prototype memset patch that I mentioned before.

The patch is not complicated and would allows to merge nontemporal stores
together.
There are two important implications:
1) It will _not_ try to mix temporal stores with nontemporal stores. If
people really really want the other behavior then we can think of adding a
flag as a future extension.
2) Large memsets would still generate library calls. On X86, we would
follow the usual rules (i.e.: memset is unrolled up to 16 stores; otherwise
we generate a libcall).

Point 2. is sub-optimal. However, I don't think it would occur often in
practice to have to expand very big nontemporal memsets. Also, for very
large memsets, the library implementation should be (hopefully!) smart
enough to use a specialized loop of nontemporal stores.

As a side note: the patch would require a change in the nontemporal.ll test.
I think it will also require extra x86 tests to make sure that the
nontemporal memset is expanded to a sequence of movnt instructions.

In conclusion:
do you (Philip and Quentin) think this approach would be a good compromise?
For what is worth, this approach would still address most of our customer
concerns (i.e. it is probably as good as the previous patch).
If you prefer this approach more then I am okay with sending it for review
(especially if this approach makes everybody happy :-) ).

Cheers,
Andrea

On Fri, Oct 9, 2015 at 8:51 PM, Philip Reames <listmail at philipreames.com>
wrote:

> I don't have a strong enough opinion here to push the issue, but I think I
> still disagree with the chosen approach.  However, I will defer to you two
> since you seem to have discussed the tradeoffs and are better positioned to
> make the decision.
>
> p.s. If the patch to support non-temporal memsets was simple, it might be
> interesting in it's own right.  Even if we don't canonicalize to it, having
> the ability to chose it in the frontend could be interesting.  Note the
> "could"; I have no concrete use for this in the near future.
>
> Philip
>
>
>
>
> On 10/09/2015 11:15 AM, Quentin Colombet wrote:
>
> Hi Andrea,
>
> On Oct 9, 2015, at 10:58 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com>
> wrote:
>
>
>
> On Fri, Oct 9, 2015 at 6:09 PM, Quentin Colombet via llvm-commits <
> llvm-commits at lists.llvm.org> wrote:
>
>> Hi Philip,
>>
>> >
>> > One of my worries is about how disabling this will effect other
>> optimizations around non-temporal stores.  In particular, if we had a
>> memset following the stores, we'd no longer have the chance to combine them
>> into a single memset.  Replacing two non-temporal stores and a large memset
>> with a larger memset seems highly likely to be profitable.
>>
>> I am not sure about that. I would expect any memset implementation to
>> touch the cache and thus, to have an impact on the performance of the
>> surrounding code.
>>
>
> I agree with Quentin here. I don't think it would be better since you
> would end up polluting the caches.
>
>
>> >
>> > Analogously, replacing 8 adjacent non-temporal byte stores with a
>> single memset (and then i64 store) also seems profitable and possibly
>> disabled by this transform.
>> >
>> > Also, if we keep this change, the comment needs to be extended to
>> highlight this is a profitability decision, not a legality one.
>>
>> Agree with that part.
>>
>
> Adjacent stores like the ones you described should be easily handled by
> 'MergeConsecutiveStores' in the DAGCombiner. So, I don't think this
> transformation is pessimizing the codegen for that case.
> MergeConsecutiveStores should be able to merge those small stores and still
> preserve the nontemporal hint.
>
> Overall I am not completely against the idea of propagating the
> !nontemporal metadata to memset/memcpy (I actually have a patch that
> implements that for memset - if people really prefer that approach I can
> upload it for review). However, at the moment I don't see the advantage in
> doing it (I'd like to see an example where that transformation would be
> profitable).
>
>
> The thing I was agreeing with was that the comment should say that we
> block the transformation not because we couldn’t fix memset, but because it
> is not worth it given we would need to split the memset later. That was how
> I read the profitability part, though I guess Philip was more into what you
> understood :).
>
> Cheers,
> Q.
>
>
> -Andrea
>
> Thanks,
>
>> -Quentin
>>
>> >
>> > Philip
>> >
>> > On 10/09/2015 03:53 AM, Andrea Di Biagio via llvm-commits wrote:
>> >> Author: adibiagio
>> >> Date: Fri Oct  9 05:53:41 2015
>> >> New Revision: 249820
>> >>
>> >> URL:  <http://llvm.org/viewvc/llvm-project?rev=249820&view=rev>
>> http://llvm.org/viewvc/llvm-project?rev=249820&view=rev
>> >> Log:
>> >> [MemCpyOpt] Fix wrong merging adjacent nontemporal stores into memset
>> calls.
>> >>
>> >> Pass MemCpyOpt doesn't check if a store instruction is nontemporal.
>> >> As a consequence, adjacent nontemporal stores are always merged into a
>> >> memset call.
>> >>
>> >> Example:
>> >>
>> >> ;;;
>> >> define void @foo(<4 x float>* nocapture %p) {
>> >> entry:
>> >>   store <4 x float> zeroinitializer, <4 x float>* %p, align 16,
>> !nontemporal !0
>> >>   %p1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
>> >>   store <4 x float> zeroinitializer, <4 x float>* %p1, align 16,
>> !nontemporal !0
>> >>   ret void
>> >> }
>> >>
>> >> !0 = !{i32 1}
>> >> ;;;
>> >>
>> >> In this example, the two nontemporal stores are combined to a memset
>> of zero
>> >> which does not preserve the nontemporal hint. Later on the backend
>> (tested on a
>> >> x86-64 corei7) expands that memset call into a sequence of two normal
>> 16-byte
>> >> aligned vector stores.
>> >>
>> >> opt -memcpyopt example.ll -S -o - | llc -mcpu=corei7 -o -
>> >>
>> >> Before:
>> >>   xorps  %xmm0, %xmm0
>> >>   movaps  %xmm0, 16(%rdi)
>> >>   movaps  %xmm0, (%rdi)
>> >>
>> >> With this patch, we no longer merge nontemporal stores into calls to
>> memset.
>> >> In this example, llc correctly expands the two stores into two movntps:
>> >>   xorps  %xmm0, %xmm0
>> >>   movntps %xmm0, 16(%rdi)
>> >>   movntps  %xmm0, (%rdi)
>> >>
>> >> In theory, we could extend the usage of !nontemporal metadata to
>> memcpy/memset
>> >> calls. However a change like that would only have the effect of
>> forcing the
>> >> backend to expand !nontemporal memsets back to sequences of store
>> instructions.
>> >> A memset library call would not have exactly the same semantic of a
>> builtin
>> >> !nontemporal memset call. So, SelectionDAG will have to conservatively
>> expand
>> >> it back to a sequence of !nontemporal stores (effectively undoing the
>> merging).
>> >>
>> >> Differential Revision:  <http://reviews.llvm.org/D13519>
>> http://reviews.llvm.org/D13519
>> >>
>> >> Added:
>> >>     llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>> >> Modified:
>> >>     llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>> >>
>> >> Modified: llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
>> >> URL:
>> <http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff>
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff
>> >>
>> ==============================================================================
>> >> --- llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp (original)
>> >> +++ llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp Fri Oct  9
>> 05:53:41 2015
>> >> @@ -488,6 +488,16 @@ Instruction *MemCpyOpt::tryMergingIntoMe
>> >>    bool MemCpyOpt::processStore(StoreInst *SI, BasicBlock::iterator
>> &BBI) {
>> >>    if (!SI->isSimple()) return false;
>> >> +
>> >> +  // Avoid merging nontemporal stores since the resulting
>> >> +  // memcpy/memset would not be able to preserve the nontemporal hint.
>> >> +  // In theory we could teach how to propagate the !nontemporal
>> metadata to
>> >> +  // memset calls. However, that change would force the backend to
>> >> +  // conservatively expand !nontemporal memset calls back to
>> sequences of
>> >> +  // store instructions (effectively undoing the merging).
>> >> +  if (SI->getMetadata(LLVMContext::MD_nontemporal))
>> >> +    return false;
>> >> +
>> >>    const DataLayout &DL = SI->getModule()->getDataLayout();
>> >>      // Detect cases where we're performing call slot forwarding, but
>> >>
>> >> Added: llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
>> >> URL:
>> <http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto>
>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto
>> >>
>> ==============================================================================
>> >> --- llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll (added)
>> >> +++ llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll Fri Oct  9
>> 05:53:41 2015
>> >> @@ -0,0 +1,49 @@
>> >> +; RUN: opt < %s -memcpyopt -S | FileCheck %s
>> >> +
>> >> +target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
>> >> +
>> >> +; Verify that we don't combine nontemporal stores into memset calls.
>> >> +
>> >> +define void @nontemporal_stores_1(<4 x float>* nocapture %dst) {
>> >> +; CHECK-LABEL: @nontemporal_stores_1
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %dst, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr1, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr2, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr3, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr4, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr5, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr6, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr7, align
>> 16, !nontemporal !0
>> >> +; CHECK-NEXT: ret void
>> >> +entry:
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
>> !nontemporal !0
>> >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1, align 16,
>> !nontemporal !0
>> >> +  %ptr2 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 2
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr2, align 16,
>> !nontemporal !0
>> >> +  %ptr3 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 3
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr3, align 16,
>> !nontemporal !0
>> >> +  %ptr4 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 4
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr4, align 16,
>> !nontemporal !0
>> >> +  %ptr5 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 5
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr5, align 16,
>> !nontemporal !0
>> >> +  %ptr6 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 6
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr6, align 16,
>> !nontemporal !0
>> >> +  %ptr7 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 7
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr7, align 16,
>> !nontemporal !0
>> >> +  ret void
>> >> +}
>> >> +
>> >> +define void @nontemporal_stores_2(<4 x float>* nocapture %dst) {
>> >> +; CHECK-LABEL: @nontemporal_stores_2
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %dst, align
>> 16, !nontemporal !0
>> >> +; CHECK: store <4 x float> zeroinitializer, <4 x float>* %ptr1, align
>> 16, !nontemporal !0
>> >> +; CHECK-NEXT: ret void
>> >> +entry:
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %dst, align 16,
>> !nontemporal !0
>> >> +  %ptr1 = getelementptr inbounds <4 x float>, <4 x float>* %dst, i64 1
>> >> +  store <4 x float> zeroinitializer, <4 x float>* %ptr1, align 16,
>> !nontemporal !0
>> >> +  ret void
>> >> +}
>> >> +
>> >> +!0 = !{i32 1}
>> >>
>> >>
>> >> _______________________________________________
>> >> llvm-commits mailing list
>> >>  <llvm-commits at lists.llvm.org>llvm-commits at lists.llvm.org
>> >>  <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> >
>> > _______________________________________________
>> > llvm-commits mailing list
>> >  <llvm-commits at lists.llvm.org>llvm-commits at lists.llvm.org
>> >  <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20151009/4ae8e668/attachment.html>
-------------- next part --------------
Index: include/llvm/CodeGen/SelectionDAG.h
===================================================================

--- include/llvm/CodeGen/SelectionDAG.h	(revision 249814)
+++ include/llvm/CodeGen/SelectionDAG.h	(working copy)
@@ -722,7 +722,7 @@
 
   SDValue getMemset(SDValue Chain, SDLoc dl, SDValue Dst, SDValue Src,
                     SDValue Size, unsigned Align, bool isVol, bool isTailCall,
-                    MachinePointerInfo DstPtrInfo);
+                    bool isNontemporal, MachinePointerInfo DstPtrInfo);
 
   /// Helper function to make it easier to build SetCC's if you just
   /// have an ISD::CondCode instead of an SDValue.
Index: lib/CodeGen/SelectionDAG/SelectionDAG.cpp
===================================================================
--- lib/CodeGen/SelectionDAG/SelectionDAG.cpp	(revision 249814)
+++ lib/CodeGen/SelectionDAG/SelectionDAG.cpp	(working copy)
@@ -4445,6 +4445,7 @@
                                SDValue Chain, SDValue Dst,
                                SDValue Src, uint64_t Size,
                                unsigned Align, bool isVol,
+                               bool isNonTemporal,
                                MachinePointerInfo DstPtrInfo) {
   // Turn a memset of undef to nop.
   if (Src.getOpcode() == ISD::UNDEF)
@@ -4514,7 +4515,7 @@
     SDValue Store = DAG.getStore(Chain, dl, Value,
                                  getMemBasePlusOffset(Dst, DstOff, dl, DAG),
                                  DstPtrInfo.getWithOffset(DstOff),
-                                 isVol, false, Align);
+                                 isVol, isNonTemporal, Align);
     OutChains.push_back(Store);
     DstOff += VT.getSizeInBits() / 8;
     Size -= VTSize;
@@ -4653,7 +4654,8 @@
 
 SDValue SelectionDAG::getMemset(SDValue Chain, SDLoc dl, SDValue Dst,
                                 SDValue Src, SDValue Size,
-                                unsigned Align, bool isVol, bool isTailCall,
+                                unsigned Align, bool isVol,
+                                bool isTailCall, bool isNonTemporal,
                                 MachinePointerInfo DstPtrInfo) {
   assert(Align && "The SDAG layer expects explicit alignment and reserves 0");
 
@@ -4667,7 +4669,7 @@
 
     SDValue Result =
       getMemsetStores(*this, dl, Chain, Dst, Src, ConstantSize->getZExtValue(),
-                      Align, isVol, DstPtrInfo);
+                      Align, isVol, isNonTemporal, DstPtrInfo);
 
     if (Result.getNode())
       return Result;
Index: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
===================================================================
--- lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	(revision 249814)
+++ lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp	(working copy)
@@ -4378,8 +4378,10 @@
       Align = 1; // @llvm.memset defines 0 and 1 to both mean no alignment.
     bool isVol = cast<ConstantInt>(I.getArgOperand(4))->getZExtValue();
     bool isTC = I.isTailCall() && isInTailCallPosition(&I, DAG.getTarget());
+    bool isNonTemporal = I.getMetadata(LLVMContext::MD_nontemporal) != nullptr;
     SDValue MS = DAG.getMemset(getRoot(), sdl, Op1, Op2, Op3, Align, isVol,
-                               isTC, MachinePointerInfo(I.getArgOperand(0)));
+                               isTC, isNonTemporal,
+                               MachinePointerInfo(I.getArgOperand(0)));
     updateDAGForMaybeTailCall(MS);
     return nullptr;
   }
Index: lib/Target/X86/X86SelectionDAGInfo.cpp
===================================================================
--- lib/Target/X86/X86SelectionDAGInfo.cpp	(revision 249814)
+++ lib/Target/X86/X86SelectionDAGInfo.cpp	(working copy)
@@ -190,7 +190,7 @@
                                       DAG.getConstant(Offset, dl, AddrVT)),
                           Src,
                           DAG.getConstant(BytesLeft, dl, SizeVT),
-                          Align, isVolatile, false,
+                          Align, isVolatile, false, /* isNontemporal */ false,
                           DstPtrInfo.getWithOffset(Offset));
   }
 
Index: lib/Transforms/Scalar/MemCpyOptimizer.cpp
===================================================================
--- lib/Transforms/Scalar/MemCpyOptimizer.cpp	(revision 249814)
+++ lib/Transforms/Scalar/MemCpyOptimizer.cpp	(working copy)
@@ -379,6 +379,9 @@
   // are stored.
   MemsetRanges Ranges(DL);
 
+  // Is this a nontemporal store or a nontemporal memset?
+  MDNode *MDN = StartInst->getMetadata(LLVMContext::MD_nontemporal);
+
   BasicBlock::iterator BI = StartInst;
   for (++BI; !isa<TerminatorInst>(BI); ++BI) {
     if (!isa<StoreInst>(BI) && !isa<MemSetInst>(BI)) {
@@ -394,6 +397,12 @@
       // If this is a store, see if we can merge it in.
       if (!NextStore->isSimple()) break;
 
+      bool IsNextStoreNontemporal =
+          NextStore->getMetadata(LLVMContext::MD_nontemporal) != nullptr;
+      if ((!MDN && IsNextStoreNontemporal) ||
+          (MDN && !IsNextStoreNontemporal))
+        break;
+
       // Check to see if this stored value is of the same byte-splattable value.
       if (ByteVal != isBytewiseValue(NextStore->getOperand(0)))
         break;
@@ -412,6 +421,12 @@
           !isa<ConstantInt>(MSI->getLength()))
         break;
 
+      bool IsMemsetNontemporal = 
+          MSI->getMetadata(LLVMContext::MD_nontemporal) != nullptr;
+      if ((!MDN && IsMemsetNontemporal) ||
+          (MDN && !IsMemsetNontemporal))
+        break;
+
       // Check to see if this store is to a constant offset from the start ptr.
       int64_t Offset;
       if (!IsPointerOffset(StartPtr, MSI->getDest(), Offset, DL))
@@ -463,6 +478,8 @@
 
     AMemSet =
       Builder.CreateMemSet(StartPtr, ByteVal, Range.End-Range.Start, Alignment);
+    if (MDN)
+      AMemSet->setMetadata(LLVMContext::MD_nontemporal, MDN);
 
     DEBUG(dbgs() << "Replace stores:\n";
           for (unsigned i = 0, e = Range.TheStores.size(); i != e; ++i)
@@ -488,6 +505,7 @@
 
 bool MemCpyOpt::processStore(StoreInst *SI, BasicBlock::iterator &BBI) {
   if (!SI->isSimple()) return false;
+
   const DataLayout &DL = SI->getModule()->getDataLayout();
 
   // Detect cases where we're performing call slot forwarding, but