<div dir="ltr"><div><div>Hi Philip,<br><br></div>I have attached the prototype memset patch that I mentioned before.<br></div><div><br>The patch is not complicated and would allows to merge nontemporal stores together.<br>There are two important implications:<br>1)
 It will _not_ try to mix temporal stores with nontemporal stores. If 
people really really want the other behavior then we can think of adding
 a flag as a future extension.<br></div><div>2) Large memsets would 
still generate library calls. On X86, we would follow the usual rules 
(i.e.: memset is unrolled up to 16 stores; otherwise we generate a 
libcall). </div><div><br>Point 2. is sub-optimal. However, I don't think
 it would occur often in practice to have to expand very big nontemporal
 memsets. Also, for very large memsets, the library implementation 
should be (hopefully!) smart enough to use a specialized loop of 
nontemporal stores.<br><br></div><div>As a side note: the patch would require a change in the nontemporal.ll test.<br></div><div>I
 think it will also require extra x86 tests to make sure that the 
nontemporal memset is expanded to a sequence of movnt instructions.<br></div><div><br></div><div>In conclusion:<br></div><div>do you (Philip and Quentin) think this approach would be a good compromise?<br></div><div>For
 what is worth, this approach would still address most of our customer 
concerns (i.e. it is probably as good as the previous patch).<br>If you 
prefer this approach more then I am okay with sending it for review 
(especially if this approach makes everybody happy :-) ).<br><br></div><div>Cheers,<br></div>Andrea<div class=""><div id=":20h" class="" tabindex="0"><img class="" src="https://ssl.gstatic.com/ui/v1/icons/mail/images/cleardot.gif"></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 9, 2015 at 8:51 PM, Philip Reames <span dir="ltr"><<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    I don't have a strong enough opinion here to push the issue, but I
    think I still disagree with the chosen approach.  However, I will
    defer to you two since you seem to have discussed the tradeoffs and
    are better positioned to make the decision.<br>
    <br>
    p.s. If the patch to support non-temporal memsets was simple, it
    might be interesting in it's own right.  Even if we don't
    canonicalize to it, having the ability to chose it in the frontend
    could be interesting.  Note the "could"; I have no concrete use for
    this in the near future.<br>
    <br>
    Philip<div><div class="h5"><br>
    <br>
    <br>
    <br>
    <div>On 10/09/2015 11:15 AM, Quentin
      Colombet wrote:<br>
    </div>
    <blockquote type="cite">
      
      Hi Andrea,
      <div><br>
        <div>
          <blockquote type="cite">
            <div>On Oct 9, 2015, at 10:58 AM, Andrea Di Biagio
              <<a href="mailto:andrea.dibiagio@gmail.com" target="_blank">andrea.dibiagio@gmail.com</a>>
              wrote:</div>
            <br>
            <div><br>
              <br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">
              <div class="gmail_quote" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">On
                Fri, Oct 9, 2015 at 6:09 PM, Quentin Colombet via
                llvm-commits<span> </span><span dir="ltr"><<a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span><span> </span>wrote:<br>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi Philip,<br>
                  <span></span><span><br>
                    ><br>
                    > One of my worries is about how disabling this
                    will effect other optimizations around non-temporal
                    stores.  In particular, if we had a memset following
                    the stores, we'd no longer have the chance to
                    combine them into a single memset.  Replacing two
                    non-temporal stores and a large memset with a larger
                    memset seems highly likely to be profitable.<br>
                    <br>
                  </span>I am not sure about that. I would expect any
                  memset implementation to touch the cache and thus, to
                  have an impact on the performance of the surrounding
                  code.<br>
                </blockquote>
                <div><br>
                </div>
                <div>I agree with Quentin here. I don't think
                  it would be better since you would end up polluting
                  the caches.<br>
                  <br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><span><br>
                    ><br>
                    > Analogously, replacing 8 adjacent non-temporal
                    byte stores with a single memset (and then i64
                    store) also seems profitable and possibly disabled
                    by this transform.<br>
                    ><br>
                    > Also, if we keep this change, the comment needs
                    to be extended to highlight this is a profitability
                    decision, not a legality one.<br>
                    <br>
                  </span>Agree with that part.<br>
                </blockquote>
                <div><br>
                </div>
                <div>Adjacent stores like the ones you
                  described should be easily handled by
                  'MergeConsecutiveStores' in the DAGCombiner. So, I
                  don't think this transformation is pessimizing the
                  codegen for that case. MergeConsecutiveStores should
                  be able to merge those small stores and still preserve
                  the nontemporal hint.<br>
                </div>
                <div><br>
                </div>
                <div>Overall I am not completely against the
                  idea of propagating the !nontemporal metadata to
                  memset/memcpy (I actually have a patch that implements
                  that for memset - if people really prefer that
                  approach I can upload it for review). However, at the
                  moment I don't see the advantage in doing it (I'd like
                  to see an example where that transformation would be
                  profitable).<br>
                </div>
              </div>
            </div>
          </blockquote>
          <div><br>
          </div>
          <div>The thing I was agreeing with was that the comment should
            say that we block the transformation not because we couldn’t
            fix memset, but because it is not worth it given we would
            need to split the memset later. That was how I read the
            profitability part, though I guess Philip was more into what
            you understood :).</div>
          <div><br>
          </div>
          <div>Cheers,</div>
          <div>Q.</div>
          <br>
          <blockquote type="cite">
            <div>
              <div class="gmail_quote" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">
                <div><br>
                  -Andrea<br>
                </div>
                <div><br>
                  Thanks,<br>
                </div>
                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">-Quentin<br>
                  <div>
                    <div><br>
                      ><br>
                      > Philip<br>
                      ><br>
                      > On 10/09/2015 03:53 AM, Andrea Di Biagio via
                      llvm-commits wrote:<br>
                      >> Author: adibiagio<br>
                      >> Date: Fri Oct  9 05:53:41 2015<br>
                      >> New Revision: 249820<br>
                      >><br>
                      >> URL:<span> </span><a href="http://llvm.org/viewvc/llvm-project?rev=249820&view=rev" rel="noreferrer" target="_blank"></a><a href="http://llvm.org/viewvc/llvm-project?rev=249820&view=rev" target="_blank">http://llvm.org/viewvc/llvm-project?rev=249820&view=rev</a><br>
                      >> Log:<br>
                      >> [MemCpyOpt] Fix wrong merging adjacent
                      nontemporal stores into memset calls.<br>
                      >><br>
                      >> Pass MemCpyOpt doesn't check if a store
                      instruction is nontemporal.<br>
                      >> As a consequence, adjacent nontemporal
                      stores are always merged into a<br>
                      >> memset call.<br>
                      >><br>
                      >> Example:<br>
                      >><br>
                      >> ;;;<br>
                      >> define void @foo(<4 x float>*
                      nocapture %p) {<br>
                      >> entry:<br>
                      >>   store <4 x float>
                      zeroinitializer, <4 x float>* %p, align 16,
                      !nontemporal !0<br>
                      >>   %p1 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 1<br>
                      >>   store <4 x float>
                      zeroinitializer, <4 x float>* %p1, align 16,
                      !nontemporal !0<br>
                      >>   ret void<br>
                      >> }<br>
                      >><br>
                      >> !0 = !{i32 1}<br>
                      >> ;;;<br>
                      >><br>
                      >> In this example, the two nontemporal
                      stores are combined to a memset of zero<br>
                      >> which does not preserve the nontemporal
                      hint. Later on the backend (tested on a<br>
                      >> x86-64 corei7) expands that memset call
                      into a sequence of two normal 16-byte<br>
                      >> aligned vector stores.<br>
                      >><br>
                      >> opt -memcpyopt example.ll -S -o - | llc
                      -mcpu=corei7 -o -<br>
                      >><br>
                      >> Before:<br>
                      >>   xorps  %xmm0, %xmm0<br>
                      >>   movaps  %xmm0, 16(%rdi)<br>
                      >>   movaps  %xmm0, (%rdi)<br>
                      >><br>
                      >> With this patch, we no longer merge
                      nontemporal stores into calls to memset.<br>
                      >> In this example, llc correctly expands
                      the two stores into two movntps:<br>
                      >>   xorps  %xmm0, %xmm0<br>
                      >>   movntps %xmm0, 16(%rdi)<br>
                      >>   movntps  %xmm0, (%rdi)<br>
                      >><br>
                      >> In theory, we could extend the usage of
                      !nontemporal metadata to memcpy/memset<br>
                      >> calls. However a change like that would
                      only have the effect of forcing the<br>
                      >> backend to expand !nontemporal memsets
                      back to sequences of store instructions.<br>
                      >> A memset library call would not have
                      exactly the same semantic of a builtin<br>
                      >> !nontemporal memset call. So,
                      SelectionDAG will have to conservatively expand<br>
                      >> it back to a sequence of !nontemporal
                      stores (effectively undoing the merging).<br>
                      >><br>
                      >> Differential Revision:<span> </span><a href="http://reviews.llvm.org/D13519" rel="noreferrer" target="_blank"></a><a href="http://reviews.llvm.org/D13519" target="_blank">http://reviews.llvm.org/D13519</a><br>
                      >><br>
                      >> Added:<br>
                      >>   
                       llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll<br>
                      >> Modified:<br>
                      >>   
                       llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp<br>
                      >><br>
                      >> Modified:
                      llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp<br>
                      >> URL:<span> </span><a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff" rel="noreferrer" target="_blank"></a><a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff</a><br>
                      >>
==============================================================================<br>
                      >> ---
                      llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
                      (original)<br>
                      >> +++
                      llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
                      Fri Oct  9 05:53:41 2015<br>
                      >> @@ -488,6 +488,16 @@ Instruction
                      *MemCpyOpt::tryMergingIntoMe<br>
                      >>    bool MemCpyOpt::processStore(StoreInst
                      *SI, BasicBlock::iterator &BBI) {<br>
                      >>    if (!SI->isSimple()) return false;<br>
                      >> +<br>
                      >> +  // Avoid merging nontemporal stores
                      since the resulting<br>
                      >> +  // memcpy/memset would not be able to
                      preserve the nontemporal hint.<br>
                      >> +  // In theory we could teach how to
                      propagate the !nontemporal metadata to<br>
                      >> +  // memset calls. However, that change
                      would force the backend to<br>
                      >> +  // conservatively expand !nontemporal
                      memset calls back to sequences of<br>
                      >> +  // store instructions (effectively
                      undoing the merging).<br>
                      >> +  if
                      (SI->getMetadata(LLVMContext::MD_nontemporal))<br>
                      >> +    return false;<br>
                      >> +<br>
                      >>    const DataLayout &DL =
                      SI->getModule()->getDataLayout();<br>
                      >>      // Detect cases where we're
                      performing call slot forwarding, but<br>
                      >><br>
                      >> Added:
                      llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll<br>
                      >> URL:<span> </span><a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto" rel="noreferrer" target="_blank"></a><a href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto" target="_blank">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto</a><br>
                      >>
==============================================================================<br>
                      >> ---
                      llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
                      (added)<br>
                      >> +++
                      llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
                      Fri Oct  9 05:53:41 2015<br>
                      >> @@ -0,0 +1,49 @@<br>
                      >> +; RUN: opt < %s -memcpyopt -S |
                      FileCheck %s<br>
                      >> +<br>
                      >> +target datalayout =
                      "e-m:e-i64:64-f80:128-n8:16:32:64-S128"<br>
                      >> +<br>
                      >> +; Verify that we don't combine
                      nontemporal stores into memset calls.<br>
                      >> +<br>
                      >> +define void @nontemporal_stores_1(<4
                      x float>* nocapture %dst) {<br>
                      >> +; CHECK-LABEL: @nontemporal_stores_1<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %dst, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr1, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr2, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr3, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr4, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr5, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr6, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr7, align
                      16, !nontemporal !0<br>
                      >> +; CHECK-NEXT: ret void<br>
                      >> +entry:<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %dst, align
                      16, !nontemporal !0<br>
                      >> +  %ptr1 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 1<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr1, align
                      16, !nontemporal !0<br>
                      >> +  %ptr2 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 2<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr2, align
                      16, !nontemporal !0<br>
                      >> +  %ptr3 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 3<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr3, align
                      16, !nontemporal !0<br>
                      >> +  %ptr4 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 4<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr4, align
                      16, !nontemporal !0<br>
                      >> +  %ptr5 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 5<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr5, align
                      16, !nontemporal !0<br>
                      >> +  %ptr6 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 6<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr6, align
                      16, !nontemporal !0<br>
                      >> +  %ptr7 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 7<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr7, align
                      16, !nontemporal !0<br>
                      >> +  ret void<br>
                      >> +}<br>
                      >> +<br>
                      >> +define void @nontemporal_stores_2(<4
                      x float>* nocapture %dst) {<br>
                      >> +; CHECK-LABEL: @nontemporal_stores_2<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %dst, align
                      16, !nontemporal !0<br>
                      >> +; CHECK: store <4 x float>
                      zeroinitializer, <4 x float>* %ptr1, align
                      16, !nontemporal !0<br>
                      >> +; CHECK-NEXT: ret void<br>
                      >> +entry:<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %dst, align
                      16, !nontemporal !0<br>
                      >> +  %ptr1 = getelementptr inbounds <4 x
                      float>, <4 x float>* %dst, i64 1<br>
                      >> +  store <4 x float>
                      zeroinitializer, <4 x float>* %ptr1, align
                      16, !nontemporal !0<br>
                      >> +  ret void<br>
                      >> +}<br>
                      >> +<br>
                      >> +!0 = !{i32 1}<br>
                      >><br>
                      >><br>
                      >>
                      _______________________________________________<br>
                      >> llvm-commits mailing list<br>
                      >><span> </span><a href="mailto:llvm-commits@lists.llvm.org" target="_blank"></a><a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>
                      >><span> </span><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank"></a><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>
                      ><br>
                      >
                      _______________________________________________<br>
                      > llvm-commits mailing list<br>
                      ><span> </span><a href="mailto:llvm-commits@lists.llvm.org" target="_blank"></a><a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>
                      ><span> </span><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank"></a><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>
                      <br>
                      _______________________________________________<br>
                      llvm-commits mailing list<br>
                      <a href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a><br>
                      <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a></div>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </div></div></div>

</blockquote></div><br></div>