<html>
  <head>
    <meta content="text/html; charset=windows-1252"
      http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    I definitely like this approach better, but again, I'll leave the
    actual choice to you two.  I think you're better positioned to
    understand all the implications and use cases.  <br>
    <br>
    If you do chose to go with this approach, separating it into two
    patches would be helpful.  Patch one could add the backend support
    for a memset with the !nontemporal metadata.  Patch two would create
    them.  I can't really help with patch 1, but I can help review patch
    2 if desired.  (Figure I should volunteer some help if I'm asking
    you to do extra work.) <br>
    <br>
    Philip<br>
    <br>
    <div class="moz-cite-prefix">On 10/09/2015 02:24 PM, Andrea Di
      Biagio wrote:<br>
    </div>
    <blockquote
cite="mid:CAMw=4mKDjT_E46M033XDk6FFaqRQqpqMX=B_1DxR0Ce7_c_feQ@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div>
          <div>Hi Philip,<br>
            <br>
          </div>
          I have attached the prototype memset patch that I mentioned
          before.<br>
        </div>
        <div><br>
          The patch is not complicated and would allows to merge
          nontemporal stores together.<br>
          There are two important implications:<br>
          1) It will _not_ try to mix temporal stores with nontemporal
          stores. If people really really want the other behavior then
          we can think of adding a flag as a future extension.<br>
        </div>
        <div>2) Large memsets would still generate library calls. On
          X86, we would follow the usual rules (i.e.: memset is unrolled
          up to 16 stores; otherwise we generate a libcall). </div>
        <div><br>
          Point 2. is sub-optimal. However, I don't think it would occur
          often in practice to have to expand very big nontemporal
          memsets. Also, for very large memsets, the library
          implementation should be (hopefully!) smart enough to use a
          specialized loop of nontemporal stores.<br>
          <br>
        </div>
        <div>As a side note: the patch would require a change in the
          nontemporal.ll test.<br>
        </div>
        <div>I think it will also require extra x86 tests to make sure
          that the nontemporal memset is expanded to a sequence of movnt
          instructions.<br>
        </div>
        <div><br>
        </div>
        <div>In conclusion:<br>
        </div>
        <div>do you (Philip and Quentin) think this approach would be a
          good compromise?<br>
        </div>
        <div>For what is worth, this approach would still address most
          of our customer concerns (i.e. it is probably as good as the
          previous patch).<br>
          If you prefer this approach more then I am okay with sending
          it for review (especially if this approach makes everybody
          happy :-) ).<br>
          <br>
        </div>
        <div>Cheers,<br>
        </div>
        Andrea
        <div class="">
          <div id=":20h" class="" tabindex="0"><img
              moz-do-not-send="true" class=""
              src="https://ssl.gstatic.com/ui/v1/icons/mail/images/cleardot.gif"></div>
        </div>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Fri, Oct 9, 2015 at 8:51 PM, Philip
          Reames <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF"> I don't have a strong
              enough opinion here to push the issue, but I think I still
              disagree with the chosen approach.  However, I will defer
              to you two since you seem to have discussed the tradeoffs
              and are better positioned to make the decision.<br>
              <br>
              p.s. If the patch to support non-temporal memsets was
              simple, it might be interesting in it's own right.  Even
              if we don't canonicalize to it, having the ability to
              chose it in the frontend could be interesting.  Note the
              "could"; I have no concrete use for this in the near
              future.<br>
              <br>
              Philip
              <div>
                <div class="h5"><br>
                  <br>
                  <br>
                  <br>
                  <div>On 10/09/2015 11:15 AM, Quentin Colombet wrote:<br>
                  </div>
                  <blockquote type="cite"> Hi Andrea,
                    <div><br>
                      <div>
                        <blockquote type="cite">
                          <div>On Oct 9, 2015, at 10:58 AM, Andrea Di
                            Biagio <<a moz-do-not-send="true"
                              href="mailto:andrea.dibiagio@gmail.com"
                              target="_blank">andrea.dibiagio@gmail.com</a>>

                            wrote:</div>
                          <br>
                          <div><br>
                            <br
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">
                            <div class="gmail_quote"
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">On

                              Fri, Oct 9, 2015 at 6:09 PM, Quentin
                              Colombet via llvm-commits<span> </span><span
                                dir="ltr"><<a moz-do-not-send="true"
href="mailto:llvm-commits@lists.llvm.org" target="_blank">llvm-commits@lists.llvm.org</a>></span><span> </span>wrote:<br>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">Hi
                                Philip,<br>
                                <span></span><span><br>
                                  ><br>
                                  > One of my worries is about how
                                  disabling this will effect other
                                  optimizations around non-temporal
                                  stores.  In particular, if we had a
                                  memset following the stores, we'd no
                                  longer have the chance to combine them
                                  into a single memset.  Replacing two
                                  non-temporal stores and a large memset
                                  with a larger memset seems highly
                                  likely to be profitable.<br>
                                  <br>
                                </span>I am not sure about that. I would
                                expect any memset implementation to
                                touch the cache and thus, to have an
                                impact on the performance of the
                                surrounding code.<br>
                              </blockquote>
                              <div><br>
                              </div>
                              <div>I agree with Quentin here. I don't
                                think it would be better since you would
                                end up polluting the caches.<br>
                                <br>
                              </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><span><br>
                                  ><br>
                                  > Analogously, replacing 8 adjacent
                                  non-temporal byte stores with a single
                                  memset (and then i64 store) also seems
                                  profitable and possibly disabled by
                                  this transform.<br>
                                  ><br>
                                  > Also, if we keep this change, the
                                  comment needs to be extended to
                                  highlight this is a profitability
                                  decision, not a legality one.<br>
                                  <br>
                                </span>Agree with that part.<br>
                              </blockquote>
                              <div><br>
                              </div>
                              <div>Adjacent stores like the ones you
                                described should be easily handled by
                                'MergeConsecutiveStores' in the
                                DAGCombiner. So, I don't think this
                                transformation is pessimizing the
                                codegen for that case.
                                MergeConsecutiveStores should be able to
                                merge those small stores and still
                                preserve the nontemporal hint.<br>
                              </div>
                              <div><br>
                              </div>
                              <div>Overall I am not completely against
                                the idea of propagating the !nontemporal
                                metadata to memset/memcpy (I actually
                                have a patch that implements that for
                                memset - if people really prefer that
                                approach I can upload it for review).
                                However, at the moment I don't see the
                                advantage in doing it (I'd like to see
                                an example where that transformation
                                would be profitable).<br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        <div>The thing I was agreeing with was that the
                          comment should say that we block the
                          transformation not because we couldn’t fix
                          memset, but because it is not worth it given
                          we would need to split the memset later. That
                          was how I read the profitability part, though
                          I guess Philip was more into what you
                          understood :).</div>
                        <div><br>
                        </div>
                        <div>Cheers,</div>
                        <div>Q.</div>
                        <br>
                        <blockquote type="cite">
                          <div>
                            <div class="gmail_quote"
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">
                              <div><br>
                                -Andrea<br>
                              </div>
                              <div><br>
                                Thanks,<br>
                              </div>
                              <blockquote class="gmail_quote"
                                style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">-Quentin<br>
                                <div>
                                  <div><br>
                                    ><br>
                                    > Philip<br>
                                    ><br>
                                    > On 10/09/2015 03:53 AM, Andrea
                                    Di Biagio via llvm-commits wrote:<br>
                                    >> Author: adibiagio<br>
                                    >> Date: Fri Oct  9 05:53:41
                                    2015<br>
                                    >> New Revision: 249820<br>
                                    >><br>
                                    >> URL:<span> </span><a
                                      moz-do-not-send="true"
                                      href="http://llvm.org/viewvc/llvm-project?rev=249820&view=rev"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://llvm.org/viewvc/llvm-project?rev=249820&view=rev">http://llvm.org/viewvc/llvm-project?rev=249820&view=rev</a></a><br>
                                    >> Log:<br>
                                    >> [MemCpyOpt] Fix wrong
                                    merging adjacent nontemporal stores
                                    into memset calls.<br>
                                    >><br>
                                    >> Pass MemCpyOpt doesn't
                                    check if a store instruction is
                                    nontemporal.<br>
                                    >> As a consequence, adjacent
                                    nontemporal stores are always merged
                                    into a<br>
                                    >> memset call.<br>
                                    >><br>
                                    >> Example:<br>
                                    >><br>
                                    >> ;;;<br>
                                    >> define void @foo(<4 x
                                    float>* nocapture %p) {<br>
                                    >> entry:<br>
                                    >>   store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %p, align 16, !nontemporal !0<br>
                                    >>   %p1 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 1<br>
                                    >>   store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %p1, align 16, !nontemporal !0<br>
                                    >>   ret void<br>
                                    >> }<br>
                                    >><br>
                                    >> !0 = !{i32 1}<br>
                                    >> ;;;<br>
                                    >><br>
                                    >> In this example, the two
                                    nontemporal stores are combined to a
                                    memset of zero<br>
                                    >> which does not preserve the
                                    nontemporal hint. Later on the
                                    backend (tested on a<br>
                                    >> x86-64 corei7) expands that
                                    memset call into a sequence of two
                                    normal 16-byte<br>
                                    >> aligned vector stores.<br>
                                    >><br>
                                    >> opt -memcpyopt example.ll
                                    -S -o - | llc -mcpu=corei7 -o -<br>
                                    >><br>
                                    >> Before:<br>
                                    >>   xorps  %xmm0, %xmm0<br>
                                    >>   movaps  %xmm0, 16(%rdi)<br>
                                    >>   movaps  %xmm0, (%rdi)<br>
                                    >><br>
                                    >> With this patch, we no
                                    longer merge nontemporal stores into
                                    calls to memset.<br>
                                    >> In this example, llc
                                    correctly expands the two stores
                                    into two movntps:<br>
                                    >>   xorps  %xmm0, %xmm0<br>
                                    >>   movntps %xmm0, 16(%rdi)<br>
                                    >>   movntps  %xmm0, (%rdi)<br>
                                    >><br>
                                    >> In theory, we could extend
                                    the usage of !nontemporal metadata
                                    to memcpy/memset<br>
                                    >> calls. However a change
                                    like that would only have the effect
                                    of forcing the<br>
                                    >> backend to expand
                                    !nontemporal memsets back to
                                    sequences of store instructions.<br>
                                    >> A memset library call would
                                    not have exactly the same semantic
                                    of a builtin<br>
                                    >> !nontemporal memset call.
                                    So, SelectionDAG will have to
                                    conservatively expand<br>
                                    >> it back to a sequence of
                                    !nontemporal stores (effectively
                                    undoing the merging).<br>
                                    >><br>
                                    >> Differential Revision:<span> </span><a
                                      moz-do-not-send="true"
                                      href="http://reviews.llvm.org/D13519"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://reviews.llvm.org/D13519">http://reviews.llvm.org/D13519</a></a><br>
                                    >><br>
                                    >> Added:<br>
                                    >>   
                                     llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll<br>
                                    >> Modified:<br>
                                    >>   
                                     llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp<br>
                                    >><br>
                                    >> Modified:
                                    llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp<br>
                                    >> URL:<span> </span><a
                                      moz-do-not-send="true"
href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff">http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp?rev=249820&r1=249819&r2=249820&view=diff</a></a><br>
                                    >>
==============================================================================<br>
                                    >> ---
                                    llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
                                    (original)<br>
                                    >> +++
                                    llvm/trunk/lib/Transforms/Scalar/MemCpyOptimizer.cpp
                                    Fri Oct  9 05:53:41 2015<br>
                                    >> @@ -488,6 +488,16 @@
                                    Instruction
                                    *MemCpyOpt::tryMergingIntoMe<br>
                                    >>    bool
                                    MemCpyOpt::processStore(StoreInst
                                    *SI, BasicBlock::iterator &BBI)
                                    {<br>
                                    >>    if (!SI->isSimple())
                                    return false;<br>
                                    >> +<br>
                                    >> +  // Avoid merging
                                    nontemporal stores since the
                                    resulting<br>
                                    >> +  // memcpy/memset would
                                    not be able to preserve the
                                    nontemporal hint.<br>
                                    >> +  // In theory we could
                                    teach how to propagate the
                                    !nontemporal metadata to<br>
                                    >> +  // memset calls.
                                    However, that change would force the
                                    backend to<br>
                                    >> +  // conservatively expand
                                    !nontemporal memset calls back to
                                    sequences of<br>
                                    >> +  // store instructions
                                    (effectively undoing the merging).<br>
                                    >> +  if
                                    (SI->getMetadata(LLVMContext::MD_nontemporal))<br>
                                    >> +    return false;<br>
                                    >> +<br>
                                    >>    const DataLayout &DL
                                    =
                                    SI->getModule()->getDataLayout();<br>
                                    >>      // Detect cases where
                                    we're performing call slot
                                    forwarding, but<br>
                                    >><br>
                                    >> Added:
                                    llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll<br>
                                    >> URL:<span> </span><a
                                      moz-do-not-send="true"
href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto">http://llvm.org/viewvc/llvm-project/llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll?rev=249820&view=auto</a></a><br>
                                    >>
==============================================================================<br>
                                    >> ---
                                    llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
                                    (added)<br>
                                    >> +++
                                    llvm/trunk/test/Transforms/MemCpyOpt/nontemporal.ll
                                    Fri Oct  9 05:53:41 2015<br>
                                    >> @@ -0,0 +1,49 @@<br>
                                    >> +; RUN: opt < %s
                                    -memcpyopt -S | FileCheck %s<br>
                                    >> +<br>
                                    >> +target datalayout =
                                    "e-m:e-i64:64-f80:128-n8:16:32:64-S128"<br>
                                    >> +<br>
                                    >> +; Verify that we don't
                                    combine nontemporal stores into
                                    memset calls.<br>
                                    >> +<br>
                                    >> +define void
                                    @nontemporal_stores_1(<4 x
                                    float>* nocapture %dst) {<br>
                                    >> +; CHECK-LABEL:
                                    @nontemporal_stores_1<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %dst, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr1, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr2, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr3, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr4, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr5, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr6, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr7, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK-NEXT: ret void<br>
                                    >> +entry:<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %dst, align 16, !nontemporal !0<br>
                                    >> +  %ptr1 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 1<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr1, align 16, !nontemporal !0<br>
                                    >> +  %ptr2 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 2<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr2, align 16, !nontemporal !0<br>
                                    >> +  %ptr3 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 3<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr3, align 16, !nontemporal !0<br>
                                    >> +  %ptr4 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 4<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr4, align 16, !nontemporal !0<br>
                                    >> +  %ptr5 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 5<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr5, align 16, !nontemporal !0<br>
                                    >> +  %ptr6 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 6<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr6, align 16, !nontemporal !0<br>
                                    >> +  %ptr7 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 7<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr7, align 16, !nontemporal !0<br>
                                    >> +  ret void<br>
                                    >> +}<br>
                                    >> +<br>
                                    >> +define void
                                    @nontemporal_stores_2(<4 x
                                    float>* nocapture %dst) {<br>
                                    >> +; CHECK-LABEL:
                                    @nontemporal_stores_2<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %dst, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK: store <4 x
                                    float> zeroinitializer, <4 x
                                    float>* %ptr1, align 16,
                                    !nontemporal !0<br>
                                    >> +; CHECK-NEXT: ret void<br>
                                    >> +entry:<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %dst, align 16, !nontemporal !0<br>
                                    >> +  %ptr1 = getelementptr
                                    inbounds <4 x float>, <4 x
                                    float>* %dst, i64 1<br>
                                    >> +  store <4 x float>
                                    zeroinitializer, <4 x float>*
                                    %ptr1, align 16, !nontemporal !0<br>
                                    >> +  ret void<br>
                                    >> +}<br>
                                    >> +<br>
                                    >> +!0 = !{i32 1}<br>
                                    >><br>
                                    >><br>
                                    >>
                                    _______________________________________________<br>
                                    >> llvm-commits mailing list<br>
                                    >><span> </span><a
                                      moz-do-not-send="true"
                                      href="mailto:llvm-commits@lists.llvm.org"
                                      target="_blank"><a class="moz-txt-link-abbreviated" href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a></a><br>
                                    >><span> </span><a
                                      moz-do-not-send="true"
                                      href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a></a><br>
                                    ><br>
                                    >
                                    _______________________________________________<br>
                                    > llvm-commits mailing list<br>
                                    ><span> </span><a
                                      moz-do-not-send="true"
                                      href="mailto:llvm-commits@lists.llvm.org"
                                      target="_blank"><a class="moz-txt-link-abbreviated" href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a></a><br>
                                    ><span> </span><a
                                      moz-do-not-send="true"
                                      href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits"
                                      target="_blank"><a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a></a><br>
                                    <br>
_______________________________________________<br>
                                    llvm-commits mailing list<br>
                                    <a moz-do-not-send="true"
                                      href="mailto:llvm-commits@lists.llvm.org"
                                      target="_blank">llvm-commits@lists.llvm.org</a><br>
                                    <a moz-do-not-send="true"
                                      href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits"
                                      rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a></div>
                                </div>
                              </blockquote>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                      <br>
                    </div>
                  </blockquote>
                  <br>
                </div>
              </div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </body>
</html>