<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>I will put the code up for review, including the pair of
benchmarks that I have authored.<br>
</p>
<p><br>
</p>
<p>For armv7a, armv8, and aarch64, I used the following reduced .cpp
code to generate assembly. I haven't figured out the libc++
cross-compilation setup yet, and I'm not well set up to run
benchmarks on those platforms.<br>
</p>
<pre>
struct __shared_weak_count {
    long __shared_weak_owners_;
    void __release_weak() noexcept;
    void __on_zero_shared_weak() noexcept;
};

void
__shared_weak_count::__release_weak() noexcept {
    if (__atomic_load_n(&__shared_weak_owners_, 2 /*_AO_Acquire*/) == 0)
        __on_zero_shared_weak();
    else if (__atomic_add_fetch(&__shared_weak_owners_, -1, 4 /*_AO_Acq_Rel*/) == -1)
        __on_zero_shared_weak();
}
</pre>
<p>ARMv7a assembly, with notes:</p>
<pre>
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        ldr     r1, [r0]   @bcraig note: it would be nice to combine this load with the ldrex
        dmb     ish        @bcraig note: ... and this barrier with the one in BB#2
        cmp     r1, #0
        it      eq         @bcraig note: unsure of why this is here...
        beq     _ZN19__shared_weak_count21__on_zero_shared_weakEv
        dmb     ish
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldrex   r1, [r0]
        subs    r2, r1, #1
        strex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        dmb     ish
        it      ne
        bxne    lr
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
</pre>
<br>
Aarch64 assembly, with notes:
<pre>
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        lda     r1, [r0]   @bcraig note: it would be nice to combine this with the ldaex
        cbz     r1, .LBB0_3
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldaex   r1, [r0]
        subs    r2, r1, #1
        stlex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        bne     .LBB0_4
.LBB0_3:                                @ %if.then5
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
.LBB0_4:                                @ %if.end6
        bx      lr
</pre>
<br>
<br>
<div class="moz-cite-prefix">On 7/18/2016 12:29 PM, JF Bastien
wrote:<br>
</div>
<blockquote
cite="mid:CABdywOf2FN00vxx_-z4My=GXxMr8SQZSq5XCLEQ6aaSn=ZPaFQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On Mon, Jul 18, 2016 at 8:31 AM,
Craig, Ben <span dir="ltr"><<a moz-do-not-send="true"
href="mailto:ben.craig@codeaurora.org" target="_blank">ben.craig@codeaurora.org</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Currently,
when the last shared_ptr to an object is destroyed, libc++
performs two atomic decrements, one for the "strong"
shared count, and one for the "weak" shared count. I
think we can do better than this in the uncontended case,
but I would like some feedback for this optimization,
particularly on the ARM side.<br>
<br>
Here's the code change...<br>
diff --git a/src/memory.cpp b/src/memory.cpp<br>
index 08f2259..b459eb1 100644<br>
--- a/src/memory.cpp<br>
+++ b/src/memory.cpp<br>
@@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT<br>
return __libcpp_atomic_add(&t, 1, _AO_Relaxed);<br>
}<br>
<br>
template <class T><br>
inline T<br>
decrement(T& t) _NOEXCEPT<br>
{<br>
return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);<br>
}<br>
<br>
} // namespace<br>
@@ -96,7 +96,9 @@ __shared_weak_count::__release_shared()
_NOEXCEPT<br>
void<br>
__shared_weak_count::__release_weak() _NOEXCEPT<br>
{<br>
- if (decrement(__shared_weak_owners_) == -1)<br>
+ if (__libcpp_atomic_load(&__shared_weak_owners_,
_AO_Acquire) == 0)<br>
+ __on_zero_shared_weak();<br>
+ else if (decrement(__shared_weak_owners_) == -1)<br>
__on_zero_shared_weak();<br>
}<br>
<br>
The general idea is that if the current thread is
destroying the last weak reference, then no other thread
can legally be accessing this object. Given that, we can
avoid an expensive atomic store. On x86_64, a
quick-and-dirty benchmark is showing an 8% improvement in
performance for the combination of make_shared<int>
and the accompanying destruction. I don't have
performance numbers for other architectures at this
point. That 8% is pretty promising though, as the atomic
operation improvements are showing through, despite being
measured along with a heap allocation and deallocation.<br>
</blockquote>
<div><br>
</div>
<div>Do you have a repo with this benchmark?</div>
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Note that this optimization wouldn't be safe for the
strong count, as the last strong count decrement can still
contend with a weak_ptr::lock() call.<br>
<br>
This comes at the cost of adding an extra load acquire for
all but the last decrement (and sometimes even the last
decrement). On x86, this is really cheap (just a regular
mov). Currently with aarch64 and 32-bit armv8, you get an
extra lda, and with armv7 you get extra barriers.<br>
<br>
I would hope / expect that on LL/SC architectures, the
first acquire load could be folded with the locked load in
the atomic add. The check and branch (inside the ll / sc
loop) would then be the only overhead. Is this a
reasonable optimization to hope for in the future on the
compiler front?<br>
</blockquote>
<div><br>
</div>
<div>What do you mean exactly, could you provide assembly? I
think I understand (sounds clever & doable), but
assembly is easier :-)</div>
<div>That can be benchmarked as well.</div>
<div><br>
</div>
<div><br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
Also, I'm being a bit conservative here by making my
atomic load an acquire operation. It might be safe to
make the operation relaxed, but that seems risky to me, as
__on_zero_shared_weak may end up touching unsynchronized
data in those cases.</blockquote>
<div><br>
</div>
<div>I haven't thought enough about shared_ptr to convince
myself either way. Would be good to benchmark to see if
it's even worth proving. </div>
</div>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
</pre>
</body>
</html>