<p dir="ltr">Thanks! Nice pattern!</p>

<div class="gmail_quote">On Sep 22, 2015 12:01 AM, "Dmitry Vyukov" <<a href="mailto:dvyukov@google.com">dvyukov@google.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">An atomic increment of a shared counter definitely can affect<br>

performance in parallel setups if it is on hot path. I don't whether<br>

it is on hot path here or not.<br>

<br>

FWIW, I can suggest the following scheme:<br>

<br>

std::atomic<int> x;<br>

<br>

__thread int pos;<br>

__thread int end;<br>

<br>

int allocate_id() {<br>

    if (pos == end) {<br>

        pos = x.fetch_add(64);<br>

        end = pos + 64;<br>

    }<br>

    return pos++;<br>

}<br>

<br>

<br>

<br>

On Tue, Sep 22, 2015 at 1:08 AM, Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br>

> On Mon, Sep 21, 2015 at 4:04 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>

>><br>

>><br>

>><br>

>> On Mon, Sep 21, 2015 at 3:56 PM, Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br>

>>><br>

>>> I mean I would do that if there's a portable way in C++11 to get the<br>

>>> current thread ID as a small integer.<br>

>><br>

>><br>

>> Or an integer *at all*. Apparently it was decided to use an opaque class<br>

>> for this with no way to access it as an integer/void*:<br>

>> <a href="http://en.cppreference.com/w/cpp/thread/thread/id" rel="noreferrer" target="_blank">http://en.cppreference.com/w/cpp/thread/thread/id</a><br>

><br>

><br>

> Yup. I don't think even POSIX provides that functionality. Everything is in<br>

> pthread_t which is an opaque data type.<br>

><br>

>><br>

>> -- Sean Silva<br>

>><br>

>>><br>

>>> If we have such function, we could write like this.<br>

>>><br>

>>>   thread_local uint64_t NextID = 1;<br>

>>><br>

>>>  ...<br>

>>><br>

>>>   ID = (getThreadID() << 32) + NextID++;<br>

>>><br>

>>> But I don't think such function exists.<br>

>>><br>

>>> On Mon, Sep 21, 2015 at 3:49 PM, Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br>

>>>><br>

>>>> On Mon, Sep 21, 2015 at 3:35 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>><br>

>>>> wrote:<br>

>>>>><br>

>>>>><br>

>>>>><br>

>>>>> On Mon, Sep 21, 2015 at 8:20 AM, Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br>

>>>>>><br>

>>>>>> On Sun, Sep 20, 2015 at 6:42 PM, Rui Ueyama <<a href="mailto:ruiu@google.com">ruiu@google.com</a>> wrote:<br>

>>>>>>><br>

>>>>>>> We can, but I think we don't have to, because I didn't see any<br>

>>>>>>> measurable difference between this code and the old racy one.<br>

>>>>>><br>

>>>>>> ... in terms of speed.<br>

>>>>><br>

>>>>><br>

>>>>> On x86 at a hardware level a similar amount of synchronization is<br>

>>>>> required for both cases, so this isn't surprising (regardless of whether the<br>

>>>>> access is atomic or not, the cache coherence mechanisms kick in). Simply<br>

>>>>> replacing a non-atomic access with an atomic access is not enough. From<br>

>>>>> inspection of the code, there seems to be quite a bit of work being done<br>

>>>>> between increments of this value, so it's probably not going to matter, but<br>

>>>>> if you someday change this to use fine-grained parallelism so that it scales<br>

>>>>> linearly with # cores (it is highly parallelizable, so this should be<br>

>>>>> possible), it will probably be necessary to factor this in.<br>

>>>>><br>

>>>>> For reference, I have attached a simple test program that shows the<br>

>>>>> relative impact of the different choices.<br>

>>>>><br>

>>>>> Sean:~/tmp % clang++ -std=c++11 -DNUM_THREADS=12<br>

>>>>> -DNUM_OPS_PER_THREAD=5000000 -O3 testsynchronization.cpp && ./a.out<br>

>>>>><br>

>>>>> Per thread, non-atomic: 1.00x (baseline)<br>

>>>>><br>

>>>>> Per-thread, atomic:     3.61x slower than baseline<br>

>>>>><br>

>>>>> Shared, non-atomic:     29.47x slower than baseline<br>

>>>>><br>

>>>>> Shared, atomic:         112.72x slower than baseline<br>

>>>>><br>

>>>>><br>

>>>>> This is on a Xeon E5-2697 v2 (12 physical cores).<br>

>>>>><br>

>>>>> As you can see, the effect of making the access non-atomic is very<br>

>>>>> small compared to the effect of avoiding cache coherence overhead (3-4x vs.<br>

>>>>> 30x).<br>

>>>><br>

>>>><br>

>>>> Thank you for the interesting numbers. Shared atomic variable are indeed<br>

>>>> much slower than non-shared, non-atomic variables when you have a lot of<br>

>>>> concurrent writes. If I'm really trying to get until the last drop from<br>

>>>> multi-cores, I'd do as you suggested so that threads share almost nothing. I<br>

>>>> didn't do that simply because the cost of updating the atomic counter is in<br>

>>>> practice negligible in this code because we are doing so many things other<br>

>>>> than updating the atomic variable. I don't generally optimize unless doing<br>

>>>> that makes measurable performance difference.<br>

>>>><br>

>>>>><br>

>>>>> CC'ing Dmitry who may have more insight to give.<br>

>>>>><br>

>>>>> -- Sean Silva<br>

>>>>><br>

>>>>>><br>

>>>>>><br>

>>>>>><br>

>>>>>>><br>

>>>>>>> 2015/09/20 18:32 "Sean Silva" <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>>:<br>

>>>>>>><br>

>>>>>>>> It seems unfortunate for us to need any synchronization at all just<br>

>>>>>>>> to generate a unique ID. Can we just have a bunch of counters, one per<br>

>>>>>>>> thread, then initialize them to `ThreadIndex << 50` or something? (make sure<br>

>>>>>>>> they are each on their own cache line)<br>

>>>>>>>><br>

>>>>>>>> -- Sean Silva<br>

>>>>>>>><br>

>>>>>>>> On Sat, Sep 19, 2015 at 6:44 PM, Rui Ueyama via llvm-commits<br>

>>>>>>>> <<a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a>> wrote:<br>

>>>>>>>>><br>

>>>>>>>>> Author: ruiu<br>

>>>>>>>>> Date: Sat Sep 19 20:44:44 2015<br>

>>>>>>>>> New Revision: 248106<br>

>>>>>>>>><br>

>>>>>>>>> URL: <a href="http://llvm.org/viewvc/llvm-project?rev=248106&view=rev" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=248106&view=rev</a><br>

>>>>>>>>> Log:<br>

>>>>>>>>> COFF: Fix race condition.<br>

>>>>>>>>><br>

>>>>>>>>> NextID is updated inside parallel_for_each, so it needs mutual<br>

>>>>>>>>> exclusion.<br>

>>>>>>>>><br>

>>>>>>>>> Modified:<br>

>>>>>>>>>     lld/trunk/COFF/ICF.cpp<br>

>>>>>>>>><br>

>>>>>>>>> Modified: lld/trunk/COFF/ICF.cpp<br>

>>>>>>>>> URL:<br>

>>>>>>>>> <a href="http://llvm.org/viewvc/llvm-project/lld/trunk/COFF/ICF.cpp?rev=248106&r1=248105&r2=248106&view=diff" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project/lld/trunk/COFF/ICF.cpp?rev=248106&r1=248105&r2=248106&view=diff</a><br>

>>>>>>>>><br>

>>>>>>>>> ==============================================================================<br>

>>>>>>>>> --- lld/trunk/COFF/ICF.cpp (original)<br>

>>>>>>>>> +++ lld/trunk/COFF/ICF.cpp Sat Sep 19 20:44:44 2015<br>

>>>>>>>>> @@ -73,7 +73,7 @@ private:<br>

>>>>>>>>>    bool forEachGroup(std::vector<SectionChunk *> &Chunks,<br>

>>>>>>>>> Comparator Eq);<br>

>>>>>>>>>    bool partition(ChunkIterator Begin, ChunkIterator End,<br>

>>>>>>>>> Comparator Eq);<br>

>>>>>>>>><br>

>>>>>>>>> -  uint64_t NextID = 1;<br>

>>>>>>>>> +  std::atomic<uint64_t> NextID = { 1 };<br>

>>>>>>>>>  };<br>

>>>>>>>>><br>

>>>>>>>>>  // Entry point to ICF.<br>

>>>>>>>>><br>

>>>>>>>>><br>

>>>>>>>>> _______________________________________________<br>

>>>>>>>>> llvm-commits mailing list<br>

>>>>>>>>> <a href="mailto:llvm-commits@lists.llvm.org">llvm-commits@lists.llvm.org</a><br>

>>>>>>>>> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits</a><br>

>>>>>>>><br>

>>>>>>>><br>

>>>>>><br>

>>>>><br>

>>>><br>

>>><br>

>><br>

><br>

<br>

<br>

<br>

--<br>

Dmitry Vyukov, Software Engineer, <a href="mailto:dvyukov@google.com">dvyukov@google.com</a><br>

Google Germany GmbH, Dienerstraße 12, 80331, München<br>

Geschäftsführer: Graham Law, Christine Elizabeth Flores<br>

Registergericht und -nummer: Hamburg, HRB 86891<br>

Sitz der Gesellschaft: Hamburg<br>

Diese E-Mail ist vertraulich. Wenn Sie nicht der richtige Adressat<br>

sind, leiten Sie diese bitte nicht weiter, informieren Sie den<br>

Absender und löschen Sie die E-Mail und alle Anhänge. Vielen Dank.<br>

This e-mail is confidential. If you are not the right addressee please<br>

do not forward it, please inform the sender, and please erase this<br>

e-mail including any attachments. Thanks.<br>

</blockquote></div>