[lld] r248106 - COFF: Fix race condition.
Dmitry Vyukov via llvm-commits
llvm-commits at lists.llvm.org
Tue Sep 22 00:00:49 PDT 2015
An atomic increment of a shared counter can definitely affect
performance in parallel setups if it is on a hot path. I don't know
whether it is on the hot path here or not.
FWIW, I can suggest the following scheme:
std::atomic<int> x;
__thread int pos;
__thread int end;
int allocate_id() {
if (pos == end) {
pos = x.fetch_add(64);
end = pos + 64;
}
return pos++;
}
On Tue, Sep 22, 2015 at 1:08 AM, Rui Ueyama <ruiu at google.com> wrote:
> On Mon, Sep 21, 2015 at 4:04 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>
>>
>> On Mon, Sep 21, 2015 at 3:56 PM, Rui Ueyama <ruiu at google.com> wrote:
>>>
>>> I mean I would do that if there's a portable way in C++11 to get the
>>> current thread ID as a small integer.
>>
>>
>> Or an integer *at all*. Apparently it was decided to use an opaque class
>> for this with no way to access it as an integer/void*:
>> http://en.cppreference.com/w/cpp/thread/thread/id
>
>
> Yup. I don't think even POSIX provides that functionality. Everything goes
> through pthread_t, which is an opaque data type.
>
>>
>> -- Sean Silva
>>
>>>
>>> If we have such function, we could write like this.
>>>
>>> thread_local uint64_t NextID = 1;
>>>
>>> ...
>>>
>>> ID = (getThreadID() << 32) + NextID++;
>>>
>>> But I don't think such function exists.
>>>
>>> On Mon, Sep 21, 2015 at 3:49 PM, Rui Ueyama <ruiu at google.com> wrote:
>>>>
>>>> On Mon, Sep 21, 2015 at 3:35 PM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Sep 21, 2015 at 8:20 AM, Rui Ueyama <ruiu at google.com> wrote:
>>>>>>
>>>>>> On Sun, Sep 20, 2015 at 6:42 PM, Rui Ueyama <ruiu at google.com> wrote:
>>>>>>>
>>>>>>> We can, but I think we don't have to, because I didn't see any
>>>>>>> measurable difference between this code and the old racy one.
>>>>>>
>>>>>> ... in terms of speed.
>>>>>
>>>>>
>>>>> On x86 at a hardware level a similar amount of synchronization is
>>>>> required for both cases, so this isn't surprising (regardless of whether the
>>>>> access is atomic or not, the cache coherence mechanisms kick in). Simply
>>>>> replacing a non-atomic access with an atomic access is not enough. From
>>>>> inspection of the code, there seems to be quite a bit of work being done
>>>>> between increments of this value, so it's probably not going to matter, but
>>>>> if you someday change this to use fine-grained parallelism so that it scales
>>>>> linearly with # cores (it is highly parallelizable, so this should be
>>>>> possible), it will probably be necessary to factor this in.
>>>>>
>>>>> For reference, I have attached a simple test program that shows the
>>>>> relative impact of the different choices.
>>>>>
>>>>> Sean:~/tmp % clang++ -std=c++11 -DNUM_THREADS=12
>>>>> -DNUM_OPS_PER_THREAD=5000000 -O3 testsynchronization.cpp && ./a.out
>>>>>
>>>>> Per thread, non-atomic: 1.00x (baseline)
>>>>>
>>>>> Per-thread, atomic: 3.61x slower than baseline
>>>>>
>>>>> Shared, non-atomic: 29.47x slower than baseline
>>>>>
>>>>> Shared, atomic: 112.72x slower than baseline
>>>>>
>>>>>
>>>>> This is on a Xeon E5-2697 v2 (12 physical cores).
>>>>>
>>>>> As you can see, the effect of making the access non-atomic is very
>>>>> small compared to the effect of avoiding cache coherence overhead (3-4x vs.
>>>>> 30x).
>>>>
>>>>
>>>> Thank you for the interesting numbers. Shared atomic variables are indeed
>>>> much slower than non-shared, non-atomic variables when you have a lot of
>>>> concurrent writes. If I were really trying to squeeze the last drop out of
>>>> multiple cores, I'd do as you suggested so that threads share almost nothing.
>>>> I didn't do that simply because the cost of updating the atomic counter is
>>>> negligible in practice in this code: we do so much other work between
>>>> updates of the atomic variable. I generally don't optimize unless doing so
>>>> makes a measurable performance difference.
>>>>
>>>>>
>>>>> CC'ing Dmitry who may have more insight to give.
>>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> 2015/09/20 18:32 "Sean Silva" <chisophugis at gmail.com>:
>>>>>>>
>>>>>>>> It seems unfortunate for us to need any synchronization at all just
>>>>>>>> to generate a unique ID. Can we just have a bunch of counters, one per
>>>>>>>> thread, then initialize them to `ThreadIndex << 50` or something? (make sure
>>>>>>>> they are each on their own cache line)
>>>>>>>>
>>>>>>>> -- Sean Silva
>>>>>>>>
>>>>>>>> On Sat, Sep 19, 2015 at 6:44 PM, Rui Ueyama via llvm-commits
>>>>>>>> <llvm-commits at lists.llvm.org> wrote:
>>>>>>>>>
>>>>>>>>> Author: ruiu
>>>>>>>>> Date: Sat Sep 19 20:44:44 2015
>>>>>>>>> New Revision: 248106
>>>>>>>>>
>>>>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=248106&view=rev
>>>>>>>>> Log:
>>>>>>>>> COFF: Fix race condition.
>>>>>>>>>
>>>>>>>>> NextID is updated inside parallel_for_each, so it needs mutual
>>>>>>>>> exclusion.
>>>>>>>>>
>>>>>>>>> Modified:
>>>>>>>>> lld/trunk/COFF/ICF.cpp
>>>>>>>>>
>>>>>>>>> Modified: lld/trunk/COFF/ICF.cpp
>>>>>>>>> URL:
>>>>>>>>> http://llvm.org/viewvc/llvm-project/lld/trunk/COFF/ICF.cpp?rev=248106&r1=248105&r2=248106&view=diff
>>>>>>>>>
>>>>>>>>> ==============================================================================
>>>>>>>>> --- lld/trunk/COFF/ICF.cpp (original)
>>>>>>>>> +++ lld/trunk/COFF/ICF.cpp Sat Sep 19 20:44:44 2015
>>>>>>>>> @@ -73,7 +73,7 @@ private:
>>>>>>>>> bool forEachGroup(std::vector<SectionChunk *> &Chunks,
>>>>>>>>> Comparator Eq);
>>>>>>>>> bool partition(ChunkIterator Begin, ChunkIterator End,
>>>>>>>>> Comparator Eq);
>>>>>>>>>
>>>>>>>>> - uint64_t NextID = 1;
>>>>>>>>> + std::atomic<uint64_t> NextID = { 1 };
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> // Entry point to ICF.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> llvm-commits mailing list
>>>>>>>>> llvm-commits at lists.llvm.org
>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
--
Dmitry Vyukov, Software Engineer, dvyukov at google.com
Google Germany GmbH, Dienerstraße 12, 80331, München
Geschäftsführer: Graham Law, Christine Elizabeth Flores
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg