[lld] r248106 - COFF: Fix race condition.

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Mon Sep 28 04:24:41 PDT 2015


Thanks! Nice pattern!
On Sep 22, 2015 12:01 AM, "Dmitry Vyukov" <dvyukov at google.com> wrote:

> An atomic increment of a shared counter can definitely affect
> performance in parallel setups if it is on the hot path. I don't know
> whether it is on the hot path here or not.
>
> FWIW, I can suggest the following scheme:
>
> #include <atomic>
>
> std::atomic<int> x;
>
> // Per-thread cursor into a block of IDs reserved from the shared
> // counter; refilled only once every 64 allocations.
> __thread int pos;
> __thread int end;
>
> int allocate_id() {
>     if (pos == end) {
>         // Reserve the next block of 64 IDs with a single atomic RMW.
>         pos = x.fetch_add(64);
>         end = pos + 64;
>     }
>     return pos++;
> }
>
>
>
> On Tue, Sep 22, 2015 at 1:08 AM, Rui Ueyama <ruiu at google.com> wrote:
> > On Mon, Sep 21, 2015 at 4:04 PM, Sean Silva <chisophugis at gmail.com> wrote:
> >>
> >>
> >>
> >> On Mon, Sep 21, 2015 at 3:56 PM, Rui Ueyama <ruiu at google.com> wrote:
> >>>
> >>> I mean I would do that if there's a portable way in C++11 to get the
> >>> current thread ID as a small integer.
> >>
> >>
> >> Or an integer *at all*. Apparently it was decided to use an opaque class
> >> for this with no way to access it as an integer/void*:
> >> http://en.cppreference.com/w/cpp/thread/thread/id
> >
> >
> > Yup. I don't think even POSIX provides that functionality. Everything is
> > in pthread_t, which is an opaque data type.
> >
> >>
> >> -- Sean Silva
> >>
> >>>
> >>> If we had such a function, we could write it like this:
> >>>
> >>>   thread_local uint64_t NextID = 1;
> >>>
> >>>   ...
> >>>
> >>>   ID = (getThreadID() << 32) + NextID++;
> >>>
> >>> But I don't think such a function exists.
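> >>>
> >>> Though for illustration only, something like this hypothetical helper
> >>> could approximate it (getThreadIndex and newID are made-up names, not
> >>> an existing API): hand out small dense indexes lazily from a shared
> >>> atomic counter, along these lines:
> >>>
> >>>   #include <atomic>
> >>>   #include <cstdint>
> >>>
> >>>   // The first time a thread asks, it grabs the next small index from
> >>>   // a shared atomic counter and remembers it for the rest of its life.
> >>>   static std::atomic<uint32_t> NextThreadIndex{0};
> >>>
> >>>   static uint32_t getThreadIndex() {
> >>>     static thread_local uint32_t Index = NextThreadIndex.fetch_add(1);
> >>>     return Index;
> >>>   }
> >>>
> >>>   // Per-thread sequence number; the thread index in the high bits
> >>>   // keeps combined IDs globally unique with no further synchronization.
> >>>   static thread_local uint64_t NextID = 1;
> >>>
> >>>   uint64_t newID() {
> >>>     return ((uint64_t)getThreadIndex() << 32) | NextID++;
> >>>   }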
> >>>
> >>> On Mon, Sep 21, 2015 at 3:49 PM, Rui Ueyama <ruiu at google.com> wrote:
> >>>>
> >>>> On Mon, Sep 21, 2015 at 3:35 PM, Sean Silva <chisophugis at gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Mon, Sep 21, 2015 at 8:20 AM, Rui Ueyama <ruiu at google.com> wrote:
> >>>>>>
> >>>>>>> On Sun, Sep 20, 2015 at 6:42 PM, Rui Ueyama <ruiu at google.com> wrote:
> >>>>>>>
> >>>>>>> We can, but I think we don't have to, because I didn't see any
> >>>>>>> measurable difference between this code and the old racy one.
> >>>>>>
> >>>>>> ... in terms of speed.
> >>>>>
> >>>>>
> >>>>> On x86, at a hardware level, a similar amount of synchronization is
> >>>>> required for both cases, so this isn't surprising (regardless of
> >>>>> whether the access is atomic or not, the cache coherence mechanisms
> >>>>> kick in). Simply replacing a non-atomic access with an atomic access
> >>>>> is not enough. From inspection of the code, there seems to be quite a
> >>>>> bit of work being done between increments of this value, so it's
> >>>>> probably not going to matter, but if you someday change this to use
> >>>>> fine-grained parallelism so that it scales linearly with # cores (it
> >>>>> is highly parallelizable, so this should be possible), it will
> >>>>> probably be necessary to factor this in.
> >>>>>
> >>>>> For reference, I have attached a simple test program that shows the
> >>>>> relative impact of the different choices.
> >>>>>
> >>>>> Sean:~/tmp % clang++ -std=c++11 -DNUM_THREADS=12
> >>>>> -DNUM_OPS_PER_THREAD=5000000 -O3 testsynchronization.cpp && ./a.out
> >>>>>
> >>>>> Per thread, non-atomic: 1.00x (baseline)
> >>>>>
> >>>>> Per-thread, atomic:     3.61x slower than baseline
> >>>>>
> >>>>> Shared, non-atomic:     29.47x slower than baseline
> >>>>>
> >>>>> Shared, atomic:         112.72x slower than baseline
> >>>>>
> >>>>>
> >>>>> This is on a Xeon E5-2697 v2 (12 physical cores).
> >>>>>
> >>>>> As you can see, the effect of making the access non-atomic is very
> >>>>> small compared to the effect of avoiding cache coherence overhead
> >>>>> (3-4x vs. 30x).
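> >>>>>
> >>>>> For illustration (this is not the attached program, just a rough
> >>>>> self-contained sketch), comparing a single shared atomic counter
> >>>>> against per-thread cache-line-padded counters can look like this:
> >>>>>
> >>>>>   #include <atomic>
> >>>>>   #include <chrono>
> >>>>>   #include <cstdint>
> >>>>>   #include <cstdio>
> >>>>>   #include <thread>
> >>>>>   #include <vector>
> >>>>>
> >>>>>   #ifndef NUM_THREADS
> >>>>>   #define NUM_THREADS 12
> >>>>>   #endif
> >>>>>   #ifndef NUM_OPS_PER_THREAD
> >>>>>   #define NUM_OPS_PER_THREAD 5000000
> >>>>>   #endif
> >>>>>
> >>>>>   // One shared counter that every thread hammers; its cache line
> >>>>>   // bounces between cores on every increment.
> >>>>>   static std::atomic<uint64_t> Shared{0};
> >>>>>
> >>>>>   // One counter per thread, each padded to its own cache line, so
> >>>>>   // increments never contend.
> >>>>>   struct alignas(64) Padded { std::atomic<uint64_t> V{0}; };
> >>>>>   static Padded PerThread[NUM_THREADS];
> >>>>>
> >>>>>   template <class F> static double run(F Op) {
> >>>>>     auto Start = std::chrono::steady_clock::now();
> >>>>>     std::vector<std::thread> Threads;
> >>>>>     for (int T = 0; T < NUM_THREADS; ++T)
> >>>>>       Threads.emplace_back([Op, T] {
> >>>>>         for (int I = 0; I < NUM_OPS_PER_THREAD; ++I)
> >>>>>           Op(T);
> >>>>>       });
> >>>>>     for (auto &Th : Threads)
> >>>>>       Th.join();
> >>>>>     return std::chrono::duration<double>(
> >>>>>                std::chrono::steady_clock::now() - Start).count();
> >>>>>   }
> >>>>>
> >>>>>   int main() {
> >>>>>     double Local = run([](int T) {
> >>>>>       PerThread[T].V.fetch_add(1, std::memory_order_relaxed);
> >>>>>     });
> >>>>>     double Global = run([](int) {
> >>>>>       Shared.fetch_add(1, std::memory_order_relaxed);
> >>>>>     });
> >>>>>     std::printf("per-thread atomic: %.2fs, shared atomic: %.2fs (%.1fx)\n",
> >>>>>                 Local, Global, Global / Local);
> >>>>>   }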
> >>>>
> >>>>
> >>>> Thank you for the interesting numbers. Shared atomic variables are
> >>>> indeed much slower than non-shared, non-atomic variables when you have
> >>>> a lot of concurrent writes. If I were really trying to squeeze the last
> >>>> drop out of multiple cores, I'd do as you suggested so that threads
> >>>> share almost nothing. I didn't do that simply because the cost of
> >>>> updating the atomic counter is in practice negligible in this code; we
> >>>> are doing so many other things besides updating the atomic variable. I
> >>>> don't generally optimize unless doing so makes a measurable performance
> >>>> difference.
> >>>>
> >>>>>
> >>>>> CC'ing Dmitry who may have more insight to give.
> >>>>>
> >>>>> -- Sean Silva
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> 2015/09/20 18:32 "Sean Silva" <chisophugis at gmail.com>:
> >>>>>>>
> >>>>>>>> It seems unfortunate for us to need any synchronization at all
> >>>>>>>> just to generate a unique ID. Can we just have a bunch of counters,
> >>>>>>>> one per thread, and initialize them to `ThreadIndex << 50` or
> >>>>>>>> something? (Make sure they are each on their own cache line.)
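> >>>>>>>>
> >>>>>>>> For illustration, something along these lines (MaxThreads and
> >>>>>>>> ThreadIndex are assumed to come from the thread pool; the names
> >>>>>>>> are made up):
> >>>>>>>>
> >>>>>>>>   #include <cstdint>
> >>>>>>>>
> >>>>>>>>   static const unsigned MaxThreads = 64; // assumed worker limit
> >>>>>>>>
> >>>>>>>>   // One ID counter per worker thread; alignas(64) keeps each
> >>>>>>>>   // counter on its own cache line so increments never contend.
> >>>>>>>>   struct alignas(64) IdCounter { uint64_t Next; };
> >>>>>>>>   static IdCounter Counters[MaxThreads];
> >>>>>>>>
> >>>>>>>>   void initIdCounters() {
> >>>>>>>>     for (uint64_t I = 0; I < MaxThreads; ++I)
> >>>>>>>>       Counters[I].Next = I << 50; // disjoint ID range per thread
> >>>>>>>>   }
> >>>>>>>>
> >>>>>>>>   uint64_t nextId(unsigned ThreadIndex) {
> >>>>>>>>     return Counters[ThreadIndex].Next++; // no synchronization
> >>>>>>>>   }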
> >>>>>>>>
> >>>>>>>> -- Sean Silva
> >>>>>>>>
> >>>>>>>> On Sat, Sep 19, 2015 at 6:44 PM, Rui Ueyama via llvm-commits
> >>>>>>>> <llvm-commits at lists.llvm.org> wrote:
> >>>>>>>>>
> >>>>>>>>> Author: ruiu
> >>>>>>>>> Date: Sat Sep 19 20:44:44 2015
> >>>>>>>>> New Revision: 248106
> >>>>>>>>>
> >>>>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=248106&view=rev
> >>>>>>>>> Log:
> >>>>>>>>> COFF: Fix race condition.
> >>>>>>>>>
> >>>>>>>>> NextID is updated inside parallel_for_each, so it needs mutual
> >>>>>>>>> exclusion.
> >>>>>>>>>
> >>>>>>>>> Modified:
> >>>>>>>>>     lld/trunk/COFF/ICF.cpp
> >>>>>>>>>
> >>>>>>>>> Modified: lld/trunk/COFF/ICF.cpp
> >>>>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/COFF/ICF.cpp?rev=248106&r1=248105&r2=248106&view=diff
> >>>>>>>>>
> >>>>>>>>> ==============================================================================
> >>>>>>>>> --- lld/trunk/COFF/ICF.cpp (original)
> >>>>>>>>> +++ lld/trunk/COFF/ICF.cpp Sat Sep 19 20:44:44 2015
> >>>>>>>>> @@ -73,7 +73,7 @@ private:
> >>>>>>>>>    bool forEachGroup(std::vector<SectionChunk *> &Chunks, Comparator Eq);
> >>>>>>>>>    bool partition(ChunkIterator Begin, ChunkIterator End, Comparator Eq);
> >>>>>>>>>
> >>>>>>>>> -  uint64_t NextID = 1;
> >>>>>>>>> +  std::atomic<uint64_t> NextID = { 1 };
> >>>>>>>>>  };
> >>>>>>>>>
> >>>>>>>>>  // Entry point to ICF.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> llvm-commits mailing list
> >>>>>>>>> llvm-commits at lists.llvm.org
> >>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
>
> --
> Dmitry Vyukov, Software Engineer, dvyukov at google.com
>

