[lld] r232460 - [ELF] Use parallel_for_each for writing.

Rui Ueyama ruiu at google.com
Wed Mar 18 09:13:17 PDT 2015


On Wed, Mar 18, 2015 at 9:04 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> In this case, when linking a Release+Asserts clang, what I got was:
>
> master:
>
>        1858.530684      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.02% )
>              1,246      context-switches          #    0.670 K/sec
>                  0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>            191,223      page-faults               #    0.103 M/sec                    ( +-  0.00% )
>      5,579,119,294      cycles                    #    3.002 GHz                      ( +-  0.02% )
>      3,086,413,171      stalled-cycles-frontend   #   55.32% frontend cycles idle     ( +-  0.03% )
>    <not supported>      stalled-cycles-backend
>      6,059,256,591      instructions              #    1.09  insns per cycle
>                                                   #    0.51  stalled cycles per insn  ( +-  0.00% )
>      1,261,645,273      branches                  #  678.840 M/sec                    ( +-  0.00% )
>         26,517,441      branch-misses             #    2.10% of all branches          ( +-  0.00% )
>
>        1.860335083 seconds time elapsed                                               ( +-  0.02% )
>
>
> master with your patch reverted:
>
>
>        1840.225861      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.06% )
>              1,170      context-switches          #    0.636 K/sec
>                  0      cpu-migrations            #    0.000 K/sec                    ( +- 68.82% )
>            191,225      page-faults               #    0.104 M/sec                    ( +-  0.00% )
>      5,532,122,558      cycles                    #    3.006 GHz                      ( +-  0.04% )
>      3,052,067,591      stalled-cycles-frontend   #   55.17% frontend cycles idle     ( +-  0.08% )
>    <not supported>      stalled-cycles-backend
>      6,002,264,641      instructions              #    1.08  insns per cycle
>                                                   #    0.51  stalled cycles per insn  ( +-  0.00% )
>      1,250,316,604      branches                  #  679.436 M/sec                    ( +-  0.00% )
>         26,532,702      branch-misses             #    2.12% of all branches          ( +-  0.00% )
>
>        1.842000792 seconds time elapsed                                               ( +-  0.06% )


It looks to me like the two results are almost the same?


>
> On 18 March 2015 at 11:55, Rafael Espíndola <rafael.espindola at gmail.com>
> wrote:
> > Are you on Linux? What I normally do for benchmarking is
> >
> > * Put all the files on tmpfs
> > * Disable address space randomization:
> >   echo 0 > /proc/sys/kernel/randomize_va_space
> > * Disable cpu frequency scaling:
> >   for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
> >     echo performance > $i; done
> >
> > * Use perf to run it multiple times and schedtool to run it at very
> > high priority:
> >   sudo schedtool -F  -p 99 -a 0x4 -e perf stat -r 20
> >
> >
> > On 17 March 2015 at 18:27, Rui Ueyama <ruiu at google.com> wrote:
> >> Why don't you just run it many more times?
> >>
> >> On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran <shankare at codeaurora.org>
> >> wrote:
> >>>
> >>> Would repeating this same experiment on different Unixes, or linking
> >>> the same object files on Windows, give more information?
> >>>
> >>> How many data points do you usually collect?
> >>>
> >>> Shankar Easwaran
> >>>
> >>>
> >>> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
> >>>>
> >>>> I reformatted your results here. As you can see, the signal-to-noise
> >>>> ratio is too low. Maybe we cannot say anything from only four data points.
> >>>>
> >>>> LLD with patch
> >>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
> >>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
> >>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
> >>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
> >>>>
> >>>> LLD without patch
> >>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
> >>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
> >>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
> >>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
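> >>>>
> >>>> Doing the arithmetic on the elapsed times above: the mean is (3.06 +
> >>>> 2.93 + 3.08 + 2.93) / 4 = 3.00s with the patch, versus (3.32 + 3.22 +
> >>>> 3.14 + 3.08) / 4 = 3.19s without it. The spread within each set of four
> >>>> runs is about 0.1s, so the 0.19s difference is barely above the
> >>>> run-to-run noise.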
> >>>>
> >>>>
> >>>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran
> >>>> <shankare at codeaurora.org>
> >>>> wrote:
> >>>>
> >>>>> Just to make sure, I measured this again with four tries, and I see
> >>>>> results consistent with what I measured before:
> >>>>>
> >>>>> *Raw data below:*
> >>>>>
> >>>>>
> >>>>> LLD Try With Patch #1
> >>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
> >>>>> LLD Try Without Patch #1
> >>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
> >>>>> BFD Try #1
> >>>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #2
> >>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
> >>>>> LLD Try Without Patch #2
> >>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
> >>>>> BFD Try #2
> >>>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #3
> >>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
> >>>>> LLD Try Without Patch #3
> >>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
> >>>>> BFD Try #3
> >>>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #4
> >>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
> >>>>> LLD Try Without Patch #4
> >>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
> >>>>> BFD Try #4
> >>>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>>
> >>>>> *Questions:*
> >>>>>
> >>>>> As Rui mentions, I don't know why the user time is higher without the
> >>>>> patch. Is there any way to verify this? Could this be because of user
> >>>>> threads instead of kernel threads?
> >>>>>
> >>>>> Shankar Easwaran
> >>>>>
> >>>>>
> >>>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
> >>>>>
> >>>>> Yes, this is true. There were several logs of runs in the same file
> >>>>> that I copied into the commit message, and manually removing them
> >>>>> resulted in the two "user" lines. But the results themselves are real.
> >>>>> I can re-measure the time taken, though.
> >>>>>
> >>>>> Shankar Easwaran
> >>>>>
> >>>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
> >>>>>
> >>>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
> >>>>> <shankare at codeaurora.org> wrote:
> >>>>>
> >>>>> Author: shankare
> >>>>> Date: Mon Mar 16 22:29:32 2015
> >>>>> New Revision: 232460
> >>>>>
> >>>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
> >>>>> Log:
> >>>>> [ELF] Use parallel_for_each for writing.
> >>>>>
> >>>>> This change improves the performance of lld when self-hosting lld,
> >>>>> compared with the BFD linker. The BFD linker takes 8 seconds elapsed
> >>>>> time on average; lld takes 3 seconds elapsed time on average. Without
> >>>>> this change, lld takes ~5 seconds on average. The runtime comparisons
> >>>>> were done on a release build and measured by running the link three
> >>>>> times.
> >>>>>
> >>>>> lld self-host without the change
> >>>>> ----------------------------------
> >>>>> real    0m3.196s
> >>>>> user    0m4.580s
> >>>>> sys     0m0.832s
> >>>>>
> >>>>> lld self-host with lld
> >>>>> -----------------------
> >>>>> user    0m3.024s
> >>>>> user    0m3.252s
> >>>>> sys     0m0.796s
> >>>>>
> >>>>> The above results don't look like real output of the "time" command.
> >>>>>
> >>>>> If it's real, it's too good to be true, assuming the first line of the
> >>>>> second result is "real" instead of "user".
> >>>>>
> >>>>> "real" is wall clock time from process start to process exit. "user"
> is
> >>>>> CPU
> >>>>> time consumed by the process in user mode (if a process is
> >>>>> multi-threaded,
> >>>>> it can be larger than real).
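> >>>>>
> >>>>> To make that concrete, here is a minimal standalone sketch (my own
> >>>>> illustration, not lld code) that burns a fixed total amount of CPU on
> >>>>> N threads. Under "time", its user time stays roughly constant no
> >>>>> matter what N is, while its real time shrinks as N grows:
> >>>>>
> >>>>> // real-vs-user demo (illustration only, not lld code): the same total
> >>>>> // work split across N threads. "user" stays roughly constant; "real"
> >>>>> // shrinks, so user > real is the expected signature of parallel code.
> >>>>> #include <atomic>
> >>>>> #include <thread>
> >>>>> #include <vector>
> >>>>>
> >>>>> std::atomic<unsigned long long> sink{0};
> >>>>>
> >>>>> static void burn(unsigned long long iters) {
> >>>>>   unsigned long long acc = 0;
> >>>>>   for (unsigned long long i = 0; i < iters; ++i)
> >>>>>     acc = acc * 6364136223846793005ULL + i; // defeats closed-form optimization
> >>>>>   sink += acc; // publish the result so the loop is not optimized away
> >>>>> }
> >>>>>
> >>>>> int main() {
> >>>>>   const unsigned nthreads = 4; // compare "time ./demo" with 1 vs. 4
> >>>>>   const unsigned long long total = 4000000000ULL;
> >>>>>   std::vector<std::thread> workers;
> >>>>>   for (unsigned i = 0; i < nthreads; ++i)
> >>>>>     workers.emplace_back(burn, total / nthreads);
> >>>>>   for (std::thread &t : workers)
> >>>>>     t.join();
> >>>>> }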
> >>>>>
> >>>>> Your result shows a significant improvement in user time, which would
> >>>>> mean you significantly reduced the amount of processing needed to do
> >>>>> the same work. However, because this change doesn't change any
> >>>>> algorithm, but just executes the same work in parallel, that shouldn't
> >>>>> happen.
> >>>>>
> >>>>> Something's not correct.
> >>>>>
> >>>>> I appreciate your effort to make LLD faster, but we need to be careful
> >>>>> about benchmark results. If we don't measure improvements accurately,
> >>>>> it's easy to make an "optimization" that actually makes things slower.
> >>>>>
> >>>>> Another important thing is to be skeptical of your own work when you
> >>>>> optimize something and measure its effect. It sometimes happens that I
> >>>>> am 100% sure something is going to improve performance, but it actually
> >>>>> doesn't.
> >>>>>
> >>>>> time taken to build lld with bfd
> >>>>> --------------------------------
> >>>>> real    0m8.419s
> >>>>> user    0m7.748s
> >>>>> sys     0m0.632s
> >>>>>
> >>>>> Modified:
> >>>>>       lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
> >>>>>       lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
> >>>>>
> >>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
> >>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
> >>>>> ==============================================================================
> >>>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
> >>>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
> >>>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
> >>>>>    _elfHeader->write(this, _layout, *buffer);
> >>>>>    _programHeader->write(this, _layout, *buffer);
> >>>>>
> >>>>> -  for (auto section : _layout.sections())
> >>>>> -    section->write(this, _layout, *buffer);
> >>>>> +  auto sections = _layout.sections();
> >>>>> +  parallel_for_each(
> >>>>> +      sections.begin(), sections.end(),
> >>>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
> >>>>>    writeTask.end();
> >>>>>
> >>>>>    ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
> >>>>>
> >>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
> >>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
> >>>>> ==============================================================================
> >>>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
> >>>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
> >>>>> @@ -234,17 +234,17 @@ public:
> >>>>>    /// routine gets called after the linker fixes up the virtual address
> >>>>>    /// of the section
> >>>>>    virtual void assignVirtualAddress(uint64_t addr) override {
> >>>>> -    for (auto &ai : _atoms) {
> >>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
> >>>>>        ai->_virtualAddr = addr + ai->_fileOffset;
> >>>>> -    }
> >>>>> +    });
> >>>>>    }
> >>>>>
> >>>>>    /// \brief Set the file offset of each Atom in the section. This routine
> >>>>>    /// gets called after the linker fixes up the section offset
> >>>>>    void assignFileOffsets(uint64_t offset) override {
> >>>>> -    for (auto &ai : _atoms) {
> >>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
> >>>>>        ai->_fileOffset = offset + ai->_fileOffset;
> >>>>> -    }
> >>>>> +    });
> >>>>>    }
> >>>>>
> >>>>>    /// \brief Find the Atom address given a name, this is needed to properly
> >>>>>
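> >>>>> For readers who haven't seen it, the shape of parallel_for_each as
> >>>>> used above is roughly the following. This is only an illustrative
> >>>>> sketch of the pattern, not lld's actual implementation: split the
> >>>>> range into one chunk per hardware thread and run each chunk on its
> >>>>> own thread.
> >>>>>
> >>>>> #include <algorithm>
> >>>>> #include <iterator>
> >>>>> #include <thread>
> >>>>> #include <vector>
> >>>>>
> >>>>> // Illustrative sketch only; lld's real parallel_for_each differs.
> >>>>> template <class Iter, class Func>
> >>>>> void parallel_for_each(Iter begin, Iter end, Func fn) {
> >>>>>   unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
> >>>>>   size_t total = std::distance(begin, end);
> >>>>>   size_t chunk = (total + nthreads - 1) / nthreads; // ceil(total / nthreads)
> >>>>>   std::vector<std::thread> workers;
> >>>>>   while (begin != end) {
> >>>>>     size_t n = std::min(chunk, (size_t)std::distance(begin, end));
> >>>>>     Iter next = std::next(begin, n);
> >>>>>     workers.emplace_back([=] { std::for_each(begin, next, fn); });
> >>>>>     begin = next;
> >>>>>   }
> >>>>>   for (std::thread &t : workers)
> >>>>>     t.join();
> >>>>> }
> >>>>>
> >>>>> Note that for very small loop bodies, such as the one-line assignments
> >>>>> in assignVirtualAddress above, the cost of spawning and joining threads
> >>>>> can exceed the work itself, which is one more reason careful
> >>>>> measurement matters here.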
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>>>> hosted by the Linux Foundation
> >>>>>
> >>>>>
> >>>
> >>>
> >>> --
> >>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>> hosted by the Linux Foundation
> >>>
> >>
> >>
> >>
>