[lld] r232460 - [ELF] Use parallel_for_each for writing.
Rui Ueyama
ruiu at google.com
Wed Mar 18 09:13:17 PDT 2015
On Wed, Mar 18, 2015 at 9:04 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> In this case, when linking a Release+Asserts clang what I got was
>
> master:
>
>       1858.530684  task-clock (msec)        #  0.999 CPUs utilized            ( +-   0.02% )
>             1,246  context-switches         #  0.670 K/sec
>                 0  cpu-migrations           #  0.000 K/sec                    ( +- 100.00% )
>           191,223  page-faults              #  0.103 M/sec                    ( +-   0.00% )
>     5,579,119,294  cycles                   #  3.002 GHz                      ( +-   0.02% )
>     3,086,413,171  stalled-cycles-frontend  # 55.32% frontend cycles idle     ( +-   0.03% )
>   <not supported>  stalled-cycles-backend
>     6,059,256,591  instructions             #  1.09  insns per cycle
>                                             #  0.51  stalled cycles per insn  ( +-   0.00% )
>     1,261,645,273  branches                 # 678.840 M/sec                   ( +-   0.00% )
>        26,517,441  branch-misses            #  2.10% of all branches          ( +-   0.00% )
>
>       1.860335083 seconds time elapsed                                        ( +-   0.02% )
>
>
> master with your patch reverted:
>
>
>       1840.225861  task-clock (msec)        #  0.999 CPUs utilized            ( +-   0.06% )
>             1,170  context-switches         #  0.636 K/sec
>                 0  cpu-migrations           #  0.000 K/sec                    ( +-  68.82% )
>           191,225  page-faults              #  0.104 M/sec                    ( +-   0.00% )
>     5,532,122,558  cycles                   #  3.006 GHz                      ( +-   0.04% )
>     3,052,067,591  stalled-cycles-frontend  # 55.17% frontend cycles idle     ( +-   0.08% )
>   <not supported>  stalled-cycles-backend
>     6,002,264,641  instructions             #  1.08  insns per cycle
>                                             #  0.51  stalled cycles per insn  ( +-   0.00% )
>     1,250,316,604  branches                 # 679.436 M/sec                   ( +-   0.00% )
>        26,532,702  branch-misses            #  2.12% of all branches          ( +-   0.00% )
>
>       1.842000792 seconds time elapsed                                        ( +-   0.06% )
It looks to me that the results of the two runs are almost the same? The elapsed times differ by only about 1% (1.860s vs. 1.842s).
>
> On 18 March 2015 at 11:55, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
> > Are you on Linux? What I normally do for benchmarking is
> >
> > * Put all the files on tmpfs.
> > * Disable address space randomization:
> >     echo 0 > /proc/sys/kernel/randomize_va_space
> > * Disable CPU frequency scaling:
> >     for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
> >       echo performance > $i; done
> > * Use perf to run it multiple times, and schedtool to run it at very
> >   high priority:
> >     sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 20
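> >
> > (For context: schedtool -F -p 99 requests the SCHED_FIFO real-time
> > policy at the highest priority, -a 0x4 pins the run to a single core
> > (CPU 2), and -e executes the command that follows, i.e. the link
> > being measured; perf stat -r 20 then repeats that command 20 times
> > and reports the mean of each counter along with its run-to-run
> > variation, which is where the "( +- x% )" columns above come from.)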
> >
> >
> > On 17 March 2015 at 18:27, Rui Ueyama <ruiu at google.com> wrote:
> >> Why don't you just run it many more times?
> >>
> >> On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran
> >> <shankare at codeaurora.org> wrote:
> >>>
> >>> Not sure whether repeating this same experiment on different
> >>> Unixes, or linking the same object files on Windows, would give
> >>> more information?
> >>>
> >>> How many data points do you usually collect?
> >>>
> >>> Shankar Easwaran
> >>>
> >>>
> >>> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
> >>>>
> >>>> I reformatted your results here. As you can see, the S/N is too
> >>>> low; maybe we cannot conclude anything from only four data points.
> >>>>
> >>>> LLD with patch
> >>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
> >>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
> >>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
> >>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
> >>>>
> >>>> LLD without patch
> >>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
> >>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
> >>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
> >>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
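> >>>>
> >>>> Back-of-the-envelope from the elapsed times: roughly 3.00s on
> >>>> average with the patch (2.93-3.08) versus 3.19s without it
> >>>> (3.08-3.32), with run-to-run spread on the order of 0.1s, so four
> >>>> samples per configuration is not much signal to work with.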
> >>>>
> >>>>
> >>>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran
> >>>> <shankare at codeaurora.org> wrote:
> >>>>
> >>>>> I tried measuring this again with 4 tries, just to make sure, and
> >>>>> I see results consistent with what I measured before:
> >>>>>
> >>>>> *Raw data below:*
> >>>>>
> >>>>>
> >>>>> LLD Try With Patch #1
> >>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
> >>>>> LLD Try Without Patch #1
> >>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
> >>>>> BFD Try #1
> >>>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #2
> >>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
> >>>>> LLD Try Without Patch #2
> >>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
> >>>>> BFD Try #2
> >>>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #3
> >>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
> >>>>> LLD Try Without Patch #3
> >>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
> >>>>> BFD Try #3
> >>>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>> LLD Try With Patch #4
> >>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
> >>>>> LLD Try Without Patch #4
> >>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
> >>>>> BFD Try #4
> >>>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
> >>>>>
> >>>>> *Questions:*
> >>>>>
> >>>>> As Rui mentions, I don't know why the user time is higher without
> >>>>> the patch; any methods to verify this?
> >>>>> Could this be because of user threads instead of kernel threads?
> >>>>>
> >>>>> Shankar Easwaran
> >>>>>
> >>>>>
> >>>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
> >>>>>
> >>>>> Yes, this is true. There were several logs of runs in the same
> >>>>> file that I pasted into the commit message, and manually trimming
> >>>>> them left two "user" lines.
> >>>>>
> >>>>> But the result itself is nonetheless real. I can re-measure the
> >>>>> time taken, though.
> >>>>>
> >>>>> Shankar Easwaran
> >>>>>
> >>>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
> >>>>>
> >>>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
> >>>>> <shankare at codeaurora.org> wrote:
> >>>>>
> >>>>> Author: shankare
> >>>>> Date: Mon Mar 16 22:29:32 2015
> >>>>> New Revision: 232460
> >>>>>
> >>>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
> >>>>> Log:
> >>>>> [ELF] Use parallel_for_each for writing.
> >>>>>
> >>>>> This change improves the performance of lld, when self-hosting
> >>>>> lld, compared with the BFD linker. The BFD linker takes 8 seconds
> >>>>> elapsed time on average; lld takes 3 seconds on average. Without
> >>>>> this change, lld takes ~5 seconds on average. The runtime
> >>>>> comparisons were done on a release build and measured by running
> >>>>> the link three times.
> >>>>>
> >>>>> lld self-host without the change
> >>>>> ----------------------------------
> >>>>> real 0m3.196s
> >>>>> user 0m4.580s
> >>>>> sys 0m0.832s
> >>>>>
> >>>>> lld self-host with lld
> >>>>> -----------------------
> >>>>> user 0m3.024s
> >>>>> user 0m3.252s
> >>>>> sys 0m0.796s
> >>>>>
> >>>>> The above results don't look like real output of the "time"
> >>>>> command.
> >>>>>
> >>>>> If it's real, it's too good to be true, assuming the first line of
> >>>>> the second result should be "real" instead of "user".
> >>>>>
> >>>>> "real" is wall clock time from process start to process exit. "user"
> is
> >>>>> CPU
> >>>>> time consumed by the process in user mode (if a process is
> >>>>> multi-threaded,
> >>>>> it can be larger than real).
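> >>>>>
> >>>>> As a concrete check against your own numbers: in the first "with
> >>>>> patch" run, (4.16 user + 0.80 sys) / 3.06 elapsed is about 1.62,
> >>>>> which is exactly the 162% CPU that time(1) reports, i.e. roughly
> >>>>> 1.6 cores busy on average.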
> >>>>>
> >>>>> Your result shows a significant improvement in user time, which
> >>>>> would mean you significantly reduced the amount of processing
> >>>>> needed to do the same work. However, because this change doesn't
> >>>>> change the algorithm, but only executes it in parallel, that
> >>>>> couldn't happen.
> >>>>>
> >>>>> Something's not correct.
> >>>>>
> >>>>> I appreciate your effort to make LLD faster, but we need to be
> >>>>> careful about benchmark results. If we don't measure improvements
> >>>>> accurately, it's easy to make an "optimization" that actually
> >>>>> makes things slower.
> >>>>>
> >>>>> Another important thing is to be skeptical of your own work when
> >>>>> you optimize something and measure its effect. It sometimes
> >>>>> happens that I'm completely sure something is going to improve
> >>>>> performance, but it actually doesn't.
> >>>>>
> >>>>> time taken to build lld with bfd
> >>>>> --------------------------------
> >>>>> real 0m8.419s
> >>>>> user 0m7.748s
> >>>>> sys 0m0.632s
> >>>>>
> >>>>> Modified:
> >>>>> lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
> >>>>> lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
> >>>>>
> >>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
> >>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
> >>>>> ==============================================================================
> >>>>>
> >>>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
> >>>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
> >>>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
> >>>>>    _elfHeader->write(this, _layout, *buffer);
> >>>>>    _programHeader->write(this, _layout, *buffer);
> >>>>>
> >>>>> -  for (auto section : _layout.sections())
> >>>>> -    section->write(this, _layout, *buffer);
> >>>>> +  auto sections = _layout.sections();
> >>>>> +  parallel_for_each(
> >>>>> +      sections.begin(), sections.end(),
> >>>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
> >>>>>    writeTask.end();
> >>>>>
> >>>>>    ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
> >>>>>
> >>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
> >>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
> >>>>> ==============================================================================
> >>>>>
> >>>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
> >>>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
> >>>>> @@ -234,17 +234,17 @@ public:
> >>>>>    /// routine gets called after the linker fixes up the virtual address
> >>>>>    /// of the section
> >>>>>    virtual void assignVirtualAddress(uint64_t addr) override {
> >>>>> -    for (auto &ai : _atoms) {
> >>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
> >>>>>        ai->_virtualAddr = addr + ai->_fileOffset;
> >>>>> -    }
> >>>>> +    });
> >>>>>    }
> >>>>>
> >>>>>    /// \brief Set the file offset of each Atom in the section. This routine
> >>>>>    /// gets called after the linker fixes up the section offset
> >>>>>    void assignFileOffsets(uint64_t offset) override {
> >>>>> -    for (auto &ai : _atoms) {
> >>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
> >>>>>        ai->_fileOffset = offset + ai->_fileOffset;
> >>>>> -    }
> >>>>> +    });
> >>>>>    }
> >>>>>
> >>>>>    /// \brief Find the Atom address given a name, this is needed to properly