[lld] r232460 - [ELF] Use parallel_for_each for writing.
Shankar Easwaran
shankare at codeaurora.org
Wed Mar 18 09:14:11 PDT 2015
Does this repeat with the same numbers across similar tries?
On 3/18/2015 11:04 AM, Rafael Espíndola wrote:
> In this case, when linking a Release+Asserts clang what I got was
>
> master:
>
>      1858.530684 task-clock (msec)       #  0.999 CPUs utilized            ( +-  0.02% )
>            1,246 context-switches        #  0.670 K/sec
>                0 cpu-migrations          #  0.000 K/sec                    ( +-100.00% )
>          191,223 page-faults             #  0.103 M/sec                    ( +-  0.00% )
>    5,579,119,294 cycles                  #  3.002 GHz                      ( +-  0.02% )
>    3,086,413,171 stalled-cycles-frontend # 55.32% frontend cycles idle     ( +-  0.03% )
>  <not supported> stalled-cycles-backend
>    6,059,256,591 instructions            #  1.09  insns per cycle
>                                          #  0.51  stalled cycles per insn  ( +-  0.00% )
>    1,261,645,273 branches                # 678.840 M/sec                   ( +-  0.00% )
>       26,517,441 branch-misses           #  2.10% of all branches          ( +-  0.00% )
>
>      1.860335083 seconds time elapsed                                      ( +-  0.02% )
>
>
> master with your patch reverted:
>
>
>      1840.225861 task-clock (msec)       #  0.999 CPUs utilized            ( +-  0.06% )
>            1,170 context-switches        #  0.636 K/sec
>                0 cpu-migrations          #  0.000 K/sec                    ( +- 68.82% )
>          191,225 page-faults             #  0.104 M/sec                    ( +-  0.00% )
>    5,532,122,558 cycles                  #  3.006 GHz                      ( +-  0.04% )
>    3,052,067,591 stalled-cycles-frontend # 55.17% frontend cycles idle     ( +-  0.08% )
>  <not supported> stalled-cycles-backend
>    6,002,264,641 instructions            #  1.08  insns per cycle
>                                          #  0.51  stalled cycles per insn  ( +-  0.00% )
>    1,250,316,604 branches                # 679.436 M/sec                   ( +-  0.00% )
>       26,532,702 branch-misses           #  2.12% of all branches          ( +-  0.00% )
>
>      1.842000792 seconds time elapsed                                      ( +-  0.06% )
>
>
> On 18 March 2015 at 11:55, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
>> Are you on Linux? What I normally do for benchmarking is
>>
>> * Put all the files on tmpfs
>> * Disable address space randomization:
>> echo 0 > /proc/sys/kernel/randomize_va_space
>> * Disable cpu frequency scaling
>> for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>> echo performance > $i; done
>>
>> * Use perf to run it multiple times and schedtool to run it at very
>> high priority:
>> sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 20
>>
>>
>> On 17 March 2015 at 18:27, Rui Ueyama <ruiu at google.com> wrote:
>>> Why don't you just run it many more times?
>>>
>>> On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran <shankare at codeaurora.org>
>>> wrote:
>>>> Not sure whether repeating this experiment on different Unixes, or
>>>> linking the same object files on Windows, would give more information.
>>>>
>>>> How many data points do you usually collect?
>>>>
>>>> Shankar Easwaran
>>>>
>>>>
>>>> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
>>>>> I have reformatted your results here. As you can see, the signal-to-noise
>>>>> ratio is too low; we probably cannot conclude anything from only four data
>>>>> points.
>>>>>
>>>>> LLD with patch
>>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata
>>>>> 7174160maxresident)k
>>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata
>>>>> 7175808maxresident)k
>>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata
>>>>> 7176320maxresident)k
>>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata
>>>>> 7175120maxresident)k
>>>>>
>>>>> LLD without patch
>>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata
>>>>> 7179984maxresident)k
>>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata
>>>>> 7172704maxresident)k
>>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata
>>>>> 7175600maxresident)k
>>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata
>>>>> 7174864maxresident)k
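
For reference, here is a small standalone C++ sketch (not lld code, and not from
the thread) that summarizes the spread in the four elapsed times quoted above;
the hard-coded values are simply the elapsed figures from the runs.

    // Standalone sketch: mean and sample standard deviation of the four
    // elapsed times quoted above, to make the run-to-run noise explicit.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    static void summarize(const char *Label, const std::vector<double> &Samples) {
      double Sum = 0;
      for (double S : Samples)
        Sum += S;
      double Mean = Sum / Samples.size();
      double Var = 0;
      for (double S : Samples)
        Var += (S - Mean) * (S - Mean);
      Var /= Samples.size() - 1;  // sample variance (n - 1)
      std::printf("%-14s mean %.3f s, stddev %.3f s\n", Label, Mean, std::sqrt(Var));
    }

    int main() {
      summarize("with patch",    {3.06, 2.93, 3.08, 2.93});  // elapsed seconds
      summarize("without patch", {3.32, 3.22, 3.14, 3.08});
      return 0;
    }

With only four runs per configuration, these summary numbers make it easier to
judge how the run-to-run spread compares with the gap between the two
configurations.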
>>>>>
>>>>>
>>>>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran
>>>>> <shankare at codeaurora.org>
>>>>> wrote:
>>>>>
>>>>>> To be sure, I measured this again with four tries, and several of the
>>>>>> results are identical to what I measured before:
>>>>>>
>>>>>> *Raw data below:*
>>>>>>
>>>>>>
>>>>>> LLD Try With Patch #1
>>>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata
>>>>>> 7174160maxresident)k
>>>>>> LLD Try Without Patch #1
>>>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata
>>>>>> 7179984maxresident)k
>>>>>> BFD Try #1
>>>>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata
>>>>>> 3230416maxresident)k
>>>>>> LLD Try With Patch #2
>>>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata
>>>>>> 7175808maxresident)k
>>>>>> LLD Try Without Patch #2
>>>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata
>>>>>> 7172704maxresident)k
>>>>>> BFD Try #2
>>>>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata
>>>>>> 3230416maxresident)k
>>>>>> LLD Try With Patch #3
>>>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata
>>>>>> 7176320maxresident)k
>>>>>> LLD Try Without Patch #3
>>>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata
>>>>>> 7175600maxresident)k
>>>>>> BFD Try #3
>>>>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata
>>>>>> 3230416maxresident)k
>>>>>> LLD Try With Patch #4
>>>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata
>>>>>> 7175120maxresident)k
>>>>>> LLD Try Without Patch #4
>>>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata
>>>>>> 7174864maxresident)k
>>>>>> BFD Try #4
>>>>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata
>>>>>> 3230416maxresident)k
>>>>>>
>>>>>> *Questions:*
>>>>>>
>>>>>> As Rui mentions, I don't know why the user time is higher without the
>>>>>> patch; are there any methods to verify this? Could it be because of
>>>>>> user-level threads instead of kernel threads?
>>>>>>
>>>>>> Shankar Easwaran
>>>>>>
>>>>>>
>>>>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
>>>>>>
>>>>>> Yes, that's true. The file I read into the commit message contained logs
>>>>>> from several runs, and manually removing them left two "user" lines.
>>>>>>
>>>>>> The results themselves still hold, but I can re-measure the time taken.
>>>>>>
>>>>>> Shankar Easwaran
>>>>>>
>>>>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
>>>>>>
>>>>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
>>>>>> <shankare at codeaurora.org> wrote:
>>>>>>
>>>>>> Author: shankare
>>>>>> Date: Mon Mar 16 22:29:32 2015
>>>>>> New Revision: 232460
>>>>>>
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
>>>>>> Log:
>>>>>> [ELF] Use parallel_for_each for writing.
>>>>>>
>>>>>> This change improves the performance of lld when self-hosting, compared
>>>>>> with the BFD linker. The BFD linker takes about 8 seconds of elapsed time
>>>>>> on average; lld takes about 3 seconds on average with this change, and
>>>>>> about 5 seconds without it. The runtime comparisons were done on a
>>>>>> release build and measured by running the link three times.
>>>>>>
>>>>>> lld self-host without the change
>>>>>> ----------------------------------
>>>>>> real 0m3.196s
>>>>>> user 0m4.580s
>>>>>> sys 0m0.832s
>>>>>>
>>>>>> lld self-host with lld
>>>>>> -----------------------
>>>>>> user 0m3.024s
>>>>>> user 0m3.252s
>>>>>> sys 0m0.796s
>>>>>>
>>>>>> The above results don't look like real output from the "time" command.
>>>>>>
>>>>>> If they are real, they are too good to be true, assuming the first line
>>>>>> of the second result should read "real" instead of "user".
>>>>>>
>>>>>> "real" is wall clock time from process start to process exit. "user" is
>>>>>> CPU
>>>>>> time consumed by the process in user mode (if a process is
>>>>>> multi-threaded,
>>>>>> it can be larger than real).
>>>>>>
>>>>>> Your result shows a significant improvement in user time, which would
>>>>>> mean you significantly reduced the amount of CPU work needed to do the
>>>>>> same thing. However, this change doesn't alter the algorithm; it only
>>>>>> runs the same work in parallel, so that shouldn't happen.
>>>>>>
>>>>>> Something's not correct.
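
To illustrate the point about "real" versus "user", here is a standalone C++
sketch (not lld code): the same fixed amount of work is run once on one thread
and once split across four threads. The parallel phase shortens wall-clock time
but should contribute roughly the same user CPU time, which is why a pure
parallelization is not expected to shrink "user".

    // Standalone sketch: fixed amount of work, first serial, then split across
    // four threads. Wall-clock time for the parallel phase drops, but the
    // total user CPU time reported by `time` stays roughly the same.
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #include <vector>

    static volatile uint64_t Sink;

    static void burn(uint64_t Iters) {
      uint64_t X = 0;
      for (uint64_t I = 0; I < Iters; ++I)
        X += I * I;
      Sink = X;  // keep the loop from being optimized away
    }

    int main() {
      const uint64_t Total = 400000000ULL;
      const unsigned NumThreads = 4;

      auto T0 = std::chrono::steady_clock::now();
      burn(Total);  // serial: all work on the main thread
      auto T1 = std::chrono::steady_clock::now();

      std::vector<std::thread> Workers;
      for (unsigned I = 0; I < NumThreads; ++I)
        Workers.emplace_back(burn, Total / NumThreads);  // same total work
      for (auto &W : Workers)
        W.join();
      auto T2 = std::chrono::steady_clock::now();

      using MS = std::chrono::milliseconds;
      std::printf("serial   wall time: %lld ms\n",
                  (long long)std::chrono::duration_cast<MS>(T1 - T0).count());
      std::printf("parallel wall time: %lld ms\n",
                  (long long)std::chrono::duration_cast<MS>(T2 - T1).count());
      return 0;
    }

Running it as `time ./a.out` on a machine with at least four cores should show
the parallel phase finishing in roughly a quarter of the serial wall time, while
the process's total "user" time covers both phases about equally.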
>>>>>>
>>>>>> I appreciate your effort to make LLD faster, but we need to be careful
>>>>>> about benchmark results. If we don't measure improvements accurately,
>>>>>> it's
>>>>>> easy to make an "optimization" that makes things slower.
>>>>>>
>>>>>> Another important thing is to be skeptical of your own work when you
>>>>>> optimize something and measure its effect. It sometimes happens that I am
>>>>>> completely sure something is going to improve performance, and it turns
>>>>>> out not to.
>>>>>>
>>>>>> time taken to build lld with bfd
>>>>>> --------------------------------
>>>>>> real 0m8.419s
>>>>>> user 0m7.748s
>>>>>> sys 0m0.632s
>>>>>>
>>>>>> Modified:
>>>>>> lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>>> lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>>>
>>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>>> ==============================================================================
>>>>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
>>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
>>>>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
>>>>>> _elfHeader->write(this, _layout, *buffer);
>>>>>> _programHeader->write(this, _layout, *buffer);
>>>>>>
>>>>>> -  for (auto section : _layout.sections())
>>>>>> -    section->write(this, _layout, *buffer);
>>>>>> +  auto sections = _layout.sections();
>>>>>> +  parallel_for_each(
>>>>>> +      sections.begin(), sections.end(),
>>>>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
>>>>>>    writeTask.end();
>>>>>>
>>>>>>    ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
>>>>>>
>>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>>> ==============================================================================
>>>>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
>>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
>>>>>> @@ -234,17 +234,17 @@ public:
>>>>>>    /// routine gets called after the linker fixes up the virtual address
>>>>>>    /// of the section
>>>>>>    virtual void assignVirtualAddress(uint64_t addr) override {
>>>>>> -    for (auto &ai : _atoms) {
>>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>>        ai->_virtualAddr = addr + ai->_fileOffset;
>>>>>> -    }
>>>>>> +    });
>>>>>>    }
>>>>>>
>>>>>>    /// \brief Set the file offset of each Atom in the section. This routine
>>>>>>    /// gets called after the linker fixes up the section offset
>>>>>>    void assignFileOffsets(uint64_t offset) override {
>>>>>> -    for (auto &ai : _atoms) {
>>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>>        ai->_fileOffset = offset + ai->_fileOffset;
>>>>>> -    }
>>>>>> +    });
>>>>>>    }
>>>>>>
>>>>>>    /// \brief Find the Atom address given a name, this is needed to properly
>>>>>>
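
The parallel_for_each used in the patch is lld's own helper and is not shown in
this diff. As a rough illustration only, and assuming nothing about lld's actual
implementation, here is a naive stand-in with the same shape, built on plain
std::thread:

    // Illustrative stand-in for a parallel_for_each, NOT lld's implementation:
    // splits [Begin, End) into contiguous chunks and runs one std::thread per
    // chunk. It only makes sense when the calls on different elements are
    // independent, as with writing distinct output sections above.
    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <thread>
    #include <vector>

    template <class RandomIt, class Func>
    void naiveParallelForEach(RandomIt Begin, RandomIt End, Func F) {
      std::size_t Len = static_cast<std::size_t>(std::distance(Begin, End));
      unsigned NumThreads = std::max(1u, std::thread::hardware_concurrency());
      if (Len < 2 || NumThreads == 1) {  // nothing worth parallelizing
        std::for_each(Begin, End, F);
        return;
      }
      std::size_t Chunk = (Len + NumThreads - 1) / NumThreads;
      std::vector<std::thread> Workers;
      auto It = Begin;
      while (It != End) {
        std::size_t Left = static_cast<std::size_t>(std::distance(It, End));
        auto Last = It + static_cast<std::ptrdiff_t>(std::min(Chunk, Left));
        Workers.emplace_back([=] { std::for_each(It, Last, F); });
        It = Last;
      }
      for (auto &W : Workers)
        W.join();
    }

A call would mirror the patched loop, e.g.
naiveParallelForEach(sections.begin(), sections.end(),
[&](Chunk<ELFT> *S) { S->write(this, _layout, *buffer); });
the key assumption is that different sections write to disjoint regions of the
output buffer, so the per-section calls are independent.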
>>>>>>
>>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation