[lld] r232460 - [ELF] Use parallel_for_each for writing.

Shankar Easwaran shankare at codeaurora.org
Wed Mar 18 09:04:41 PDT 2015


Wow, this is nice to know. Thanks for sharing the recipe. I will use 
this henceforth.

On 3/18/2015 10:55 AM, Rafael Espíndola wrote:
> Are you on Linux? What I normally do for benchmarking is:
>
> * Put all the files on tmpfs.
> * Disable address space randomization:
>     echo 0 > /proc/sys/kernel/randomize_va_space
> * Disable CPU frequency scaling:
>     for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>       echo performance > $i
>     done
> * Use perf to run it multiple times and schedtool to run it at very
>   high priority:
>     sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 20
>
>
> On 17 March 2015 at 18:27, Rui Ueyama <ruiu at google.com> wrote:
>> Why don't you just run it many more times?
>>
>> On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran <shankare at codeaurora.org>
>> wrote:
>>> Not sure whether running this same experiment on different Unixes, or
>>> linking the same object files on Windows, would give more information?
>>>
>>> How many data points do you usually collect?
>>>
>>> Shankar Easwaran
>>>
>>>
>>> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
>>>> I reformatted your results here. As you can see, the signal-to-noise
>>>> ratio is too low. Maybe we cannot say anything from only four data
>>>> points.
>>>>
>>>> LLD with patch
>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
>>>>
>>>> LLD without patch
>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
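>>>>
>>>> (Working it out: the mean elapsed time is about 3.00s with the patch
>>>> and 3.19s without, a difference of roughly 0.19s, while the spread
>>>> within each group is 0.15-0.24s, so the difference is comparable to
>>>> the noise.)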
>>>>
>>>>
>>>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran
>>>> <shankare at codeaurora.org> wrote:
>>>>
>>>>> I measured this again with four tries, just to make sure, and a few
>>>>> of the results are identical to what I measured before:
>>>>>
>>>>> Raw data below:
>>>>>
>>>>>
>>>>> LLD Try With Patch #1
>>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
>>>>> LLD Try Without Patch #1
>>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
>>>>> BFD Try #1
>>>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>> LLD Try With Patch #2
>>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
>>>>> LLD Try Without Patch #2
>>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
>>>>> BFD Try #2
>>>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>> LLD Try With Patch #3
>>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
>>>>> LLD Try Without Patch #3
>>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
>>>>> BFD Try #3
>>>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>> LLD Try With Patch #4
>>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
>>>>> LLD Try Without Patch #4
>>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
>>>>> BFD Try #4
>>>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>>
>>>>> Questions:
>>>>>
>>>>> As Rui mentions, I don't know why the user time is higher without
>>>>> the patch. Are there any methods to verify this? Could this be
>>>>> because of user threads instead of kernel threads?
>>>>>
>>>>> Shankar Easwaran
>>>>>
>>>>>
>>>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
>>>>>
>>>>> Yes, this is true. There were logs from several runs in the same
>>>>> file that I pasted into the commit message, and manually removing
>>>>> them resulted in the two "user" lines.
>>>>>
>>>>> But the result itself is still true. I can re-measure the time
>>>>> taken, though.
>>>>>
>>>>> Shankar Easwaran
>>>>>
>>>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
>>>>>
>>>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
>>>>> <shankare at codeaurora.org> wrote:
>>>>>
>>>>> Author: shankare
>>>>> Date: Mon Mar 16 22:29:32 2015
>>>>> New Revision: 232460
>>>>>
>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
>>>>> Log:
>>>>> [ELF] Use parallel_for_each for writing.
>>>>>
>>>>> This change improves the performance of lld when self-hosting lld,
>>>>> compared with the BFD linker. The BFD linker takes 8 seconds of
>>>>> elapsed time on average; lld takes 3 seconds of elapsed time on
>>>>> average. Without this change, lld takes ~5 seconds on average. The
>>>>> runtime comparisons were done on a release build and measured by
>>>>> running the link three times.
>>>>>
>>>>> lld self-host without the change
>>>>> ----------------------------------
>>>>> real    0m3.196s
>>>>> user    0m4.580s
>>>>> sys     0m0.832s
>>>>>
>>>>> lld self-host with the change
>>>>> -----------------------------
>>>>> user    0m3.024s
>>>>> user    0m3.252s
>>>>> sys     0m0.796s
>>>>>
>>>>> The above results don't look like real output of the "time" command.
>>>>>
>>>>> If it's real, it's too good to be true, assuming the first line of
>>>>> the second result is "real" instead of "user".
>>>>>
>>>>> "real" is wall clock time from process start to process exit. "user" is
>>>>> CPU
>>>>> time consumed by the process in user mode (if a process is
>>>>> multi-threaded,
>>>>> it can be larger than real).
>>>>>
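>>>>> To illustrate with a standalone sketch (this is not lld code; the
>>>>> thread count and workload are made up):
>>>>>
>>>>>   // busy.cpp: the same CPU-bound work spread over four threads.
>>>>>   // Under "time", real drops to roughly 1/4 of a single-threaded
>>>>>   // run, while user stays about the same: parallelism spreads the
>>>>>   // work out, it doesn't shrink it.
>>>>>   #include <thread>
>>>>>   #include <vector>
>>>>>
>>>>>   volatile unsigned long sink; // keeps the loop from being optimized away
>>>>>
>>>>>   static void busyWork(unsigned long iters) {
>>>>>     unsigned long x = 0;
>>>>>     for (unsigned long i = 0; i < iters; ++i)
>>>>>       x += i;
>>>>>     sink = x;
>>>>>   }
>>>>>
>>>>>   int main() {
>>>>>     std::vector<std::thread> threads;
>>>>>     for (int i = 0; i < 4; ++i)
>>>>>       threads.emplace_back(busyWork, 400000000UL);
>>>>>     for (std::thread &t : threads)
>>>>>       t.join();
>>>>>   }
>>>>>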
>>>>> Your result shows a significant improvement in user time, which
>>>>> means you have significantly reduced the amount of processing time
>>>>> needed to do the same thing. However, because this change doesn't
>>>>> change the algorithm, but just executes it in parallel, that
>>>>> shouldn't happen.
>>>>>
>>>>> Something's not correct.
>>>>>
>>>>> I appreciate your effort to make LLD faster, but we need to be
>>>>> careful about benchmark results. If we don't measure improvements
>>>>> accurately, it's easy to make an "optimization" that makes things
>>>>> slower.
>>>>>
>>>>> Another important thing is to be skeptical of your own work when you
>>>>> optimize something and measure its effect. It sometimes happens that
>>>>> I'm 100% sure something is going to improve performance, but it
>>>>> actually doesn't.
>>>>>
>>>>> time taken to build lld with bfd
>>>>> --------------------------------
>>>>> real    0m8.419s
>>>>> user    0m7.748s
>>>>> sys     0m0.632s
>>>>>
>>>>> Modified:
>>>>>        lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>>        lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>>
>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>> ==============================================================================
>>>>>
>>>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
>>>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
>>>>>       _elfHeader->write(this, _layout, *buffer);
>>>>>       _programHeader->write(this, _layout, *buffer);
>>>>>
>>>>> -  for (auto section : _layout.sections())
>>>>> -    section->write(this, _layout, *buffer);
>>>>> +  auto sections = _layout.sections();
>>>>> +  parallel_for_each(
>>>>> +      sections.begin(), sections.end(),
>>>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
>>>>>       writeTask.end();
>>>>>
>>>>>       ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
>>>>>
>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>> ==============================================================================
>>>>>
>>>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
>>>>> @@ -234,17 +234,17 @@ public:
>>>>>       /// routine gets called after the linker fixes up the virtual address
>>>>>       /// of the section
>>>>>       virtual void assignVirtualAddress(uint64_t addr) override {
>>>>> -    for (auto &ai : _atoms) {
>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>           ai->_virtualAddr = addr + ai->_fileOffset;
>>>>> -    }
>>>>> +    });
>>>>>       }
>>>>>
>>>>>       /// \brief Set the file offset of each Atom in the section. This routine
>>>>>       /// gets called after the linker fixes up the section offset
>>>>>       void assignFileOffsets(uint64_t offset) override {
>>>>> -    for (auto &ai : _atoms) {
>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>           ai->_fileOffset = offset + ai->_fileOffset;
>>>>> -    }
>>>>> +    });
>>>>>       }
>>>>>
>>>>>       /// \brief Find the Atom address given a name, this is needed to properly
>>>>>
>>>>>
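
For reference, here is a minimal sketch of what a parallel_for_each-style
helper does (illustrative only; lld's actual helper may differ in details):
it splits the range into contiguous chunks and runs one chunk per hardware
thread.

  #include <algorithm>
  #include <iterator>
  #include <thread>
  #include <vector>

  template <class RandomIt, class Func>
  void parallel_for_each(RandomIt begin, RandomIt end, Func fn) {
    size_t len = std::distance(begin, end);
    size_t nthreads = std::max<size_t>(1, std::thread::hardware_concurrency());
    size_t chunk = (len + nthreads - 1) / nthreads; // ceiling division
    std::vector<std::thread> workers;
    for (size_t i = 0; i < len; i += chunk) {
      RandomIt first = begin + i;
      RandomIt last = begin + std::min(i + chunk, len);
      // Each worker applies fn to its own contiguous slice.
      workers.emplace_back([=] { std::for_each(first, last, fn); });
    }
    for (std::thread &t : workers)
      t.join();
  }

The loops in the patch are safe under this scheme because each invocation
of the lambda writes only to its own atom (or, for section writing, to that
section's own region of the output buffer), so workers never touch the same
data.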


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation




