[lld] r232460 - [ELF] Use parallel_for_each for writing.

Tue Mar 17 15:27:07 PDT 2015

Why don't you just run it many more times?
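
With more runs, the two sets of timings can be compared by mean and spread rather than by eye. A minimal sketch, assuming only the elapsed times quoted below; the names `Summary`, `summarize`, `kWithPatch` and `kWithoutPatch` are made up for illustration and are not part of lld:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Elapsed (wall-clock) seconds copied from the four runs quoted below.
const std::vector<double> kWithPatch = {3.06, 2.93, 3.08, 2.93};
const std::vector<double> kWithoutPatch = {3.32, 3.22, 3.14, 3.08};

struct Summary {
  double mean;    // arithmetic mean of the samples
  double stddev;  // sample standard deviation (n - 1 in the denominator)
  double stderr_; // standard error of the mean: stddev / sqrt(n)
};

// Summarize one set of timings. Requires at least two samples.
Summary summarize(const std::vector<double> &samples) {
  const double n = static_cast<double>(samples.size());

  double sum = 0.0;
  for (double s : samples)
    sum += s;
  const double mean = sum / n;

  double squares = 0.0;
  for (double s : samples)
    squares += (s - mean) * (s - mean);
  const double stddev = std::sqrt(squares / (n - 1.0));

  return {mean, stddev, stddev / std::sqrt(n)};
}
```

The standard error shrinks roughly with the square root of the number of runs, so collecting many more samples is the cheapest way to tell whether a gap between `summarize(kWithPatch).mean` and `summarize(kWithoutPatch).mean` is larger than the noise.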

On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran <shankare at codeaurora.org>
wrote:

> I'm not sure whether repeating this experiment on different Unixes, or
> linking the same object files on Windows, would give more information.
>
> How many data points do you usually collect?
>
> Shankar Easwaran
>
>
> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
>
>> I reformatted your results here. As you can see, the S/N is too low. We
>> probably cannot conclude anything from only four data points.
>>
>> LLD with patch
>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata
>> 7174160maxresident)k
>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata
>> 7175808maxresident)k
>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata
>> 7176320maxresident)k
>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata
>> 7175120maxresident)k
>>
>> LLD without patch
>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata
>> 7179984maxresident)k
>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata
>> 7172704maxresident)k
>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata
>> 7175600maxresident)k
>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata
>> 7174864maxresident)k
>>
>>
>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran <
>> shankare at codeaurora.org>
>> wrote:
>>
>>> I measured this again with 4 tries, just to make sure, and I see a few
>>> results identical to what I measured before:
>>>
>>> *Raw data below :-*
>>>
>>>
>>> LLD Try With Patch #1
>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata
>>> 7174160maxresident)k
>>> LLD Try Without Patch #1
>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata
>>> 7179984maxresident)k
>>> BFD Try #1
>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata
>>> 3230416maxresident)k
>>> LLD Try With Patch #2
>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata
>>> 7175808maxresident)k
>>> LLD Try Without Patch #2
>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata
>>> 7172704maxresident)k
>>> BFD Try #2
>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata
>>> 3230416maxresident)k
>>> LLD Try With Patch #3
>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata
>>> 7176320maxresident)k
>>> LLD Try Without Patch #3
>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata
>>> 7175600maxresident)k
>>> BFD Try #3
>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata
>>> 3230416maxresident)k
>>> LLD Try With Patch #4
>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata
>>> 7175120maxresident)k
>>> LLD Try Without Patch #4
>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata
>>> 7174864maxresident)k
>>> BFD Try #4
>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata
>>> 3230416maxresident)k
>>>
>>> *Questions :-*
>>>
>>> As Rui mentions, I don't know why the user time is higher without the patch.
>>> Are there any methods to verify this?
>>> Could this be because of user threads instead of kernel threads?
>>>
>>> Shankar Easwaran
>>>
>>>
>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
>>>
>>> Yes, this is true. There were several logs of runs in the same file that I
>>> read into the commit message, and manually removing them resulted in two
>>> "user" lines.
>>>
>>> But the result is nonetheless real. I can re-measure the time taken though.
>>>
>>> Shankar Easwaran
>>>
>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
>>>
>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
>>> <shankare at codeaurora.org> wrote:
>>>
>>> Author: shankare
>>> Date: Mon Mar 16 22:29:32 2015
>>> New Revision: 232460
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
>>> Log:
>>> [ELF] Use parallel_for_each for writing.
>>>
>>> This change improves the performance of lld when self-hosting lld, compared
>>> with the bfd linker. The BFD linker takes 8 seconds of elapsed time on
>>> average. lld takes 3 seconds of elapsed time on average. Without this change,
>>> lld takes ~5 seconds on average. The runtime comparisons were done on a
>>> release build and measured by running the link three times.
>>>
>>> lld self-host without the change
>>> ----------------------------------
>>> real    0m3.196s
>>> user    0m4.580s
>>> sys     0m0.832s
>>>
>>> lld self-host with lld
>>> -----------------------
>>> user    0m3.024s
>>> user    0m3.252s
>>> sys     0m0.796s
>>>
>>> The above results don't look like real output of the "time" command.
>>>
>>> If they are real, they are too good to be true, assuming the first line of
>>> the second result is "real" instead of "user".
>>>
>>> "real" is wall clock time from process start to process exit. "user" is CPU
>>> time consumed by the process in user mode (if a process is multi-threaded,
>>> it can be larger than real).
>>>
>>> Your result shows a significant improvement in user time, which means you
>>> have significantly reduced the amount of processing needed to do the same
>>> thing as before. However, because this change didn't change the algorithm
>>> but just executes the same work in parallel, that cannot happen.
>>>
>>> Something's not correct.
>>>
>>> I appreciate your effort to make LLD faster, but we need to be careful
>>> about benchmark results. If we don't measure improvements accurately, it's
>>> easy to make an "optimization" that makes things slower.
>>>
>>> Another important thing is to distrust your own expectations when you
>>> optimize something and measure its effect. It sometimes happens that I am
>>> 100% sure something is going to improve performance, but it actually
>>> doesn't.
>>>
>>> time taken to build lld with bfd
>>> --------------------------------
>>> real    0m8.419s
>>> user    0m7.748s
>>> sys     0m0.632s
>>>
>>> Modified:
>>>       lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>       lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>
>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
>>> ==============================================================================
>>>
>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
>>>      _elfHeader->write(this, _layout, *buffer);
>>>      _programHeader->write(this, _layout, *buffer);
>>>
>>> -  for (auto section : _layout.sections())
>>> -    section->write(this, _layout, *buffer);
>>> +  auto sections = _layout.sections();
>>> +  parallel_for_each(
>>> +      sections.begin(), sections.end(),
>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
>>>      writeTask.end();
>>>
>>>      ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
>>>
>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
>>> ==============================================================================
>>>
>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
>>> @@ -234,17 +234,17 @@ public:
>>>      /// routine gets called after the linker fixes up the virtual address
>>>      /// of the section
>>>      virtual void assignVirtualAddress(uint64_t addr) override {
>>> -    for (auto &ai : _atoms) {
>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>          ai->_virtualAddr = addr + ai->_fileOffset;
>>> -    }
>>> +    });
>>>      }
>>>
>>>      /// \brief Set the file offset of each Atom in the section. This routine
>>>      /// gets called after the linker fixes up the section offset
>>>      void assignFileOffsets(uint64_t offset) override {
>>> -    for (auto &ai : _atoms) {
>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>          ai->_fileOffset = offset + ai->_fileOffset;
>>> -    }
>>> +    });
>>>      }
>>>
>>>      /// \brief Find the Atom address given a name, this is needed to properly
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>> hosted by the Linux Foundation
>>>
>>>
>>>
>