[lld] r232460 - [ELF] Use parallel_for_each for writing.

Shankar Easwaran shankare at codeaurora.org
Wed Mar 18 09:14:11 PDT 2015


Does this repeat with the same numbers across similar tries?

On 3/18/2015 11:04 AM, Rafael Espíndola wrote:
> In this case, when linking a Release+Asserts clang, what I got was:
>
> master:
>
>         1858.530684      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.02% )
>               1,246      context-switches          #    0.670 K/sec
>                   0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
>             191,223      page-faults               #    0.103 M/sec                    ( +-  0.00% )
>       5,579,119,294      cycles                    #    3.002 GHz                      ( +-  0.02% )
>       3,086,413,171      stalled-cycles-frontend   #   55.32% frontend cycles idle     ( +-  0.03% )
>     <not supported>      stalled-cycles-backend
>       6,059,256,591      instructions              #    1.09  insns per cycle
>                                                    #    0.51  stalled cycles per insn  ( +-  0.00% )
>       1,261,645,273      branches                  #  678.840 M/sec                    ( +-  0.00% )
>          26,517,441      branch-misses             #    2.10% of all branches          ( +-  0.00% )
>
>         1.860335083 seconds time elapsed                                               ( +-  0.02% )
>
>
> master with your patch reverted:
>
>
>         1840.225861      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.06% )
>               1,170      context-switches          #    0.636 K/sec
>                   0      cpu-migrations            #    0.000 K/sec                    ( +- 68.82% )
>             191,225      page-faults               #    0.104 M/sec                    ( +-  0.00% )
>       5,532,122,558      cycles                    #    3.006 GHz                      ( +-  0.04% )
>       3,052,067,591      stalled-cycles-frontend   #   55.17% frontend cycles idle     ( +-  0.08% )
>     <not supported>      stalled-cycles-backend
>       6,002,264,641      instructions              #    1.08  insns per cycle
>                                                    #    0.51  stalled cycles per insn  ( +-  0.00% )
>       1,250,316,604      branches                  #  679.436 M/sec                    ( +-  0.00% )
>          26,532,702      branch-misses             #    2.12% of all branches          ( +-  0.00% )
>
>         1.842000792 seconds time elapsed                                               ( +-  0.06% )
>
>
> On 18 March 2015 at 11:55, Rafael Espíndola <rafael.espindola at gmail.com> wrote:
>> Are you on Linux? What I normally do for benchmarking is:
>>
>> * Put all the files on tmpfs.
>> * Disable address space randomization:
>>     echo 0 > /proc/sys/kernel/randomize_va_space
>> * Disable cpu frequency scaling:
>>     for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
>>       echo performance > $i; done
>> * Use perf to run it multiple times and schedtool to run it at very
>> high priority:
>>     sudo schedtool -F -p 99 -a 0x4 -e perf stat -r 20
>>
>>
>> On 17 March 2015 at 18:27, Rui Ueyama <ruiu at google.com> wrote:
>>> Why don't you just run it many more times?
>>>
>>> On Tue, Mar 17, 2015 at 3:20 PM, Shankar Easwaran <shankare at codeaurora.org>
>>> wrote:
>>>> I'm not sure whether repeating this same experiment on different Unixes,
>>>> or linking the same object files on Windows, would give more information?
>>>>
>>>> How many data points do you usually collect?
>>>>
>>>> Shankar Easwaran
>>>>
>>>>
>>>> On 3/17/2015 5:10 PM, Rui Ueyama wrote:
>>>>> I reformatted your results here. As you can see, the signal-to-noise
>>>>> ratio is too low. Maybe we cannot say anything from only four data points.
>>>>>
>>>>> LLD with patch
>>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
>>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
>>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
>>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
>>>>>
>>>>> LLD without patch
>>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
>>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
>>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
>>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
>>>>>
>>>>>
>>>>> On Tue, Mar 17, 2015 at 2:57 PM, Shankar Easwaran
>>>>> <shankare at codeaurora.org>
>>>>> wrote:
>>>>>
>>>>>> I measured this again with 4 tries, just to make sure, and I see a few
>>>>>> results identical to what I measured before:
>>>>>>
>>>>>> *Raw data below:*
>>>>>>
>>>>>>
>>>>>> LLD Try With Patch #1
>>>>>> 4.16user 0.80system 0:03.06elapsed 162%CPU (0avgtext+0avgdata 7174160maxresident)k
>>>>>> LLD Try Without Patch #1
>>>>>> 4.49user 0.92system 0:03.32elapsed 162%CPU (0avgtext+0avgdata 7179984maxresident)k
>>>>>> BFD Try #1
>>>>>> 7.81user 0.68system 0:08.53elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>>> LLD Try With Patch #2
>>>>>> 3.94user 0.86system 0:02.93elapsed 163%CPU (0avgtext+0avgdata 7175808maxresident)k
>>>>>> LLD Try Without Patch #2
>>>>>> 4.12user 0.83system 0:03.22elapsed 154%CPU (0avgtext+0avgdata 7172704maxresident)k
>>>>>> BFD Try #2
>>>>>> 7.78user 0.75system 0:08.57elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>>> LLD Try With Patch #3
>>>>>> 4.36user 1.05system 0:03.08elapsed 175%CPU (0avgtext+0avgdata 7176320maxresident)k
>>>>>> LLD Try Without Patch #3
>>>>>> 4.38user 0.90system 0:03.14elapsed 168%CPU (0avgtext+0avgdata 7175600maxresident)k
>>>>>> BFD Try #3
>>>>>> 7.78user 0.64system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>>> LLD Try With Patch #4
>>>>>> 4.17user 0.72system 0:02.93elapsed 166%CPU (0avgtext+0avgdata 7175120maxresident)k
>>>>>> LLD Try Without Patch #4
>>>>>> 4.20user 0.79system 0:03.08elapsed 161%CPU (0avgtext+0avgdata 7174864maxresident)k
>>>>>> BFD Try #4
>>>>>> 7.77user 0.66system 0:08.46elapsed 99%CPU (0avgtext+0avgdata 3230416maxresident)k
>>>>>>
>>>>>> *Questions:*
>>>>>>
>>>>>> As Rui mentions, I don't know why the user time is higher without the
>>>>>> patch; are there any methods to verify this?
>>>>>> Could this be because of user threads instead of kernel threads?
>>>>>>
>>>>>> Shankar Easwaran
>>>>>>
>>>>>>
>>>>>> On 3/17/2015 3:35 PM, Shankar Easwaran wrote:
>>>>>>
>>>>>> Yes, this is true. There were several logs of runs in the same file
>>>>>> that I copied into the commit message, and manually removing them
>>>>>> resulted in two "user" lines.
>>>>>>
>>>>>> But the result is nevertheless true. I can re-measure the time taken,
>>>>>> though.
>>>>>>
>>>>>> Shankar Easwaran
>>>>>>
>>>>>> On 3/17/2015 2:30 PM, Rui Ueyama wrote:
>>>>>>
>>>>>> On Mon, Mar 16, 2015 at 8:29 PM, Shankar Easwaran
>>>>>> <shankare at codeaurora.org> wrote:
>>>>>>
>>>>>> Author: shankare
>>>>>> Date: Mon Mar 16 22:29:32 2015
>>>>>> New Revision: 232460
>>>>>>
>>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=232460&view=rev
>>>>>> Log:
>>>>>> [ELF] Use parallel_for_each for writing.
>>>>>>
>>>>>> This change improves the performance of lld when self-hosting lld,
>>>>>> compared with the bfd linker. The BFD linker takes 8 seconds of elapsed
>>>>>> time on average; lld takes 3 seconds of elapsed time on average. Without
>>>>>> this change, lld takes ~5 seconds on average. The runtime comparisons
>>>>>> were done on a release build and measured by running the link three
>>>>>> times.
>>>>>>
>>>>>> lld self-host without the change
>>>>>> ----------------------------------
>>>>>> real    0m3.196s
>>>>>> user    0m4.580s
>>>>>> sys     0m0.832s
>>>>>>
>>>>>> lld self-host with the change
>>>>>> -----------------------------
>>>>>> user    0m3.024s
>>>>>> user    0m3.252s
>>>>>> sys     0m0.796s
>>>>>>
>>>>>>    The above results don't look like real output of the "time" command.
>>>>>>
>>>>>> If it's real, it's too good to be true, assuming the first line of the
>>>>>> second result is "real" instead of "user".
>>>>>>
>>>>>> "real" is wall clock time from process start to process exit. "user" is
>>>>>> CPU
>>>>>> time consumed by the process in user mode (if a process is
>>>>>> multi-threaded,
>>>>>> it can be larger than real).
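>>>>>>
>>>>>> To make the distinction concrete, here is a minimal, hypothetical C++
>>>>>> sketch (not from lld; file and variable names are made up) that burns
>>>>>> CPU on four threads. Run under "time", it should report a "user" value
>>>>>> several times larger than "real":
>>>>>>
>>>>>> // busy.cpp -- hypothetical demo that "user" can exceed "real".
>>>>>> // Build and run: clang++ -pthread busy.cpp -o busy && time ./busy
>>>>>> #include <thread>
>>>>>> #include <vector>
>>>>>>
>>>>>> int main() {
>>>>>>   const int kThreads = 4;
>>>>>>   std::vector<unsigned long> results(kThreads);
>>>>>>   std::vector<std::thread> threads;
>>>>>>   for (int i = 0; i < kThreads; ++i)
>>>>>>     threads.emplace_back([&results, i] {
>>>>>>       // Each thread spins, accumulating user-mode CPU time in parallel.
>>>>>>       unsigned long n = 0;
>>>>>>       for (unsigned long j = 0; j < 500000000UL; ++j)
>>>>>>         n += j * i;
>>>>>>       results[i] = n; // store the result so the work stays observable
>>>>>>     });
>>>>>>   for (auto &t : threads)
>>>>>>     t.join();
>>>>>>   return results[0] > results[1];
>>>>>> }
>>>>>>
>>>>>> Here each thread does its own 500M iterations, so "user" is roughly four
>>>>>> times "real". Conversely, splitting a fixed amount of work across
>>>>>> threads shrinks "real" but leaves "user" roughly constant, which is why
>>>>>> a pure parallelization change shouldn't reduce user time.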
>>>>>>
>>>>>> Your result shows a significant improvement in user time, which means
>>>>>> you significantly reduced the amount of processing needed to do the same
>>>>>> thing compared to before. However, because this change doesn't change
>>>>>> the algorithm, but just executes it in parallel, that couldn't happen.
>>>>>>
>>>>>> Something's not correct.
>>>>>>
>>>>>> I appreciate your effort to make LLD faster, but we need to be careful
>>>>>> about benchmark results. If we don't measure improvements accurately,
>>>>>> it's
>>>>>> easy to make an "optimization" that makes things slower.
>>>>>>
>>>>>> Another important thing is to be skeptical of your own work when you
>>>>>> optimize something and measure its effect. It sometimes happens that I
>>>>>> am 100% sure something is going to improve performance, but it actually
>>>>>> doesn't.
>>>>>>
>>>>>> time taken to build lld with bfd
>>>>>> --------------------------------
>>>>>> real    0m8.419s
>>>>>> user    0m7.748s
>>>>>> sys     0m0.632s
>>>>>>
>>>>>> Modified:
>>>>>>        lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>>>        lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>>>
>>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h
>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>>> ==============================================================================
>>>>>>
>>>>>> --- lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h (original)
>>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/OutputELFWriter.h Mon Mar 16 22:29:32 2015
>>>>>> @@ -586,8 +586,10 @@ std::error_code OutputELFWriter<ELFT>::w
>>>>>>    _elfHeader->write(this, _layout, *buffer);
>>>>>>    _programHeader->write(this, _layout, *buffer);
>>>>>>
>>>>>> -  for (auto section : _layout.sections())
>>>>>> -    section->write(this, _layout, *buffer);
>>>>>> +  auto sections = _layout.sections();
>>>>>> +  parallel_for_each(
>>>>>> +      sections.begin(), sections.end(),
>>>>>> +      [&](Chunk<ELFT> *section) { section->write(this, _layout, *buffer); });
>>>>>>    writeTask.end();
>>>>>>
>>>>>>    ScopedTask commitTask(getDefaultDomain(), "ELF Writer commit to disk");
>>>>>>
>>>>>> Modified: lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h
>>>>>> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h?rev=232460&r1=232459&r2=232460&view=diff
>>>>>> ==============================================================================
>>>>>>
>>>>>> --- lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h (original)
>>>>>> +++ lld/trunk/lib/ReaderWriter/ELF/SectionChunks.h Mon Mar 16 22:29:32 2015
>>>>>> @@ -234,17 +234,17 @@ public:
>>>>>>    /// routine gets called after the linker fixes up the virtual address
>>>>>>    /// of the section
>>>>>>    virtual void assignVirtualAddress(uint64_t addr) override {
>>>>>> -    for (auto &ai : _atoms) {
>>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>>        ai->_virtualAddr = addr + ai->_fileOffset;
>>>>>> -    }
>>>>>> +    });
>>>>>>    }
>>>>>>
>>>>>>    /// \brief Set the file offset of each Atom in the section. This routine
>>>>>>    /// gets called after the linker fixes up the section offset
>>>>>>    void assignFileOffsets(uint64_t offset) override {
>>>>>> -    for (auto &ai : _atoms) {
>>>>>> +    parallel_for_each(_atoms.begin(), _atoms.end(), [&](AtomLayout *ai) {
>>>>>>        ai->_fileOffset = offset + ai->_fileOffset;
>>>>>> -    }
>>>>>> +    });
>>>>>>    }
>>>>>>
>>>>>>    /// \brief Find the Atom address given a name, this is needed to properly
>>>>>>
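>>>>>> For readers unfamiliar with the primitive used above, here is a minimal,
>>>>>> hypothetical sketch of the divide-and-conquer pattern a parallel_for_each
>>>>>> typically implements; this is an illustration under assumed names, not
>>>>>> lld's actual implementation:
>>>>>>
>>>>>> // parallel_for_each_sketch.h -- hypothetical illustration; not lld's code.
>>>>>> #include <algorithm>
>>>>>> #include <future>
>>>>>> #include <iterator>
>>>>>>
>>>>>> // Requires random-access iterators; fn must be safe to run concurrently
>>>>>> // on distinct elements.
>>>>>> template <class RandomIt, class Func>
>>>>>> void parallel_for_each_sketch(RandomIt begin, RandomIt end, Func fn) {
>>>>>>   auto taskSize = std::distance(begin, end);
>>>>>>   // Below some threshold, threading overhead outweighs the benefit.
>>>>>>   if (taskSize < 1024) {
>>>>>>     std::for_each(begin, end, fn);
>>>>>>     return;
>>>>>>   }
>>>>>>   RandomIt mid = begin + taskSize / 2;
>>>>>>   // Recurse on the first half asynchronously, the second half inline.
>>>>>>   auto fut = std::async(std::launch::async,
>>>>>>                         [=] { parallel_for_each_sketch(begin, mid, fn); });
>>>>>>   parallel_for_each_sketch(mid, end, fn);
>>>>>>   fut.get();
>>>>>> }
>>>>>>
>>>>>> Note that parallelizing the loops in this patch is only safe because each
>>>>>> section (and each atom) writes a disjoint part of the output; if the loop
>>>>>> bodies shared mutable state, this transformation would introduce data
>>>>>> races.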


-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by the Linux Foundation




