<div dir="ltr">It's not strange. Making something parallel doesn't always make it run faster; often it makes things even slower. That's exactly why I emphasized the importance of accurate benchmarks. (Note that these numbers come from linking Clang. You might see different results with other programs.)<div><br></div><div>Rafael, it's the ELF writer. Unless you cross-link ELF executables on Windows, this piece of code is not executed on Windows.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Mar 18, 2015 at 9:32 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">As with anything threading-related, it might also be worth<br>
benchmarking it on Windows.<br>
<div class="HOEnZb"><div class="h5"><br>
On 18 March 2015 at 12:31, Shankar Easwaran <<a href="mailto:shankare@codeaurora.org">shankare@codeaurora.org</a>> wrote:<br>
> It looks like these are the right numbers. Strange, I don't see a huge<br>
> advantage from the patch that parallelizes writing output<br>
> sections.<br>
><br>
><br>
> On 3/18/2015 11:23 AM, Rafael Espíndola wrote:<br>
>><br>
>> On 18 March 2015 at 12:14, Shankar Easwaran <<a href="mailto:shankare@codeaurora.org">shankare@codeaurora.org</a>><br>
>> wrote:<br>
>>><br>
>>> Does this repeat with the same numbers across similar tries?<br>
>><br>
>> The "-r 20" tells perf to do 20 runs. Repeating the entire thing as a<br>
>> sanity check, I got:<br>
>><br>
>><br>
>> master:<br>
>>     1850.315854   task-clock (msec)         #   0.999 CPUs utilized          ( +-  0.20% )<br>
>>           1,246   context-switches          #   0.673 K/sec<br>
>>               0   cpu-migrations            #   0.000 K/sec                  ( +-100.00% )<br>
>>         191,223   page-faults               #   0.103 M/sec                  ( +-  0.00% )<br>
>>   5,570,279,746   cycles                    #   3.010 GHz                    ( +-  0.08% )<br>
>>   3,076,652,220   stalled-cycles-frontend   #  55.23% frontend cycles idle   ( +-  0.15% )<br>
>> <not supported>   stalled-cycles-backend<br>
>>   6,061,467,442   instructions              #   1.09 insns per cycle<br>
>>                                             #   0.51 stalled cycles per insn ( +-  0.00% )<br>
>>   1,262,014,047   branches                  # 682.053 M/sec                  ( +-  0.00% )<br>
>>      26,526,169   branch-misses             #   2.10% of all branches        ( +-  0.00% )<br>
>><br>
>>     1.852094924 seconds time elapsed                                         ( +-  0.20% )<br>
>><br>
>> master minus your patch:<br>
>><br>
>>     1837.986418   task-clock (msec)         #   0.999 CPUs utilized          ( +-  0.01% )<br>
>>           1,170   context-switches          #   0.637 K/sec<br>
>>               0   cpu-migrations            #   0.000 K/sec<br>
>>         191,225   page-faults               #   0.104 M/sec                  ( +-  0.00% )<br>
>>   5,517,484,340   cycles                    #   3.002 GHz                    ( +-  0.01% )<br>
>>   3,036,583,530   stalled-cycles-frontend   #  55.04% frontend cycles idle   ( +-  0.02% )<br>
>> <not supported>   stalled-cycles-backend<br>
>>   6,004,436,870   instructions              #   1.09 insns per cycle<br>
>>                                             #   0.51 stalled cycles per insn ( +-  0.00% )<br>
>>   1,250,685,716   branches                  # 680.465 M/sec                  ( +-  0.00% )<br>
>>      26,539,486   branch-misses             #   2.12% of all branches        ( +-  0.00% )<br>
>><br>
>>     1.839759787 seconds time elapsed                                         ( +-  0.01% )<br>
>><br>
>><br>
>> Cheers,<br>
>> Rafael<br>
>><br>
><br>
><br>
> --<br>
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by<br>
> the Linux Foundation<br>
><br>
</div></div></blockquote></div><br></div>
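<div>The two <code>perf stat -r 20</code> summaries quoted above can be compared with a little arithmetic. This is only a minimal sketch in Python: the dictionary keys and the <code>relative_change</code> helper are illustrative names, not part of perf or lld, and the figures are copied by hand from the quoted output.</div>

```python
# Figures copied from the two `perf stat -r 20` summaries quoted in this thread.
# The labels and key names are illustrative, not perf's own identifiers.
runs = {
    "master": {
        "elapsed_s": 1.852094924,
        "cycles": 5_570_279_746,
        "instructions": 6_061_467_442,
    },
    "master minus patch": {
        "elapsed_s": 1.839759787,
        "cycles": 5_517_484_340,
        "instructions": 6_004_436_870,
    },
}


def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


with_patch = runs["master"]
without_patch = runs["master minus patch"]

# A positive value means the build with the patch used more of that resource.
for metric in with_patch:
    delta = relative_change(with_patch[metric], without_patch[metric])
    print(f"{metric}: {delta:+.2f}% with the patch")
```

<div>On these figures the build with the patch comes out roughly 0.7% slower in elapsed time. That is a small gap, but it is larger than the 0.20% run-to-run variation perf reports for master.</div>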