[lld] r287140 - Reduce number of tasks in parallel_for_each.
Rafael EspĂndola via llvm-commits
llvm-commits at lists.llvm.org
Wed Nov 16 11:42:56 PST 2016
LGTM
On 16 November 2016 at 14:27, Rui Ueyama via llvm-commits
<llvm-commits at lists.llvm.org> wrote:
> Author: ruiu
> Date: Wed Nov 16 13:27:33 2016
> New Revision: 287140
>
> URL: http://llvm.org/viewvc/llvm-project?rev=287140&view=rev
> Log:
> Reduce number of tasks in parallel_for_each.
>
> TaskGroup has a fairly high overhead, so we don't want to partition
> tasks into too small tasks. This patch partition tasks into up to
> 1024 tasks.
>
> I compared this patch with the original LLD's parallel_for_each.
> I reverted r287042 locally for comparison.
>
> With this patch, time to self-link lld with debug info changed from
> 6.23 seconds to 4.62 seconds (-25.8%), with -threads and without -build-id.
> With both -threads and -build-id, it improved from 11.71 seconds
> to 4.94 seconds (-57.8%). Full results are below.
>
> BTW, GNU gold takes 11.65 seconds to link the same binary.
>
> NOW
>
> --no-threads --build-id=none
> 6789.847776 task-clock (msec) # 1.000 CPUs utilized ( +- 1.86% )
> 685 context-switches # 0.101 K/sec ( +- 2.82% )
> 4 cpu-migrations # 0.001 K/sec ( +- 31.18% )
> 1,424,690 page-faults # 0.210 M/sec ( +- 1.07% )
> 21,339,542,522 cycles # 3.143 GHz ( +- 1.49% )
> 13,092,260,230 stalled-cycles-frontend # 61.35% frontend cycles idle ( +- 2.23% )
> <not supported> stalled-cycles-backend
> 21,462,051,828 instructions # 1.01 insns per cycle
> # 0.61 stalled cycles per insn ( +- 0.41% )
> 3,955,296,378 branches # 582.531 M/sec ( +- 0.39% )
> 75,699,909 branch-misses # 1.91% of all branches ( +- 0.08% )
>
> 6.787630744 seconds time elapsed ( +- 1.86% )
>
> --threads --build-id=none
> 14767.148697 task-clock (msec) # 3.196 CPUs utilized ( +- 2.56% )
> 28,891 context-switches # 0.002 M/sec ( +- 1.99% )
> 905 cpu-migrations # 0.061 K/sec ( +- 5.49% )
> 1,262,122 page-faults # 0.085 M/sec ( +- 1.68% )
> 43,116,163,217 cycles # 2.920 GHz ( +- 3.07% )
> 33,690,171,242 stalled-cycles-frontend # 78.14% frontend cycles idle ( +- 3.67% )
> <not supported> stalled-cycles-backend
> 22,836,731,536 instructions # 0.53 insns per cycle
> # 1.48 stalled cycles per insn ( +- 1.13% )
> 4,382,712,998 branches # 296.788 M/sec ( +- 1.33% )
> 78,622,295 branch-misses # 1.79% of all branches ( +- 0.54% )
>
> 4.621228056 seconds time elapsed ( +- 1.90% )
>
> --threads --build-id=sha1
> 24594.457135 task-clock (msec) # 4.974 CPUs utilized ( +- 1.78% )
> 29,902 context-switches # 0.001 M/sec ( +- 2.62% )
> 1,097 cpu-migrations # 0.045 K/sec ( +- 6.29% )
> 1,313,947 page-faults # 0.053 M/sec ( +- 2.36% )
> 70,516,415,741 cycles # 2.867 GHz ( +- 0.78% )
> 47,570,262,296 stalled-cycles-frontend # 67.46% frontend cycles idle ( +- 0.86% )
> <not supported> stalled-cycles-backend
> 73,124,599,029 instructions # 1.04 insns per cycle
> # 0.65 stalled cycles per insn ( +- 0.33% )
> 10,495,266,104 branches # 426.733 M/sec ( +- 0.41% )
> 91,444,149 branch-misses # 0.87% of all branches ( +- 0.83% )
>
> 4.944291711 seconds time elapsed ( +- 1.72% )
>
> PREVIOUS
>
> --threads --build-id=none
> 7307.437544 task-clock (msec) # 1.160 CPUs utilized ( +- 2.34% )
> 3,128 context-switches # 0.428 K/sec ( +- 4.37% )
> 352 cpu-migrations # 0.048 K/sec ( +- 5.98% )
> 1,354,450 page-faults # 0.185 M/sec ( +- 2.20% )
> 22,081,733,098 cycles # 3.022 GHz ( +- 1.46% )
> 13,709,991,267 stalled-cycles-frontend # 62.09% frontend cycles idle ( +- 1.77% )
> <not supported> stalled-cycles-backend
> 21,634,468,895 instructions # 0.98 insns per cycle
> # 0.63 stalled cycles per insn ( +- 0.86% )
> 3,993,062,361 branches # 546.438 M/sec ( +- 0.83% )
> 76,188,819 branch-misses # 1.91% of all branches ( +- 0.19% )
>
> 6.298101157 seconds time elapsed ( +- 2.03% )
>
> --threads --build-id=sha1
> 12845.420265 task-clock (msec) # 1.097 CPUs utilized ( +- 1.95% )
> 4,020 context-switches # 0.313 K/sec ( +- 2.89% )
> 369 cpu-migrations # 0.029 K/sec ( +- 6.26% )
> 1,464,822 page-faults # 0.114 M/sec ( +- 1.37% )
> 40,668,449,813 cycles # 3.166 GHz ( +- 0.96% )
> 18,863,982,388 stalled-cycles-frontend # 46.38% frontend cycles idle ( +- 1.82% )
> <not supported> stalled-cycles-backend
> 71,560,499,058 instructions # 1.76 insns per cycle
> # 0.26 stalled cycles per insn ( +- 0.14% )
> 10,044,152,441 branches # 781.925 M/sec ( +- 0.19% )
> 87,835,773 branch-misses # 0.87% of all branches ( +- 0.09% )
>
> 11.711773314 seconds time elapsed ( +- 1.51% )
>
> Modified:
> lld/trunk/include/lld/Core/Parallel.h
>
> Modified: lld/trunk/include/lld/Core/Parallel.h
> URL: http://llvm.org/viewvc/llvm-project/lld/trunk/include/lld/Core/Parallel.h?rev=287140&r1=287139&r2=287140&view=diff
> ==============================================================================
> --- lld/trunk/include/lld/Core/Parallel.h (original)
> +++ lld/trunk/include/lld/Core/Parallel.h Wed Nov 16 13:27:33 2016
> @@ -283,9 +283,20 @@ void parallel_for_each(Iterator begin, I
> #else
> template <class Iterator, class Func>
> void parallel_for_each(Iterator begin, Iterator end, Func func) {
> + // TaskGroup has a relatively high overhead, so we want to reduce
> + // the number of spawn() calls. We'll create up to 1024 tasks here.
> + // (Note that 1024 is an arbitrary number. This code probably needs
> + // improving to take the number of available cores into account.)
> + ptrdiff_t taskSize = std::distance(begin, end) / 1024;
> + if (taskSize == 0)
> + taskSize = 1;
> +
> TaskGroup tg;
> - for (; begin != end; ++begin)
> - tg.spawn([=, &func] { func(*begin); });
> + while (taskSize <= std::distance(begin, end)) {
> + tg.spawn([=, &func] { std::for_each(begin, begin + taskSize, func); });
> + begin += taskSize;
> + }
> + std::for_each(begin, end, func);
> }
> #endif
> } // end namespace lld
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
More information about the llvm-commits
mailing list