<div dir="ltr">I don't think it is elegant to handle task granularity on the caller side, and the following resolves that. It works for me. What do you think?<div><br></div><div><div>template <class Iterator, class Func></div><div>void parallel_for_each(Iterator begin, Iterator end, Func func) {</div><div>  ptrdiff_t taskSize = std::distance(begin, end) / 1024;</div><div>  if (taskSize == 0)</div><div>    taskSize = 1;</div><div><br></div><div>  TaskGroup tg;</div><div>  while (taskSize <= std::distance(begin, end)) {</div><div>    tg.spawn([=, &func] { std::for_each(begin, begin + taskSize, func); });</div><div>    begin += taskSize;</div><div>  }</div><div>  std::for_each(begin, end, func);</div><div>}</div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 16, 2016 at 5:47 AM, Rafael Espíndola <span dir="ltr"><<a href="mailto:rafael.espindola@gmail.com" target="_blank">rafael.espindola@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>> If you don't want to revert it, how about this.<br>

><br>

> The problem in the original code is that the task size is fixed at 1024. We can<br>

> make it adaptive to the size of the input, so that we will always have<br>

> a reasonable number of tasks.<br>

<br>

</span>So, there is quite a bit of work to be done if we get serious about<br>

threads. We have to investigate why the pool executor has such a high<br>

overhead and figure out the right way to split work so that<br>

threads are not stepping on each other. We should very likely also<br>

create one thread per core, not one per SMT thread.<br>

<br>

As an example of a possible improvement, when outputting the file it is<br>

probably profitable to make write do nothing but write, and to partition<br>

the input sections over the whole file so that each thread can allocate<br>

local memory, relocate there and then write its work to the correct<br>

output offset with a single contiguous write call.<br>

<br>

So in general finding the correct granularity is something that I<br>

think should be explicitly done in the caller.<br>

<br>

Given that we still have a lot of work before threading becomes a<br>

priority, how about the attached compromise? It just writes each<br>

output section in parallel. In my testcase it brings the linker back to<br>

the previous performance when not using --build-id.<br>

<br>

Cheers,<br>

Rafael<br>

</blockquote></div><br></div></div></div>
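<div dir="ltr">For anyone who wants to try the chunking idea outside the linker, here is a rough self-contained sketch. It is not the proposed patch: TaskGroup is replaced with plain std::thread, the name parallelForEachChunked is made up, and the hard-coded 1024 becomes a hypothetical maxTasks parameter. Like the version above, it assumes random-access iterators and a func that is safe to call concurrently.</div>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <thread>
#include <vector>

// Illustrative stand-in for the TaskGroup-based version: std::thread replaces
// tg.spawn(), and maxTasks (an assumed parameter) replaces the constant 1024.
template <class Iterator, class Func>
void parallelForEachChunked(Iterator begin, Iterator end, Func func,
                            std::ptrdiff_t maxTasks = 8) {
  assert(maxTasks > 0);

  // The chunk size grows with the input, so fewer than 2 * maxTasks chunks
  // are ever spawned, however large the range is.
  std::ptrdiff_t taskSize = std::distance(begin, end) / maxTasks;
  if (taskSize == 0)
    taskSize = 1;

  std::vector<std::thread> workers;
  while (taskSize <= std::distance(begin, end)) {
    // begin and taskSize are copied into the closure; func is shared by
    // reference and stays alive because all workers are joined below.
    workers.emplace_back(
        [=, &func] { std::for_each(begin, begin + taskSize, func); });
    begin += taskSize;
  }

  // Whatever is left is shorter than one chunk; run it on the calling thread.
  std::for_each(begin, end, func);

  for (std::thread &t : workers)
    t.join();
}
```

<div dir="ltr">A call looks the same as std::for_each plus the optional bound, e.g. parallelForEachChunked(v.begin(), v.end(), fn, 16). Spawning one OS thread per chunk is only reasonable for a small maxTasks; a pool-based executor would be the natural replacement for larger values.</div>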