[PATCH] D142318: [Support] Add PerThreadBumpPtrAllocator class.

Alexey Lapshin via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 17 03:07:40 PDT 2023


avl added a comment.

In D142318#4271261 <https://reviews.llvm.org/D142318#4271261>, @dexonsmith wrote:

> In D142318#4270943 <https://reviews.llvm.org/D142318#4270943>, @avl wrote:
>
>> This check will help for pure users of getThreadIndex() but will not help users of PerThreadBumpPtrAllocator as it calls "detail::Executor::getDefaultExecutor()->getThreadsNum();" in the constructor. Thus any call to getThreadIndex() after PerThreadBumpPtrAllocator is created will have HasDefaultExecutor == true.
>
> I see; I've been missing that detail.
>
> You could introduce a free function, `getDefaultThreadsNum()` to Parallel.h/cpp, which in assertions mode calls `hardware_concurrency().compute_thread_count()` (redundantly) without calling `getDefaultExecutor()`. WDYT?

Originally, I tried to avoid such duplication, since there is no guarantee that getDefaultExecutor() uses exactly hardware_concurrency().compute_thread_count() threads. If the default executor strategy were ever changed, it would be easy to forget to update the getDefaultThreadsNum() function. That could probably be mitigated by adding more asserts, though.
Anyway, I tend to lean towards initializing threadIndex with -1:

  thread_local unsigned threadIndex = -1;
  inline unsigned getThreadIndex() {
    assert((threadIndex != -1) || (parallel::strategy.ThreadsRequested == 1));
    return threadIndex;
  }

This solution guarantees that getThreadIndex() is called only from threads created by ThreadPoolExecutor.
It also helps to increase the level of parallelization by allowing parallel execution of llvm::parallelFor().
Currently, there is a limitation to not having two TaskGroups running in parallel:

  // Latch::sync() called by the dtor may cause one thread to block. It is a
  // deadlock if all threads in the default executor are blocked. To prevent the
  // deadlock, only allow the first TaskGroup to run tasks parallelly. In the scenario
  // of nested parallel_for_each(), only the outermost one runs parallelly.
  TaskGroup::TaskGroup() : Parallel(TaskGroupInstances++ == 0) {}

This is done to avoid nested llvm::parallelFor(), but it also prevents independent llvm::parallelFor() calls from running in parallel:

  ThreadPool Pool(parallel::strategy);
      
  Pool.async([&]() { llvm::parallelFor(); });
  Pool.async([&]() { llvm::parallelFor(); });
  
  Pool.wait();

The above solution allows running llvm::parallelFor() in parallel without a deadlock.
Using threadIndex makes only nested llvm::parallelFor() calls sequential, while independent llvm::parallelFor() calls may run in parallel:

  TaskGroup::TaskGroup() : Parallel(threadIndex == -1) {}

Another thing is that using an assertion to check threadIndex forbids calling getThreadIndex() on the main thread.
That would break this code in lld/ELF/Relocations.cpp:

  // Deterministic parallelism needs sorting relocations which is unsuitable
  // for -z nocombreloc. MIPS and PPC64 use global states which are not suitable
  // for parallelism.
  bool serial = !config->zCombreloc || config->emachine == EM_MIPS ||
                config->emachine == EM_PPC64;
  parallel::TaskGroup tg;
  for (ELFFileBase *f : ctx.objectFiles) {
    auto fn = [f]() {
      RelocationScanner scanner;
      for (InputSectionBase *s : f->getSections()) {
        if (s && s->kind() == SectionBase::Regular && s->isLive() &&
            (s->flags & SHF_ALLOC) &&
            !(s->type == SHT_ARM_EXIDX && config->emachine == EM_ARM))
          scanner.template scanSection<ELFT>(*s);
      }
    };
    if (serial)
      fn();   <<<<< called on the main thread, it calls getThreadIndex() inside.
    else
      tg.execute(fn);
  }
  
  // Both the main thread and thread pool index 0 use getThreadIndex()==0. Be
  // careful that they don't concurrently run scanSections. When serial is
  // true, fn() has finished at this point, so running execute is safe.

To solve this problem, it is possible to add a sequential mode to TaskGroup, so that tasks marked with a
"sequential" flag are always executed on thread 0. This guarantees they always run in the same order.
It also allows executing scanSection concurrently, since the thread indices no longer overlap:

  tg.execute(fn, serial);

A short list of the changes:

1. Initialize threadIndex with -1 and add an assertion:

  thread_local unsigned threadIndex = -1;
  inline unsigned getThreadIndex() {
    assert((threadIndex != -1) || (parallel::strategy.ThreadsRequested == 1));
    return threadIndex;
  }

2. Remove the NumItems optimization, since it executes items on the main thread:

  void llvm::parallelFor(size_t Begin, size_t End,
                         llvm::function_ref<void(size_t)> Fn) {
    // If we have zero or one items, then do not incur the overhead of spinning up
    // a task group.  They are surprisingly expensive, and because they do not
    // support nested parallelism, a single entry task group can block parallel
    // execution underneath them.
  #if LLVM_ENABLE_THREADS
    auto NumItems = End - Begin;
    if (NumItems > 1 && parallel::strategy.ThreadsRequested != 1) {  <<< delete check for NumItems

3. Use threadIndex to avoid nested parallelization:

  TaskGroup::TaskGroup() : Parallel(threadIndex == -1) {}

4. Add a sequential mode to TaskGroup that forces marked tasks to be executed on the same thread, in submission order.

What do you think about that approach? It looks like it could give a good level of checking (no threads other than those created by ThreadPoolExecutor are allowed to call getThreadIndex()) while also enabling more parallelization.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142318/new/

https://reviews.llvm.org/D142318


