[Lldb-commits] [PATCH] D122975: parallelize module loading in DynamicLoaderPOSIXDYLD()
Pavel Labath via Phabricator via lldb-commits
lldb-commits at lists.llvm.org
Tue Apr 5 02:47:43 PDT 2022
labath added a comment.
In D122975#3426731 <https://reviews.llvm.org/D122975#3426731>, @llunak wrote:
> In D122975#3426575 <https://reviews.llvm.org/D122975#3426575>, @labath wrote:
>
>> I'd like to understand what is the precise thing that you're trying to optimise. If there is a single hot piece of code that we want to optimise, then we might be able to come up with a different approach (à la PrefetchModuleSpecs) to achieve the same thing.
>
> The scenario is a repeated lldb start with the index cache enabled (so everything necessary is already cached); the debugged app is a debug build of LibreOffice Calc (~100 medium-sized C++ libraries). 90% of the CPU time is spent in LoadFromCache() calls, because 70% of the CPU time is spent in ConstString (and that's after my ConstString optimizations). I don't think there's any way to make it faster other than parallelizing it. Parallelizing only LoadFromCache() should be enough, though, and I'd expect that to be limited enough to be safe even if the lldb codebase is otherwise not generally thread-safe. In this patch I parallelized at a higher level because it was simple to do there, and I don't know how to easily get at all the LoadFromCache() calls (especially the one called from SymbolFileDWARFDebugMap, which is relatively deep). If you know how to parallelize only that, that works for me too.
OK, got it. So, for this case, I think the best approach would be to extract and parallelize the `PreloadSymbols` calls. They are not deep (=> relatively easy to extract), optional (they are just a performance optimization, so nothing will break if they're skipped), and completely isolated (they only access data from the single module).
In fact, we already have a sort of existing place to do this kind of grouped module action. The reason that the `ModulesDidLoad` (line 621) call does not happen inside `GetOrCreateModule` is that we want to send just one load event instead of spamming the user with potentially hundreds of messages. I don't think it would be unreasonable to move the `PreloadSymbols` call from `GetOrCreateModule` to `ModulesDidLoad`.
Of course, we still need to figure out the parallelization story.
In D122975#3427243 <https://reviews.llvm.org/D122975#3427243>, @llunak wrote:
> In D122975#3427008 <https://reviews.llvm.org/D122975#3427008>, @clayborg wrote:
>
>> I had tried something similar with thread pools when trying to parallelize similar stuff. The solution I made was to have a global thread pool for the entire LLDB process, but the LLVM thread pool code then needed to be modified to handle different groups of work, where tasks could be added to a queue and users could then wait on that queue. The queues would be managed by the thread pool code, and could be either serial or concurrent. I never completed the patch, but I just wanted to pass along the ideas I had used. So instead of adding everything to a separate pool, the main thread pool could take queues. The code for your case above would look something like:
>
> I had a similar idea too, but I think it wouldn't work here in this threadpool-within-threadpool situation. If you limit the total number of threads, the "outer" tasks could allocate all the threads and would then deadlock on "inner" tasks waiting for threads (and if you don't limit the threads, there's no need to bother). That's why I came up with the semaphore idea: it would require threads to acquire slots, and "outer" tasks could temporarily release their slot when spawning "inner" tasks. (I didn't consider pending threads to be a problem, but if they are, then the proposed ThreadPool queues presumably could handle that somehow too.)
Pending/suspended threads are less of a problem than threads actively contending for CPU time, but still less than ideal. On my machine I could end up with over 2k threads. At 8MB of stack per thread, that's 16GB just for stacks (most of it probably unused, but still...). And some machines have more (sometimes a lot more) CPUs than I do.
Properly implementing thread pools is tricky, and I don't consider myself an expert, but I think that, instead of using semaphores, you could detect the case where `wait()` is being called from inside a thread pool thread, and then, instead of passively waiting for the tasks to finish, start eagerly evaluating them on the same thread.
I'm pretty sure I didn't come up with this idea, so it's possible that something like this is already implemented in llvm, and I got the idea from there.
> But if lldb code is not considered thread-safe to parallelize the module loading at this level, then this is all irrelevant. If it's required to parallelize only cache loading, then that removes the threadpool-within-threadpool situation. I don't know if I understand the code enough to try that though.
Yeah, I'd rather avoid parallelizing the logic that populates the target's module and load lists. Even if the result were always correct, the mere fact that the module lists (and the output of e.g. `image list`) would be in a nondeterministic order is not ideal.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D122975/new/
https://reviews.llvm.org/D122975