[PATCH] D129089: [BOLT] Fix concurrent hash table modification in the instrumentation runtime

Mon Jan 2 17:38:20 PST 2023

michoecho added a comment.

> Also it looks like since visitIndCallCounter is called by forEachElement with mutex lock the CallFlowTable->get() would never be able to lock the mutex

But these are two different mutexes, aren't they? The lock taken by the `forEachElement` surrounding `visitIndCallCounter` is in `GlobalIndCallCounters[I]`. The lock in `CallFlowTable->get()` sits in `Ctx.CallFlowTable`. Unless you mean some other locks?

> I observe that one of the random GlobalIndCallCounters might become locked and when it comes to the writeIndirectCallProfile->forEachElement it becomes deadlocked.

It seems that `GlobalIndCallCounters` are only locked in `__bolt_instr_data_dump --> writeIndirectCallProfile --> forEachElement`, in `__bolt_instr_clear_counters --> resetCounters`, and in `__bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get`. So they shouldn't lock each other out. Unless... `__bolt_instr_data_dump` is called from a signal handler.
I see that `Lock` takes care to mask signals while it's locked, but `TryLock` doesn't. So if a signal comes while a `TryLock` is held in `__bolt_instr_indirect_{tail}call --> instrumentIndirectCall --> incrementVal --> get`, and the signal handler calls `__bolt_instr_data_dump`, then `__bolt_instr_data_dump` will get deadlocked, just as you describe.

So, are your deadlocks happening in a signal handler? If yes, then case closed. (But TryLock should be fixed regardless of the answer.)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129089/new/

https://reviews.llvm.org/D129089