[PATCH] D109632: [clang] de-duplicate methods from AST files

Fri Oct 22 13:42:05 PDT 2021

vsapsai added a comment.

We are targeting the use case where in impl.m we have

  @import ImmediateDep1;
  @import ImmediateDep2;
  ...
  @import ImmediateDepN;

and each of ImmediateDep has `@import SharedDep;`.

For simplicity we'll consider that we are working with a single selector as each of them is processed separately and independently. When clang encounters a selector, it wants to collect methods corresponding to this selector from all modules to a GlobalMethodPool. As in GlobalMethodPool  we store methods in a list, adding a new method requires traversing this list. It is possible to change the list to something else but this change isn't about that, in `O(L*X)` complexity we are reducing `X` (which can be ~100000) and not `L` (which is closer to ~1000).

In the baseline METHOD_POOL for each "ImmediateDep" contains methods from "SharedDep". When we add methods to the GlobalMethodPool, we try to add methods from all "ImmediateDep". As the result we iterate through GlobalMethodPool method list multiple time for each method from "SharedDep" as they are available in each "ImmediateDep".

Richard's idea is to put a DenseSet of encountered methods in front of GlobalMethodPool method list. This way duplicates from "SharedDep" can be detected and discarded quickly so we traverse a list only for each unique method and not for duplicates.

My idea is not to store any duplicates in METHOD_POOL. This is achieved by each module storing only their own methods and not storing any [transitively] imported methods. In turn, populating GlobalMethodPool requires traversing the full dependency graph once and loading methods from METHOD_POOL in corresponding modules compared to traversing only immediate dependencies as in the baseline.

I believe MultiOnDiskHashTable isn't really applicable to this performance issue. Even if we use it to store methods, it doesn't affect the amount of methods we process. And right now processing all the duplicates is the expensive part.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632