[PATCH] D109632: [clang] de-duplicate methods from AST files

Mon Oct 25 11:13:59 PDT 2021

dexonsmith added a comment.

In D109632#3081580 <https://reviews.llvm.org/D109632#3081580>, @vsapsai wrote:

> We are targeting the use case where in impl.m we have
>
>   @import ImmediateDep1;
>   @import ImmediateDep2;
>   ...
>   @import ImmediateDepN;
>
> and each of ImmediateDep has `@import SharedDep;`.
>
> For simplicity we'll consider that we are working with a single selector as each of them is processed separately and independently. When clang encounters a selector, it wants to collect methods corresponding to this selector from all modules to a GlobalMethodPool. As in GlobalMethodPool  we store methods in a list, adding a new method requires traversing this list. It is possible to change the list to something else but this change isn't about that, in `O(L*X)` complexity we are reducing `X` (which can be ~100000) and not `L` (which is closer to ~1000).
>
> In the baseline METHOD_POOL for each "ImmediateDep" contains methods from "SharedDep". When we add methods to the GlobalMethodPool, we try to add methods from all "ImmediateDep". As the result we iterate through GlobalMethodPool method list multiple time for each method from "SharedDep" as they are available in each "ImmediateDep".
>
> Richard's idea is to put a DenseSet of encountered methods in front of GlobalMethodPool method list. This way duplicates from "SharedDep" can be detected and discarded quickly so we traverse a list only for each unique method and not for duplicates.
>
> My idea is not to store any duplicates in METHOD_POOL. This is achieved by each module storing only their own methods and not storing any [transitively] imported methods. In turn, populating GlobalMethodPool requires traversing the full dependency graph once and loading methods from METHOD_POOL in corresponding modules compared to traversing only immediate dependencies as in the baseline.

Thanks for the explanation; this was helpful.

What's the relative storage cost between the two proposals? I assume small or it would have been mentioned.

But another benefit of not double-storing transitively imported methods is that it makes the PCMs more independent, tacking slightly closer to "ImmediateDep1.pcm" being reproducible even when "SharedDep.pcm" adds a method to the global pool. This is a nice property if it's not too expensive. Looking at the numbers above, it doesn't look expensive; the relative performance for @rmaz's motivating use case seems pretty small.

@rmaz, will your goals be achieved by taking @vsapsai's approach? If so, I'm leaning slightly that way.

> I believe MultiOnDiskHashTable isn't really applicable to this performance issue. Even if we use it to store methods, it doesn't affect the amount of methods we process. And right now processing all the duplicates is the expensive part.

Thanks for explaining.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109632/new/

https://reviews.llvm.org/D109632