[PATCH] D59300: [clangd] Tune the fuzzy-matching algorithm

Tue Mar 19 02:48:24 PDT 2019

ilya-biryukov added a comment.

Hi Jan,

Sure! And sorry for posting these metrics for a while without a proper explanation (other patches have mentioned them too).

We simulate a bunch of completions at random points in random files from our internal codebase.  We assume the desired completion item is the one written in the code.
Intuitively, the higher it is ranked, the better. To measure this, we compute the following metrics:

- MRR <https://en.wikipedia.org/wiki/Mean_reciprocal_rank>
- `Top-N` - percentage of completions where the desired item is among the first `N` items.
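
To make the two metrics concrete, here is a minimal sketch of how they could be computed from the rank of the desired item in each simulated completion. The function names and the convention that rank `0` means "not in the results" are assumptions for illustration, not the actual evaluation code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers; not taken from the actual evaluation tooling.
// Each entry in `ranks` is the 1-based position of the desired item in the
// ranked completion list, or 0 if the item was not returned at all.

// Mean reciprocal rank: average of 1/rank, counting missing items as 0.
double meanReciprocalRank(const std::vector<std::size_t> &ranks) {
  if (ranks.empty())
    return 0.0;
  double sum = 0.0;
  for (std::size_t r : ranks)
    if (r != 0)
      sum += 1.0 / static_cast<double>(r);
  return sum / static_cast<double>(ranks.size());
}

// Top-N: percentage of completions whose desired item is within the first n.
double topNPercent(const std::vector<std::size_t> &ranks, std::size_t n) {
  assert(n > 0 && "n must be positive");
  if (ranks.empty())
    return 0.0;
  std::size_t hits = 0;
  for (std::size_t r : ranks)
    if (r != 0 && r <= n)
      ++hits;
  return 100.0 * static_cast<double>(hits) / static_cast<double>(ranks.size());
}
```

For example, with ranks `{1, 2, 0, 4}` the reciprocal ranks are 1, 0.5, 0 and 0.25, so MRR is 0.4375, and `Top-3` is 50% because two of the four desired items are within the first three results.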

We also independently calculate those metrics for interesting groups of completions:

- `OVERALL`. All completions.
- `INITIALISMS`. Completions where the query (what the user typed) matches the first characters of each segment in the desired completion item, e.g. `SI` or `SIC` for `SomeInterestingClass`.
- `EXPLICIT_MEMBER_ACCESS`. Desired completion item is a class member and the completion is in a member access expression, e.g. `vector().^push_back()`.
- `WANT_LOCAL`. Desired completion item is in the same file as the completion itself.
- `CROSS_NAMESPACE`. Simulated completion removes the namespace prefix, in addition to the identifier, e.g. we expect to complete `std::vector` not just `vector`.
- `WITH_EXPECTED_TYPE`. Only completions in a context where the expected type is available, e.g. `int* a = ^`.

For each of the picked positions in a file, we try to complete a prefix of the desired completion item of length up to `5` and the full identifier (except initialisms, more on them below).
E.g. for the following source code:

  int test() {
    std::vector<int> vec;
    vec.^push_back(10); // say, simulation runs here
  }

We would run the simulation for the following completions: `vec.^`, `vec.p^`, `vec.pu^`, `vec.pus^`, `vec.push^` and `vec.push_^`.
You can see the breakdown of the metrics for each of the prefix lengths in each of the completion groups.
Individual metrics for a fixed length of the prefix are written in the `Filter length 0-5` sections.
We also try completion with the full identifier (e.g. `vec.push_back^`); metrics for those are written in the `Full identifiers` section.
Aggregated metrics for all completions in a group are written in the `All measurements` section.
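
As a rough sketch of the scheme described above, the filter texts tried for one desired identifier could be generated like this. The helper name is hypothetical, and skipping the separate full-identifier query when the identifier is no longer than the maximum prefix is an assumption of this sketch, not something the real simulation is known to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: the filter texts tried for one desired identifier,
// i.e. its prefixes of length 0..maxPrefix followed by the full identifier.
std::vector<std::string> simulatedQueries(const std::string &identifier,
                                          std::size_t maxPrefix = 5) {
  assert(!identifier.empty() && "the desired item must have a name");
  std::vector<std::string> queries;
  std::size_t limit = std::min(maxPrefix, identifier.size());
  for (std::size_t len = 0; len <= limit; ++len)
    queries.push_back(identifier.substr(0, len));
  // Assumption: add the full identifier only if it is not already the
  // longest prefix generated above.
  if (identifier.size() > limit)
    queries.push_back(identifier);
  return queries;
}
```

For `push_back` this yields the empty filter, `p`, `pu`, `pus`, `push`, `push_` and finally `push_back`, matching the `Filter length 0-5` and `Full identifiers` sections.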

The "initialisms" group is special: for those we use the first characters of the segments inside the desired completion item rather than a prefix, e.g. `vec.p^`, `vec.pb^`.
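
To illustrate what "first characters of the segments" means, here is a simplified sketch. It assumes segments start at the beginning of the identifier, after an underscore, or at an uppercase letter that follows a lowercase one; the real segmentation rules in the fuzzy matcher may differ (e.g. for acronyms or digits), and the function name is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: collects the first character of every segment.
// Simplified segmentation rules, assumed for illustration only.
std::string segmentInitials(const std::string &identifier) {
  std::string initials;
  bool atSegmentStart = true; // true at the start and right after '_'
  char prev = '\0';
  for (char c : identifier) {
    if (c == '_') {
      atSegmentStart = true;
    } else {
      // A lowercase-to-uppercase transition also starts a new segment.
      bool caseBoundary = std::isupper(static_cast<unsigned char>(c)) &&
                          std::islower(static_cast<unsigned char>(prev));
      if (atSegmentStart || caseBoundary)
        initials.push_back(c);
      atSegmentStart = false;
    }
    prev = c;
  }
  return initials;
}
```

With these rules `push_back` gives `pb` and `SomeInterestingClass` gives `SIC`; the simulated queries are then the non-empty prefixes of that string, such as `p` and `pb`.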

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59300/new/

https://reviews.llvm.org/D59300