[PATCH] D52808: [cland] Dex: fix/simplify trigram generation

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Oct 4 04:16:52 PDT 2018


sammccall marked an inline comment as done.
sammccall added a comment.

**TL;DR**: i'm no longer convinced of my conclusions for short-query, look at the proposal below.

In https://reviews.llvm.org/D52808#1254899, @ioeric wrote:

> > Generate more short-query trigrams, e.g. "AbcDefGhi" now yields "d" and "ag".
>
> I am concerned about the impact on the size of posting lists (we can measure) and retrieval quality by adding more incomplete trigrams.


There is indeed an impact here. Before: 22309480, After: 22531144.
About 1/3 of this is posting lists, so 1% to overall, 3% on posting lists.
I haven't measured quality (don't have a good way to do that for dex currently). It's true that the new results we're admitting are probably worse as they don't start with the right letter.

>> This is effectively required by LSP, having "ag" not match but "agh" match will lead to glitches due to client-side filtering.
> 
> It seems hard to make index behave the same as LSP clients, considering all the ranking signals we have; there can be other reasons that cause `having "ag" not match but "agh" match`.

No, there's only one such reason: we truncated the result list (the symbol matched, but it wasn't one of the best N).
This reason is accounted for by LSP: client-side filtering only occurs when there was no truncation.

> And it seems reasonable to assume that you would get better symbols as you type more characters (i.e. stronger relevance signal).

The problem is LSP clients are free to assume that the result list is complete (unless marked as incomplete) and therefore will never retrieve the better symbols.

>> Drop leading-punctuation short-query trigrams. Nice idea, but current implementation is broken (competes with non-punctuation short query trigrams).
> 
> Could you elaborate how this is broken? We should probably fix it instead of removing it. `__` not matching `__some_macro` sounds like a regression.

`fuzzyFind` considers `_ptr` to match `unique_ptr`, so it needs to consider `_` to match it too.

We will still match `__some_macro`, but the posting lists will match ~everything and then postfilter. It may make sense to do optimizations or tweak overall trigram generation for this case, but the way it's currently done seems unprincipled and a little broken.

---

OK, so after some thought on the tradeoffs here, what about this alternative design:

- for query length <3, we support really restrictive matches: exact prefix, or first char + next head char. We
- we get around the LSP restrictions by always marking such result sets as incomplete


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D52808





More information about the cfe-commits mailing list