[PATCH] D50517: [clangd] Generate incomplete trigrams for the Dex index

Eric Liu via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Aug 10 07:26:40 PDT 2018


ioeric added inline comments.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:33
+
+void insertChars(DenseSet<Token> &UniqueTrigrams, std::string Chars) {
+  const auto Trigram = Token(Token::Kind::Trigram, Chars);
----------------
This is probably neater as a lambda in `generateIdentifierTrigrams `, e.g.

```
auto add = [&](std::string Chars) {
  trigrams.insert(Token(Token::Kind::Trigram, Chars));
}
...
add("abc"); 
```


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:95
+
+      // Generate trigrams only if the first character is the segment start.
+      // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$".
----------------
Do you mean *bigrams* here?


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:96
+      // Generate trigrams only if the first character is the segment start.
+      // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$".
+      if (Roles[I] == Head) {
----------------
Should we also check `Roles[J] == Head `?

As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`).


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:119
 
-// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short
-// inputs (total segment length <3 characters).
+// FIXME(kbobyrev): Correctly handle cases like "u_". In this case, we would
+// like to match "u" only. This might also work better with word bounds for
----------------
Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identiifers like "u_*".

> In this case, we would like to match "u" only.
Why? If user types "_", I would expect it to be a meaning filter. 


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:141
+  // sufficient, generate a single incomplete trigram for query.
+  if (ValidSymbolsCount < 3) {
+    std::string Symbols = {{END_MARKER, END_MARKER, END_MARKER}};
----------------
For queries like `__` or `_x`,  I think we can generate tokens "__$" or `_x$`. 


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:49
+/// result. Incomplete trigrams contain END_MARKER ('$') at the end. The result
+/// also contains bigrams which are either two subsequent Head symbols or Head
+/// and subsequent Tail. This means that for short queries Dex index would do
----------------
nit: the term "symbol" here is confusing. Do you mean "character"?


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:59
+/// * Bigram with the first two symbols (if availible), same logic applies.
+///
 /// Note: the returned list of trigrams does not have duplicates, if any trigram
----------------
I think the comment can be simplified a bit:
```
This also generates incomplete trigrams for short query scenarios:
  * Empty trigram: "$$$"
  * Unigram: the first character of the identifier.
  * Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
```


https://reviews.llvm.org/D50517





More information about the cfe-commits mailing list