[PATCH] D50517: [clangd] Generate incomplete trigrams for the Dex index
Eric Liu via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Fri Aug 10 07:26:40 PDT 2018
ioeric added inline comments.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:33
+
+void insertChars(DenseSet<Token> &UniqueTrigrams, std::string Chars) {
+ const auto Trigram = Token(Token::Kind::Trigram, Chars);
----------------
This is probably neater as a lambda in `generateIdentifierTrigrams `, e.g.
```
auto add = [&](std::string Chars) {
trigrams.insert(Token(Token::Kind::Trigram, Chars));
}
...
add("abc");
```
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:95
+
+ // Generate trigrams only if the first character is the segment start.
+ // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$".
----------------
Do you mean *bigrams* here?
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:96
+ // Generate trigrams only if the first character is the segment start.
+ // Example: "StringStartsWith" would yield "st$", "ss$" but not "tr$".
+ if (Roles[I] == Head) {
----------------
Should we also check `Roles[J] == Head `?
As bigram posting lists would be significantly larger than those of trigrams, I would suggest being even more restrictive. For example, for "AppleBananaCat", the most common short queries would be "ap" and "ab" (for `AB`).
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:119
-// FIXME(kbobyrev): Similarly, to generateIdentifierTrigrams, this ignores short
-// inputs (total segment length <3 characters).
+// FIXME(kbobyrev): Correctly handle cases like "u_". In this case, we would
+// like to match "u" only. This might also work better with word bounds for
----------------
Couldn't we generate a bigram "u_$" in this case? I think we can assume prefix matching in this case, if we generate bigram "u_" for identiifers like "u_*".
> In this case, we would like to match "u" only.
Why? If user types "_", I would expect it to be a meaning filter.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:141
+ // sufficient, generate a single incomplete trigram for query.
+ if (ValidSymbolsCount < 3) {
+ std::string Symbols = {{END_MARKER, END_MARKER, END_MARKER}};
----------------
For queries like `__` or `_x`, I think we can generate tokens "__$" or `_x$`.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:49
+/// result. Incomplete trigrams contain END_MARKER ('$') at the end. The result
+/// also contains bigrams which are either two subsequent Head symbols or Head
+/// and subsequent Tail. This means that for short queries Dex index would do
----------------
nit: the term "symbol" here is confusing. Do you mean "character"?
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:59
+/// * Bigram with the first two symbols (if availible), same logic applies.
+///
/// Note: the returned list of trigrams does not have duplicates, if any trigram
----------------
I think the comment can be simplified a bit:
```
This also generates incomplete trigrams for short query scenarios:
* Empty trigram: "$$$"
* Unigram: the first character of the identifier.
* Bigrams: a 2-char prefix of the identifier and a bigram of the first two HEAD characters (if it exists).
```
https://reviews.llvm.org/D50517
More information about the cfe-commits
mailing list