[all-commits] [llvm/llvm-project] 001e88: [clangd] Performance improvements and cleanup

Mon Apr 11 08:21:55 PDT 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 001e88ac83b5c3a4d4f4e61480953ebcabc82b88
      https://github.com/llvm/llvm-project/commit/001e88ac83b5c3a4d4f4e61480953ebcabc82b88
  Author: Kadir Cetinkaya <kadircet at google.com>
  Date:   2022-04-11 (Mon, 11 Apr 2022)

  Changed paths:
    M clang-tools-extra/clangd/index/SymbolCollector.cpp
    M clang-tools-extra/clangd/index/SymbolCollector.h
    M clang-tools-extra/clangd/index/SymbolID.cpp
    M clang-tools-extra/clangd/index/SymbolID.h
    M clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp

  Log Message:
  -----------
  [clangd] Performance improvements and cleanup

- Inline SymbolID hashing to header
- Don't collect references for symbols without a SymbolID
- Store referenced symbols, rather than separately storing decls and
  macros.
- Don't defer ref collection to end of translation unit
- Perform const_cast when updating reference counts (~0.5% saving)
- Introduce caching for getSymbolID in SymbolCollector. (~30% saving)
- Don't modify the SymbolSlab if there's no definition location
- Don't lex the whole file to deduce spelled tokens, just lex the
  relevant piece (~8%)
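
A minimal sketch of the getSymbolID caching item above, not the actual
SymbolCollector code. FakeDecl, FakeSymbolID and IDCache are hypothetical
stand-ins, and hashing a USR-like string is an assumption about how an ID
might be derived. The point is that the ID is computed once per
declaration, and a "no ID" result is cached as well, so such symbols can
be skipped cheaply on every later reference.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a declaration; not a clang type.
struct FakeDecl {
  std::string USR; // empty means "no stable ID can be derived"
};

using FakeSymbolID = std::uint64_t;

class IDCache {
public:
  // Returns the cached ID for D, computing it on first use.
  // A std::nullopt result is cached too, so failures are not recomputed.
  std::optional<FakeSymbolID> get(const FakeDecl *D) {
    assert(D && "caller must pass a non-null declaration");
    auto It = Cache.find(D);
    if (It != Cache.end())
      return It->second;
    ++Misses; // Only counted when the ID is actually computed.
    std::optional<FakeSymbolID> ID;
    if (!D->USR.empty())
      ID = std::hash<std::string>{}(D->USR);
    Cache.emplace(D, ID);
    return ID;
  }

  unsigned misses() const { return Misses; }

private:
  std::unordered_map<const FakeDecl *, std::optional<FakeSymbolID>> Cache;
  unsigned Misses = 0;
};
```

Keying the map on the declaration pointer assumes declarations outlive
the cache, which holds in this sketch but is an assumption in general.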

Overall this achieves ~38% reduction in time spent inside
SymbolCollector compared to baseline (on my machine :)).

I'd expect the last optimization to affect the dynamic index a lot more;
I was testing with clangd-indexer on the clangd subfolder of LLVM. As
clangd-indexer indexes a whole TU at once, we see almost every token
from every source file included in the TU (hence lexing full files and
lexing just the referenced tokens cost almost the same). During dynamic
indexing we mostly index main-file symbols, but we would still touch the
files defining/declaring those symbols and lex them completely for
nothing, rather than lexing just the token location.
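
To illustrate the difference, here is a rough sketch of lexing a single
identifier at a known offset instead of tokenizing the whole buffer. It
is not the clang Lexer: identifierAt is a hypothetical helper, and it
assumes plain ASCII identifiers with no line splices or trigraphs.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns the identifier starting at Offset in Buf,
// or an empty view if Offset is out of range or not at an identifier.
// Only the characters of that one token are inspected.
std::string_view identifierAt(std::string_view Buf, std::size_t Offset) {
  if (Offset >= Buf.size())
    return {};
  auto IsIdentChar = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  };
  std::size_t End = Offset;
  while (End < Buf.size() && IsIdentChar(Buf[End]))
    ++End;
  return Buf.substr(Offset, End - Offset);
}
```

The cost is proportional to the token length rather than the file
length, which is why the saving should be larger when only a few
locations per file are of interest.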

The last optimization is also a functional change (added test):
previously we used raw tokens from syntax::tokenize, which didn't
canonicalize trigraphs/newlines in identifiers, whereas
Lexer::getSpelling canonicalizes them.
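
As a simplified illustration of what canonicalizing means here, the
sketch below removes line splices (a backslash, or its trigraph form,
followed by a newline) from a raw identifier spelling. cleanSpelling is
a hypothetical helper, not Lexer::getSpelling, and it handles only this
one case; other trigraphs and "\r\n" line endings are ignored.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: drops "\<newline>" and "??/<newline>" splices so
// that an identifier split across lines compares equal to its plain form.
std::string cleanSpelling(std::string_view Raw) {
  std::string Out;
  for (std::size_t I = 0; I < Raw.size();) {
    std::size_t Len = 0; // length of a backslash or its trigraph at I
    if (Raw[I] == '\\')
      Len = 1;
    else if (Raw.substr(I, 3) == "?\?/") // "??/" written without a trigraph
      Len = 3;
    if (Len != 0 && I + Len < Raw.size() && Raw[I + Len] == '\n') {
      I += Len + 1; // skip the splice entirely
      continue;
    }
    Out.push_back(Raw[I]);
    ++I;
  }
  return Out;
}
```

With raw spellings, an identifier written across a line splice would not
match the name of the symbol it refers to; after cleaning, both compare
equal.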

Differential Revision: https://reviews.llvm.org/D122894