[cfe-dev] Adding indexing support to Clangd

Wed May 17 15:38:49 PDT 2017

Hi,

I’ve been thinking about how to add features to Clangd requiring an index, i.e. features that need a database containing information of all source files (Go to definition, find references, etc). I’d like to share with you my thoughts on how things are and what approaches could be taken before getting too deep into implementing something.

My understanding of the current Clang indexing facilities is as follow:
  - It is part of the libclang so it is meant to have a stable API which can be limiting because it does not expose the full Clang C/C++ API
  - It does not have persistence. I.e. the index cannot be reloaded from disk at a later time after it is built.
  - There is no header caching mechanism in order to allow faster reparsing when a source file changes but its included headers haven’t (a common occurrence during code editing).

-- Other indexing solutions --

I have done a very high level exploration of some other projects using Clang for indexing, you can find some notes here:
https://docs.google.com/document/d/1Z0pDZpUlhyRkw1yB9frVVeb_xgSb5PuXD0-aeUtKkpo/edit?usp=sharing
(Feel free to add your own notes if you’d like!)

>From what I gathered:
  - Some projects are using libclang, others use the Clang C++ APIs (AST) directly because of libclang limitations
  - Some projects have a custom index formats on disk, others use RDMS (PostgreSql, Sqlite) or other already available solutions (Elastic Search, etc).
  - I didn’t notice any projects based on Clang doing header caching, although perhaps I missed it. Ilya Biryukov wrote that JetBrains CLion does header caching but it’s not clear how they are stored or if it is using Clang. On the Eclipse CDT side, Clang is not used but there is header caching by storing the semantic model in the index (not plain AST). Then the source files can be parsed reusing that cached information.

Possible approach for Clangd:
  - I suggest using Clang libraries directly and not using libclang in order to not have any limitations. I think that using a stable API is not as important since Clangd resides in the same tree and is built and tested in coordination with Clang itself. The downside is that it will not reuse some of the work already done in libclang such as finding references, etc.
  - I think introducing a big dependency such as PostgreSql is not acceptable for Clangd (correct me if I’m wrong!). So a custom tailored file format for the index make more sense to me.
  - For header caching, I wonder if it is possible to reuse the precompiled header support in Clang. There would be some logic that would decide whether or not a precompiled header could be used depending on the preprocessing context (same macro definitions, etc).

-- The Index model --

Here’s what the data model could look like. For sure it’s partial and I expect it will evolve quite a bit. But it should be enough to communicate the general idea.

Index: Represents the model of the code base as a whole
  - IndexFile []

IndexFile: Represents an indexed file
  - URI path
  - IndexFile includedBy [ ]
  - IndexName [ ]
  - Last modified timestamp, checksum, etc

IndexName: Represents a declaration, definition, macro, etc
  - Source Location
  - IndexReference [ ]

IndexNameReference: Reference to a name
  - Source Location
  - Access (read, write, read/write)

IndexTypeName extends IndexName: represents classes, structs, etc
  - IndexTypeName bases [ ]

IndexFunctionName extends IndexName: represents functions, methods, etc
  - IndexFunctionName callers [ ]

Note that a lot of information probably doesn’t need to be modeled because a lot of information only needs to be available with an opened file for which we can have access to the full AST.

-- The persisted file format -- 

All elements in the model mentioned above could have a querying interface which could be implemented for an “in memory” database (simpler to debug and fast for small projects) and also for an on-disk database. From my experience in Eclipse CDT, the index on disk was stored in the form of a BTree which worked quite well. The BTree is made out of chunks. Chunks can be cached in memory and fetched from disk as required. Every information in the model is fetched from the database (from cache otherwise from disk). A similar approach could be used for Clangd if it’s deemed suitable.

In summary, I’m proposing for Clangd an index on disk stored in the form of a BTree that is populated using Clang’s C++ API (not libclang). Any concerns or input would be greatly appreciated. Just as a side note, I’m aware that this is just one line of thinking and others could be considered. 

Best regards,
Marc-André Laperle