[cfe-dev] RFC: Upstreaming index-while-building

Dmitri Gribenko via cfe-dev cfe-dev at lists.llvm.org
Fri Mar 1 14:41:31 PST 2019


Hi Argyrios,

On Fri, Mar 1, 2019 at 5:36 PM Argyrios Kyrtzidis <akyrtzi at gmail.com> wrote:
>
> Hi Dmitri,
>
> Could you clarify, it is my impression that clangd is using the same indexing symbol generation mechanism as what IWB (index-while-building) is using as source (the AST visitation of lib/Index and related index consumer). I assume clangd is using that as source of index symbols to process and then generate its higher-level data structures, is this correct ?

Yes, I think it uses the same index consumer.

clangd stores a "static index" on disk.  The static index can be generated
either by a standalone indexing tool
(clang-tools-extra/clangd/indexer/IndexerMain.cpp), or by clangd
itself, when it is started with the `-background-index` command line
option.  The static index is stored on disk per source file.  For
example, if we have lib.h, and foo.cpp, bar.cpp both include lib.h,
the static index will also have three files,
`.clangd/index/{lib.h.$HASH,foo.cpp.$HASH,bar.cpp.$HASH}`.  $HASH is
the hash of the file contents.  Indexing information about lib.h is
not emitted into index files for foo.cpp and bar.cpp.  The static
index is generated from each TU in parallel, using all available cores
-- just like IWB.  The first indexing action that indexes a TU that
uses a certain header writes the indexing information for that
header.
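
To make the layout above concrete, here is a minimal sketch of the
per-file shard naming and the "first indexing action writes the header"
rule.  It is not clangd's actual code: SourceFile, ShardStore, shardName
and indexTU are hypothetical names, std::hash is only a placeholder for
the real content hash, and an in-memory map stands in for the
`.clangd/index/` directory.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of a source file: its name and its contents.
struct SourceFile {
  std::string Name;
  std::string Contents;
};

// Shard name in the "<file>.<hash>" shape described above. std::hash is only
// a placeholder (assumption) for whatever content hash is really used.
std::string shardName(const SourceFile &File) {
  return File.Name + "." +
         std::to_string(std::hash<std::string>{}(File.Contents));
}

// Hypothetical in-memory stand-in for the on-disk shard directory.
struct ShardStore {
  std::map<std::string, std::string> Shards; // shard name -> raw index data

  // Returns true if this call created the shard, false if it already existed.
  bool writeIfAbsent(const std::string &Name, const std::string &Data) {
    return Shards.emplace(Name, Data).second;
  }
};

// Indexes one TU. The main file and every header get their own shard;
// a header's data is never folded into the main file's shard.
// Returns how many shards this indexing action actually wrote.
std::size_t indexTU(ShardStore &Store, const SourceFile &Main,
                    const std::vector<SourceFile> &Headers) {
  std::size_t Written = 0;
  if (Store.writeIfAbsent(shardName(Main), "symbols of " + Main.Name))
    ++Written;
  for (const SourceFile &H : Headers)
    if (Store.writeIfAbsent(shardName(H), "symbols of " + H.Name))
      ++Written;
  return Written;
}
```

With lib.h, foo.cpp and bar.cpp as in the example, indexing foo.cpp
writes two shards, indexing bar.cpp afterwards writes only one, and the
store ends up with three shards.  The sketch is single-threaded; a
parallel version would need to synchronize the "write if absent" step.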

Each file with indexing information is more or less raw indexing data
scraped from the file.  See
clang-tools-extra/clangd/index/Serialization.h, struct IndexFileIn,
struct IndexFileOut.

clangd builds a merged index over the whole project only in memory.
Therefore, the data that clangd writes to disk is *semantically*
equivalent to what IWB can write.

> IWB aims to be essentially just an efficient serialization mechanism for that same data, to generate the same raw data during a build with minimal overhead. It purposefully doesn’t do any higher level processing of the symbols, e.g. anything that would include merging of index data across files, that would be a non-starter to do during building.

To be clear, I'm not proposing that IWB builds a merged index across
the whole project.  clangd does not even write a merged index to disk.
clangd's indexing information from LLVM+Clang+clang-tools-extra is
less than 100 MB on disk, and it can be quickly loaded during clangd
startup; after that, clangd builds a merged index in memory.
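
As a rough illustration of "per-file data on disk, merged index only in
memory", here is a minimal sketch.  It assumes a much-simplified shard
that maps a symbol name to the set of files mentioning it; Shard and
mergeShards are hypothetical names and do not reflect clangd's real
data structures.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical, much-simplified shard: symbol name -> files that mention it.
using Shard = std::map<std::string, std::set<std::string>>;

// Builds the merged, project-wide view from already-loaded per-file shards.
// The result exists only in memory and is never written back to disk.
Shard mergeShards(const std::vector<Shard> &Shards) {
  Shard Merged;
  for (const Shard &S : Shards)
    for (const auto &Entry : S)
      Merged[Entry.first].insert(Entry.second.begin(), Entry.second.end());
  return Merged;
}
```

The per-file shards stay untouched by the merge, which is why the
on-disk format can remain raw and non-merged.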

> The design is that IWB serializes the same data, as what lib/Index generates for a file, during a build and then a higher-level indexing mechanism can use that raw data as a source for more sophisticated processing (e.g. clangd’s data structures or a database for cross-file queries).
>
> What seems to me as a great thing to explore would be that clangd uses the raw data that IWB generates as a source of index symbols, so that it can take advantage of the data getting generated during a build and not have to create and process all the translation unit ASTs from the user’s project separately to create its data structures.
> What do you think, does this make sense ?

I think clangd and IWB are already very aligned on the high-level data
flow.  IWB is an optimization over the standalone indexing tool
(clang-tools-extra/clangd/indexer/IndexerMain.cpp) that allows
indexing information to be written out during the build instead of
having to run an extra tool.

What I'm asking is that IWB could use the same data format for the
per-file, non-merged, indexing information that clangd already uses in
the standalone indexing tool and in background indexing.

Dmitri

-- 
main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
(j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/


