[cfe-dev] Adding indexing support to Clangd

Thu Aug 10 15:11:18 PDT 2017

Apologies for the delay, Nathan (CC’ed) will be preparing a document describing how it works in detail, and will bring it soon to cfe-dev for discussion.

> On Aug 9, 2017, at 1:28 PM, Marc-André Laperle <marc-andre.laperle at ericsson.com> wrote:
> 
> No concrete plan yet but it's something we'd definitely like to have. I haven't seen in detail how the "indexing while building" is going to work, maybe I missed it? I assume there will be some kind of indexer-agnostic hook in the same spirit of the refactoring work that was proposed. I'll keep an eye on this and make sure our efforts are compatible. In any case, I think there will be enough benefit to warrant major rewrites/refactorings if needed.
> 
> Marc-André
> 
> From: Benjamin Kramer <benny.kra at gmail.com <mailto:benny.kra at gmail.com>>
> Sent: Wednesday, August 9, 2017 8:54:03 AM
> To: Marc-André Laperle
> Cc: via cfe-dev; zeratul976 at hotmail.com <mailto:zeratul976 at hotmail.com>; Doug Schaefer; Alex Lorenz; Duncan P. N. Exon Smith; Argyrios Kyrtzidis
> Subject: Re: [cfe-dev] Adding indexing support to Clangd
>  
> Awesome, indexing in clangd would be super cool. Do you have a plan on
> how to combine this with the "indexing while building" stuff Apple
> folks are going to contribute? That will be important when we want to
> use clangd with larger projects.
> 
> On Tue, Aug 8, 2017 at 7:52 PM, Marc-André Laperle via cfe-dev
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> > Hi!
> >
> >
> > I want to give a little update on the indexing prototype for Clangd I've
> > been working on. I wanted to share the actual code on Github before I went
> > on vacation but I ran out of time! Sorry about that!
> >
> >
> > Here's a short summary of the several components there now and what's in
> > progress:
> >
> >
> > --ClangdIndexStorage--
> >
> > malloc-like interface that allocates/frees data blocks of variable sizes on
> > disk. The blocks can contain ints, shorts, strings, pointers (i.e. file
> > offsets), etc. The data is cached in 4K pieces so that local and repeated
> > accesses are all done quickly, in memory.
> >
> > Clangd mallocs and writes its index model objects using this.
> >
> >
> > --BTree--
> >
> > A pretty classic BTree implementation. Used to speed up lookup (symbol
> > names, file names). It allocates its nodes using ClangdIndexStorage
> > therefore it is stored on disk. Keys are actually records in
> > ClangdIndexStorage so you can really think of the BTree as a collection of
> > sorted pointers (sorted according to a provided comparator).
> >
> >
> >
> >
> > The index model is not very rich yet but the idea is that lower level
> > building blocks (ClangdIndexStorage, BTree) will be there so that we can
> > start iterating.
> >
> >
> > --ClangdIndexFile--
> >
> > Path + information about inclusion in order to be able to represent an
> > include graph.
> >
> > The include graph is used to know which files need reindexing for example
> > when a header file changes and also which compilation database entry to use
> > when opening a header in the editor. The first source file including the
> > header file is used to look up the entry in the compilation database. This
> > will also be used for the "Include Browser" feature in the future.
> >
> >
> > --ClangdIndexSymbol--
> >
> > USR + location (pointer to a ClangdIndexFile + start/end offsets)
> >
> > This only represents definitions in source files for now. This is part of
> > the indexing-based feature to "Open Definition".
> >
> > This is likely to have more changes to represent declaration vs definition,
> > references (Find references feature), call graph, etc.
> >
> >
> > --ClangdIndex--
> >
> > Owns a BTree of ClangdIndexSymbol sorted by USR for fast lookup (used by
> > Open Definition for now). getSymbol(USR) returns a ClangdIndexSymbol.
> >
> > Also owns a BTree of ClangdIndexFile sorted by Path for fast lookup. As
> > explained above, used to find proper compilation database entry and to react
> > to header changes. getFile(Path) returns a ClangdIndexFile.
> >
> >
> > # Building the index
> >
> >
> > When Clangd gets the initialize request from the client, it is given a
> > rootURI. It starts indexing each source files under this root, creating
> > ClangdIndexFiles and ClangdIndexSymbols. This is done with the help of
> > index::IndexDataConsumer.
> >
> >
> > At the moment, the main covered scenarios are building the index from
> > scratch and adding files. Support for modifying files and removing them is
> > not fully implemented yet (but this is a must of course!).
> >
> >
> > In case you are wondering, there is no fancy caching of headers or preamble
> > used while indexing, so the process is quite slow. I have been focusing on
> > the model and persistence of the index versus the input (parsing). This will
> > have to be addressed too.
> >
> >
> > # Examples of using the index
> >
> >
> > When the user does the "Open Declaration" command, it retrieves the
> > ClangdIndexSymbol from the ClangdIndex using the USR at the requested offset
> > (sped up by the BTree). The location of the ClangdIndexSymbol (if found) is
> > then returned to the editor.
> >
> >
> > When the user opens a header file, it retrieves the ClangdIndexFile from the
> > ClangdIndex using the path of the header (sped up by the BTree). Then it
> > recursively finds which file includes it until there is no more, at this
> > point chances are that this is a source file. Use this source file path to
> > find a potential entry in the compilation database (so that we gets proper
> > compiler flags, etc).
> >
> >
> >
> >
> >
> > This is just to give you a taste of what I have in mind and what kind of
> > progress is being made. I'd like to have the "lower-level" parts ready for
> > review soon after I come back from vacation (Aug 24th). I was thinking that
> > ClangdIndexStorage and BTree can go in first as they are quite isolated and
> > unit-testable. The rest of the code will also be available on Github to show
> > more concrete usage of them if necessary.
> >
> >
> > Regards,
> >
> > Marc-André
> >
> > ________________________________
> > From: cfe-dev <cfe-dev-bounces at lists.llvm.org <mailto:cfe-dev-bounces at lists.llvm.org>> on behalf of Vladimir
> > Voskresensky via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>
> > Sent: Thursday, June 1, 2017 3:10:55 PM
> > To: cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> >
> > Subject: Re: [cfe-dev] Adding indexing support to Clangd
> >
> >
> >
> >
> > On 06/ 1/17 06:26 PM, Ilya Biryukov via cfe-dev wrote:
> >
> > Other IDEs do that very similarly to CDT, AFAIK. Compromising correctness,
> > but getting better performance.
> > Reusing modules would be nice, and I wonder if it could also be made
> > transparent to the users of the tool (i.e. we could have an option 'pretend
> > these headers are modules every time you encounter them')
> > I would expect that to break on most projects, though. Not sure if people
> > would be willing to use something that spits tons of errors on them.
> > Interesting direction for prototyping...
> >
> > As Doug mentioned, surprisingly the tricks with headers in the majority of
> > projects give pretty good results :-)
> >
> > In NetBeans we have similar to CDT headers caching approach.
> >
> > The only difference is that when we hit #include the second time we only
> > check if we can skip indexing,
> > But we always do "fair lightweight preprocessing" to keep fair context of
> > all possible inner #ifdef/#else/#define directives (because they might
> > affect the current file).
> > For that we use APT (Abstract Preprocessor Tree) per-file which is constant
> > for the file and is created once - similar to clang's PTH (Pre-Tokenized
> > headers).
> >
> > Visiting file's APT we can produce different output based on input
> > preprocessor state.
> > It can be visited in "light" mode or "produce tokens" mode, but it is always
> > gives correct result from the strict compiler point of view.
> > We also do indexing in parallel and the APT (being immutable) is easily
> > shared by index-visitors from all threads.
> > Btw stat cache is also reused from all indexing threads with appropriate
> > synchronizations.
> >
> > So in NetBeans we observe that using this tricks (which really looks like
> > multi-modules per header file) the majority of projects are in very good
> > accuracy + I can also confirm that it gives ~10x speedup.
> >
> > Hope it helps,
> > Vladimir.
> >
> >
> > On Thu, Jun 1, 2017 at 5:14 PM, David Blaikie <dblaikie at gmail.com <mailto:dblaikie at gmail.com>> wrote:
> >>
> >> Not sure this has already been discussed, but would it be
> >> practical/reasonable to use Clang's modules support for this? Might keep the
> >> implementation much simpler - and perhaps provide an extra incentive for
> >> users to modularize their build/code which would help their actual build
> >> tymes (& heck, parsed modules could even potentially be reused between
> >> indexer and final build - making apparent build times /really/ fast)
> >>
> >> On Thu, Jun 1, 2017 at 8:12 AM Doug Schaefer via cfe-dev
> >> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> >>>
> >>> I thought I’d chip in and describe Eclipse CDT’s strategy with header
> >>> caching. It’s actually a big cheat but the results have proven to be pretty
> >>> good.
> >>>
> >>> CDT’s hack actually starts in the preprocessor. If we see a header file
> >>> has already been indexed, we skip including it. At the back end, we
> >>> seamlessly use the index or the current symbol table when doing symbol
> >>> lookup. Symbols that get missed because we skipped header files get picked
> >>> up out of the index instead. We also do that in the preprocessor to look up
> >>> missing macros out of the index when doing macro substitution.
> >>>
> >>> The performance gains were about an order of magnitude and it magically
> >>> works most of the time with the main issue being header files that get
> >>> included multiple times affected by different macro values but the effects
> >>> of that haven’t been major.
> >>>
> >>> With clang being a real compiler, I had my doubts that you could even do
> >>> something like this without adding hooks in places the front-end gang might
> >>> not like. Love to be proven wrong. It really is very hard to keep up with
> >>> the evolving C++ standard and we could sure use the help clangd could offer.
> >>>
> >>> Hope that helps,
> >>> Doug.
> >>>
> >>> From: cfe-dev <cfe-dev-bounces at lists.llvm.org <mailto:cfe-dev-bounces at lists.llvm.org>> on behalf of Ilya Biryukov
> >>> via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>
> >>> Reply-To: Ilya Biryukov <ibiryukov at google.com <mailto:ibiryukov at google.com>>
> >>> Date: Thursday, June 1, 2017 at 10:52 AM
> >>> To: Vladimir Voskresensky <vladimir.voskresensky at oracle.com <mailto:vladimir.voskresensky at oracle.com>>
> >>> Cc: via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>>
> >>>
> >>> Subject: Re: [cfe-dev] Adding indexing support to Clangd
> >>>
> >>> Thanks for the insights, I think I get the gist of the idea with the
> >>> "module" PCH.
> >>> One question is: what if the system headers are included after the user
> >>> includes? Then we abandon the PCH cache and run the parsing from scratch,
> >>> right?
> >>>
> >>> FileSystemStatCache that is reused between compilation units? Sounds like
> >>> a low-hanging fruit for indexing, thanks.
> >>>
> >>> On Thu, Jun 1, 2017 at 11:52 AM, Vladimir Voskresensky
> >>> <vladimir.voskresensky at oracle.com <mailto:vladimir.voskresensky at oracle.com>> wrote:
> >>>>
> >>>> Hi Ilia,
> >>>>
> >>>> Sorry for the late reply.
> >>>> Unfortunately mentioned hacks were done long time ago and I couldn't
> >>>> find the changes at the first glance :-(
> >>>>
> >>>> But you can think about reusable chaned PCHs in the "module" way.
> >>>> Each system header is a module.
> >>>> There are special index_headers.c and index_headers.cpp files which
> >>>> includes all standard headers.
> >>>> These files are indexed first and create "module" per #include.
> >>>> Module is created once or several times if preprocessor contexts are
> >>>> very different like C vs. C++98 vs. C++14.
> >>>> Then reused.
> >>>> Of course it could compromise the accuracy, but for proof of concept was
> >>>> enough to see that expected indexing speed can be achieved theoretically.
> >>>>
> >>>> Btw, another hint: implementing FileSystemStatCache gave the next
> >>>> visible speedup. Of course need to carefully invalidate/update it when file
> >>>> was modified in IDE or externally.
> >>>> So, finally we got just 2x slowdown, but the accuracy of "real"
> >>>> compiler. And then as you know we have started Clank :-)
> >>>>
> >>>> Hope it helps,
> >>>> Vladimir.
> >>>>
> >>>>
> >>>> On 29.05.2017 11:58, Ilya Biryukov wrote:
> >>>>
> >>>> Hi Vladimir,
> >>>>
> >>>> Thanks for sharing your experience.
> >>>>
> >>>>> We did such measurements when evaluated clang as a technology to be
> >>>>> used in NetBeans C/C++, I don't remember the exact absolute numbers now, but
> >>>>> the conclusion was:
> >>>>>
> >>>>> to be on par with the existing NetBeans speed we have to use different
> >>>>> caching, otherwise it was like 10 times slower.
> >>>>
> >>>> It's a good reason to focus on that issue from the very start than.
> >>>> Would be nice to have some exact measurements, though. (i.e. on LLVM).
> >>>> Just to know how slow exactly was it.
> >>>>
> >>>>> +1. Btw, may be It is worth to set some expectations what is available
> >>>>> during and after initial index phase.
> >>>>> I.e. during initial phase you'd probably like to have navigation for
> >>>>> file opened in editor and can work in functions bodies.
> >>>>
> >>>> We definitely want diagnostics/completions for the currently open file
> >>>> to be available. Good point, we definitely want to explicitly name the
> >>>> available features in the docs/discussions.
> >>>>
> >>>>> As to initial indexing:
> >>>>> Using PTH (not PCH) gave significant speedup.
> >>>>>
> >>>>> Skipping bodies gave significant speedup, but you miss the references
> >>>>> and later have to reindex bodies on demand.
> >>>>> Using chainged PCH gave the next visible speedup.
> >>>>>
> >>>>> Of course we had to made some hacks for PCHs to be more often
> >>>>> "reusable" (comparing to strict compiler rule) and keep multiple versions.
> >>>>> In average 2: one for C and one for C++ parse context.
> >>>>> Also there is a difference between system headers and projects headers,
> >>>>> so systems' can be cached more aggressively.
> >>>>
> >>>> Is this work open-source? The interesting part is how to "reuse" the PCH
> >>>> for a header that's included in a different order.
> >>>> I.e. is there a way to reuse some cached information(PCH, or anything
> >>>> else) for <map> and <vector> when parsing these two files:
> >>>> ```
> >>>> // foo.cpp
> >>>> #include <vector>
> >>>> #include <map>
> >>>> ...
> >>>>
> >>>> // bar.cpp
> >>>> #include <map>
> >>>> #include <vector>
> >>>> ....
> >>>> ```
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Ilya Biryukov
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Regards,
> >>> Ilya Biryukov
> >>>
> >>> _______________________________________________
> >>> cfe-dev mailing list
> >>> cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> cfe-dev -- Clang Front End for LLVM Developers' List <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> lists.llvm.org <http://lists.llvm.org/>
> To see the collection of prior postings to the list, visit the cfe-dev Archives. Using cfe-dev: To post a message to all the list members, send email ...
> 
> >
> >
> >
> >
> > --
> > Regards,
> > Ilya Biryukov
> >
> >
> > _______________________________________________
> > cfe-dev mailing list
> > cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> cfe-dev -- Clang Front End for LLVM Developers' List <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> lists.llvm.org <http://lists.llvm.org/>
> To see the collection of prior postings to the list, visit the cfe-dev Archives. Using cfe-dev: To post a message to all the list members, send email ...
> 
> >
> >
> >
> > _______________________________________________
> > cfe-dev mailing list
> > cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> cfe-dev -- Clang Front End for LLVM Developers' List <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
> lists.llvm.org <http://lists.llvm.org/>
> To see the collection of prior postings to the list, visit the cfe-dev Archives. Using cfe-dev: To post a message to all the list members, send email ...
> 
> >

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170810/5cb7af5c/attachment.html>