[cfe-dev] Adding indexing support to Clangd

Fri Sep 8 14:10:05 PDT 2017

> This sounds very like bdb.  Is there a reason that we’re reimplementing a large chunk of bdb, rather than just using it (or using something like sqlite for the index storage)?

I'm not very familiar with bdb, but perhaps it could work. I've seen 
LMDB mentioned a few times and the license seems compatible so I'll 
research this a bit. My initial reasons for not going with SQL 
databases/other libraries:
- From my CDT experience with the C/C++ index model, tables don't seem 
like a good match
- I was under the impression that adding a third party library in llvm 
repositories was not straightforward or desired. But perhaps a small one 
is OK (i.e. not MySQL, PostgreSQL, etc).
- With several options available, I thought doing something similar to 
what I knew worked well in CDT would be a safe way to go. With a similar 
data storage strategy, the index model could be also inspired from CDT 
and give similar features.
- At the LLVM developer meeting in March, there seemed to be an 
understanding among Clangd contributors that a "tailored" format was the 
way to go

I'm happy to consider other existing solutions if they make more sense. 
It's always good not to have to maintain additional code. I'll look a 
bit more into LMDB in the next days to have a better picture. One thing 
that would be good is if we can have "pointer values" that can point to 
other entries in the database (in case someone knows already).

> Clang shouldn't query by name when looking up references, instead USRs 
> should be used. The USRs deal with scopes correctly (Hence the U in USR). 

Yes, querying by USR seem to work well so far!

Regards,
Marc-André

On 2017-08-11 12:16 PM, Alex L wrote:
>
>
> On 11 August 2017 at 17:09, Doug Schaefer via cfe-dev 
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>
>     Sorry gang, Outlook isn’t treating this thread very well. I’ll
>     have to top post my response.
>
>     The CDT indexer and database were written in 2005 so the data is
>     long gone and we’re talking about different database technologies
>     anyway. All I remember was it was slower by multiple orders of
>     magnitude on writes. And that stopped my investigation right
>     there. At the time I was trying to get a full index of the Firefox
>     source down from an hour to a few minutes and incremental indexing
>     times down to less than a second (thanks to the header caching).
>     The DB was taking me in the other direction.
>
>     I’m just not sure architecting for simple lookups is the right
>     approach. Just consider probably the most common use case for
>     clangd, content assist. Given an offset into the file, determine
>     the context/scope of the content and a prefix. You then need to
>     find all matches for that prefix. That’s by no means a simple
>     lookup given all the declarations, type hierarchies, scope
>     hierarchies in play in that context. Finding references is another
>     good one that also requires scope analysis to make sure you’re
>     actually returning references to that exact definition and not
>     just something that happens to have the same name. The CDT index
>     has links back and forth between these things.
>
>
> Clang shouldn't query by name when looking up references, instead USRs 
> should be used. The USRs deal with scopes correctly (Hence the U in USR).
>
>     Again, I’m just relaying my experience haven built one of these
>     things before. And it was a long time ago. All avenues are worth
>     exploring.
>
>     Doug.
>
>     *From:*Manuel Klimek [mailto:klimek at google.com
>     <mailto:klimek at google.com>]
>     *Sent:* Friday, August 11, 2017 10:59 AM
>
>
>     *To:* Doug Schaefer <dschaefer at blackberry.com
>     <mailto:dschaefer at blackberry.com>>; David Chisnall
>     <David.Chisnall at cl.cam.ac.uk
>     <mailto:David.Chisnall at cl.cam.ac.uk>>; Marc-André Laperle
>     <marc-andre.laperle at ericsson.com
>     <mailto:marc-andre.laperle at ericsson.com>>
>     *Cc:* via cfe-dev <cfe-dev at lists.llvm.org
>     <mailto:cfe-dev at lists.llvm.org>>
>     *Subject:* Re: [cfe-dev] Adding indexing support to Clangd
>
>     On Fri, Aug 11, 2017 at 4:50 PM Doug Schaefer
>     <dschaefer at blackberry.com <mailto:dschaefer at blackberry.com>> wrote:
>
>         *From:*Manuel Klimek [mailto:klimek at google.com
>         <mailto:klimek at google.com>]
>         *Sent:* Friday, August 11, 2017 4:54 AM
>         *To:* Doug Schaefer <dschaefer at blackberry.com
>         <mailto:dschaefer at blackberry.com>>; David Chisnall
>         <David.Chisnall at cl.cam.ac.uk
>         <mailto:David.Chisnall at cl.cam.ac.uk>>; Marc-André Laperle
>         <marc-andre.laperle at ericsson.com
>         <mailto:marc-andre.laperle at ericsson.com>>
>         *Cc:* via cfe-dev <cfe-dev at lists.llvm.org
>         <mailto:cfe-dev at lists.llvm.org>>
>
>
>         *Subject:* Re: [cfe-dev] Adding indexing support to Clangd
>
>         On Thu, Aug 10, 2017 at 5:57 PM Doug Schaefer via cfe-dev
>         <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>
>             > -----Original Message-----
>             > From: Dr D. Chisnall [mailto:dc552 at hermes.cam.ac.uk
>             <mailto:dc552 at hermes.cam.ac.uk>] On Behalf Of David
>             > Chisnall
>             > Sent: Thursday, August 10, 2017 6:10 AM
>             > To: Marc-André Laperle <marc-andre.laperle at ericsson.com
>             <mailto:marc-andre.laperle at ericsson.com>>
>             > Cc: via cfe-dev <cfe-dev at lists.llvm.org
>             <mailto:cfe-dev at lists.llvm.org>>; zeratul976 at hotmail.com
>             <mailto:zeratul976 at hotmail.com>; Doug
>             > Schaefer <dschaefer at blackberry.com
>             <mailto:dschaefer at blackberry.com>>
>             > Subject: Re: [cfe-dev] Adding indexing support to Clangd
>             >
>             > On 8 Aug 2017, at 18:52, Marc-André Laperle via cfe-dev
>             <cfe-
>             > dev at lists.llvm.org <mailto:dev at lists.llvm.org>> wrote:
>             > >
>             > > --ClangdIndexStorage--
>             > > malloc-like interface that allocates/frees data blocks
>             of variable sizes on
>             > disk. The blocks can contain ints, shorts, strings,
>             pointers (i.e. file offsets),
>             > etc. The data is cached in 4K pieces so that local and
>             repeated accesses are all
>             > done quickly, in memory.
>             > > Clangd mallocs and writes its index model objects
>             using this.
>             > >
>             > > --BTree--
>             > > A pretty classic BTree implementation. Used to speed
>             up lookup (symbol
>             > names, file names). It allocates its nodes using
>             ClangdIndexStorage therefore
>             > it is stored on disk. Keys are actually records in
>             ClangdIndexStorage so you
>             > can really think of the BTree as a collection of sorted
>             pointers (sorted
>             > according to a provided comparator).
>             >
>             > This sounds very like bdb.  Is there a reason that we’re
>             reimplementing a
>             > large chunk of bdb, rather than just using it (or using
>             something like sqlite for
>             > the index storage)?
>
>             It looks like Marc-Andre is following our work on the
>             Eclipse CDT indexer. We found databases to be just too slow.
>
>         That is somewhat surprising. Which db's did you benchmark against?
>
>         This was many years ago and we were trying Apache Derby (since
>         Eclipse is written in Java). And at the time the biggest issue
>         was write performance totally destroying any gains we made
>         with header caching. But that may have just been Derby.
>
>         Though I think the bigger issue is that the language model is
>         a graph which doesn’t fit naturally in tables. You’ll end up
>         spending more time fighting that than building out your own
>         data structures. But that’s just my experience.
>
>     Yea, the language model not fitting into tables well does match my
>     intuition, but for simple queries, I'd have expected a db to be
>     fast enough with room to spare. Would be curious about more
>     details / numbers.
>
>             Also, forcing the C++ language model to fit into a bunch
>             of tables or key/value pairs just didn't make that much
>             sense in the end. The data is very complex.
>
>             Instead it's better to think of this as a big graph data
>             structure that just happens to be backed by memory mapped
>             file store. Clients generally deal directly with pointers
>             into that data structure and follow pointers in the graph.
>             The BTrees help speed up that navigation at various nodes
>             in the graph. It's super fast.
>
>             Doug.
>             _______________________________________________
>             cfe-dev mailing list
>             cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>             http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>             <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>
>
>     _______________________________________________
>     cfe-dev mailing list
>     cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170908/b4372d7c/attachment.html>