[cfe-dev] [RFC] A C++ pseudo parser for tooling

Tue Nov 9 10:02:10 PST 2021

On Tue, Nov 9, 2021 at 5:42 PM Demi Marie Obenour via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> On 11/9/21 12:09 PM, Sam McCall wrote:
> > On Tue, Nov 9, 2021 at 4:38 PM Demi Marie Obenour via cfe-dev <
> > cfe-dev at lists.llvm.org> wrote:
> >
> >>>> * I think you are better off spending your time on optimizing the correct
> >>>> parser infrastructure.  I'm sure more can be done - particularly in terms
> >>>> of caching, persisting and reusing state (think like PCH and modules etc).
> >>>>
> >>> We have worked on projects over several years to improve these things (and
> >>> other aspects such as error-resilience). We agree there's more that can be
> >>> done, and will continue to work on this. We don't believe this approach
> >>> will get anywhere near a 100x latency improvement, which is what we're
> >>> looking for.
> >>
> >> What about pushing the state to a server?  Have a server that has the entire
> >> index, and keeps it up to date whenever a VCS commit is made to the main
> >> branch.
> >
> > We have this for clangd's index: https://clangd.llvm.org/guides/remote-index
> > It works great (try it out with LLVM!) but needing to deploy a server means
> > 90% of users won't ever touch it.
> >
> > (A shared repository of serialized ASTs *is* something we're considering in
> > a tightly controlled corp environment but the barriers are pretty huge:
> > size, security and it only works if everyone uses the same precise version
> > of the tool. And it only makes sense at all if you're sure you can download
> > 300MB in less than 30 seconds!)
> Indeed that is not going to be useful outside of an on-premises corporate
> environment with extremely fast network connectivity.  And the security
> considerations are stringent, especially since clangd is written in C++ and
> I am not sure it can be trusted with untrusted input.
>
> Could an index be persisted to disk, reloaded at startup, and incrementally
> changed as the user edits?  That would avoid having to have a daemon
> constantly running.
>

That persist-and-update indexer architecture is what most people already use
for codebases of less than a few million lines, which is almost all of them.
Only a handful of companies (Google, Bloomberg, etc.) have to deal with the
scalability issues of larger codebases.  In those outlier settings, I'd say a
common practice is to break off just the million or so lines around what
you're working on (i.e. it, its dependencies, and some dependents), and
index only that subpart.  Other centralized, shared tools regularly index
the checked-in state of the entire codebase, for when you need to venture
outside that subpart (e.g. to look up external refs).  Those shared tools
don't sync with your unchecked-in changes, but that is a small price to pay
compared to using an inferior approximate parser that is always getting
things wrong (go-to-definition, autocomplete, call graph, etc. regularly
failing).
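
To make the persisted, incrementally updated index concrete, here is a
minimal sketch in Python (chosen only for brevity).  Everything in it is
hypothetical: PersistentIndex, extract_symbols and the JSON layout are made
up for illustration and are not how clangd or any real indexer works, and
the regex is a toy stand-in for a real parser.

```python
import hashlib
import json
import os
import re

# Toy stand-in for a real parser (an assumption for illustration only):
# treats "name(...) {" as a definition of "name". It will misfire on things
# like "if (x) {"; a real indexer would use the full compiler front end.
_DEF_RE = re.compile(r"\b(\w+)\s*\([^;{}()]*\)\s*\{")


def extract_symbols(source):
    """Return the sorted, de-duplicated names the toy regex finds."""
    return sorted(set(_DEF_RE.findall(source)))


class PersistentIndex:
    """Hypothetical on-disk index: file name -> content hash + symbols."""

    def __init__(self, path):
        self.path = path
        self.files = {}
        # Reload the previous state at startup, if there is one.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.files = json.load(f)

    def update(self, name, source):
        """Reindex one file only if its content changed.

        Returns True if the file was reparsed, False if the cached entry
        was still valid.
        """
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        entry = self.files.get(name)
        if entry is not None and entry["hash"] == digest:
            return False
        self.files[name] = {"hash": digest, "symbols": extract_symbols(source)}
        return True

    def lookup(self, symbol):
        """Return the sorted names of files that define the symbol."""
        return sorted(n for n, e in self.files.items() if symbol in e["symbols"])

    def save(self):
        """Write to a temporary file, then rename, so a crash mid-write
        does not leave a truncated index behind."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump(self.files, f)
        os.replace(tmp, self.path)
```

The editor would call update() on each edited file and save() occasionally;
nothing needs to stay running between sessions, since the next startup
reloads the file and reparses only what changed.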