[cfe-dev] A "good" semantic tagging system?

Tue Jun 15 14:25:57 PDT 2010

This comes primarily from experiences with Emacs, but other editors/IDEs
are in roughly the same situation.  Most projects have several external
dependencies, and especially during development, we may be frequently
switching between several build configurations.  I usually have
something like

  foo/include                       # public headers
  foo/src                           # source tree
  foo/build-a                       # a build directory, object files and binaries go here
  foo/build-a/include/fooconfig.h   # written at configure time
  foo/build-{b,c,d,e,f}             # other builds with different external dependencies, precision, debugging

The build system of course knows about the different external
dependencies, but IDEs generally can't pick this up automatically.  I'm
not willing to use "managed build" features because the software has to
be much more portable than that typically permits.  Furthermore,
switching between build configurations usually takes a long time or
leaves stale semantic information.  In Emacs land, I've tried etags,
semantic, global, and cscope, but none of these was always correct
(etags just indexes symbols so it's always "incorrect"; semantic was
probably the most correct but also hopelessly slow on medium and large
projects).  They require manual setup for the different build
configurations, and compiler extensions (like those that appear in
emmintrin.h) usually cause problems.  I also tried the Xref trial, but
it's been spinning at
100% CPU for 5 hours on a 300 kLOC C project.  Experiments with Eclipse
CDT, NetBeans, KDevelop4, and Qt Creator showed many of the same
problems although multithreading helped hide some of the slowness
(Emacs' fatal flaw).

It occurred to me that Clang is supposed to be highly extensible, so
perhaps it would be possible to get it, through a command-line argument
or environment variable, to dump the semantic information into a SQLite
database during the build.  By using a separate database per
configuration, we would be able to quickly switch between build
configurations, the semantic data would always be at least as current as
the last build, and I'm reasonably confident, based on e.g. Firefox and
Chrome autocompletion, that the database would be able to respond to
queries acceptably fast even for large projects with many external
dependencies.
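
To make the idea concrete, here is a rough sketch of what such a database
and the editor-side queries might look like.  Everything in it is an
assumption for illustration: the `symbols` table, its columns, and the
helper names are hypothetical, and nothing here is an existing Clang
feature or output format.  It uses Python's standard sqlite3 module only
to show the shape of the queries.

```python
import sqlite3

# Hypothetical schema: one row per symbol occurrence seen during a build.
SCHEMA = """
CREATE TABLE IF NOT EXISTS symbols (
    name          TEXT    NOT NULL,
    kind          TEXT    NOT NULL,   -- e.g. 'function', 'struct', 'macro'
    file          TEXT    NOT NULL,
    line          INTEGER NOT NULL,
    is_definition INTEGER NOT NULL    -- 1 for definitions, 0 for declarations
);
CREATE INDEX IF NOT EXISTS symbols_by_name ON symbols (name);
"""


def open_db(path):
    """Open (or create) the per-configuration database at `path`."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def record_symbol(db, name, kind, file, line, is_definition):
    """What a compiler-side hook would call for each symbol it sees."""
    db.execute(
        "INSERT INTO symbols VALUES (?, ?, ?, ?, ?)",
        (name, kind, file, line, int(is_definition)),
    )


def find_definitions(db, name):
    """Jump-to-definition: every (file, line) where `name` is defined."""
    rows = db.execute(
        "SELECT file, line FROM symbols "
        "WHERE name = ? AND is_definition = 1 ORDER BY file, line",
        (name,),
    )
    return rows.fetchall()


def complete(db, prefix, limit=20):
    """Completion: distinct names starting with `prefix`.

    A range comparison is used instead of LIKE so that '_' and '%' in
    identifiers need no escaping and the name index can be used.
    """
    upper = prefix + chr(0x10FFFF)  # sorts after any name with this prefix
    rows = db.execute(
        "SELECT DISTINCT name FROM symbols "
        "WHERE name >= ? AND name < ? ORDER BY name LIMIT ?",
        (prefix, upper, limit),
    )
    return [name for (name,) in rows]
```

Switching build configurations would then amount to the editor opening a
different file (say, one database per build directory) rather than
re-indexing anything.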

Certainly this sort of thing has been discussed, and you may have better
ways to implement it, or perhaps even a working implementation.  But my
(mistaken?) understanding is that current thoughts on gathering this
information have been more intrusive and use/IDE specific.  I think there
is a strong case for something as unintrusive and generic as

  CLANG_SEMANTICDB=/path/to/semantic.sqlite make  # or tup/waf/scons/jam/etc
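
A sketch of why this can stay unintrusive: the hook only has to check one
environment variable and do nothing when it is unset, so ordinary builds
are unaffected.  CLANG_SEMANTICDB is the name proposed above, not
something Clang reads today, and `open_semantic_db` is a hypothetical
helper written in Python purely for illustration.

```python
import os
import sqlite3


def open_semantic_db(environ=os.environ, var="CLANG_SEMANTICDB"):
    """Return a connection to the semantic database, or None if disabled.

    `var` is the proposed (hypothetical) variable name.  When it is unset
    or empty, the build proceeds exactly as it would without the feature.
    """
    path = environ.get(var)
    if not path:
        return None
    # Parallel builds (make -j) mean several compiler processes may write
    # at once; a generous busy timeout lets writers wait for the lock
    # instead of failing immediately.
    return sqlite3.connect(path, timeout=30.0)
```

Because the variable is inherited by child processes, it would work the
same way under make, tup, waf, scons, or jam without touching the build
files.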

Is there some work in this direction that I could look at?  If not, does
anyone have implementation advice?

Jed