[cfe-dev] source code database

James K. Lowden jklowden at schemamania.org
Tue Feb 28 20:29:18 PST 2012

The "open clang projects" page refers to some potential uses of clang
for tool-building.  A few of them require metadata from the
lexer or parser.  

I'm interested in creating a framework for searching and reporting on
large C++ code trees.  I wonder what work has already been done, and if
the information I want is currently available from the clang front
end.  I would begin by capturing the token metadata in SQLite, thereby
making them accessible to a variety of applications.  


Back when the VAX dinosaur was knee-high to a mammal, I used DEC's
Source Code Analyzer (SCA)[1].  To this day, I have never seen or heard
of anything as good.  ISTM clang could be used to create something
as good, or better.

What is "as good", and what would be better?  

SCA let the user:

1.  analyze arbitrary subsets of a source code tree
2.  dynamically restrict the range of queries on that subset
3.  distinguish among read, write, invoke, reference, and dereference
4.  define "interesting" cases for repeated use, including reports
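
To make the four capabilities concrete, here is a minimal sketch of how they
might map onto SQL, using Python's standard sqlite3 module.  The table `refs`,
its columns, the view `writes_to_count`, the helper `writes_under`, and the
sample rows are all hypothetical; none of this reflects SCA's or clang's
actual interfaces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refs (          -- hypothetical: one row per symbol occurrence
    symbol TEXT NOT NULL,    -- qualified name, e.g. 'A::count'
    kind   TEXT NOT NULL CHECK (kind IN
           ('read', 'write', 'invoke', 'reference', 'dereference')),
    file   TEXT NOT NULL,
    line   INTEGER NOT NULL
);
-- Capabilities 3 and 4: a query on one kind of access, saved for reuse.
CREATE VIEW writes_to_count AS
    SELECT file, line FROM refs
    WHERE symbol = 'A::count' AND kind = 'write';
""")

# Made-up sample data, standing in for what a collector would record.
sample_rows = [
    ("A::count", "read", "src/a.cpp", 10),
    ("A::count", "write", "src/a.cpp", 22),
    ("A::count", "write", "test/a_test.cpp", 7),
    ("A::run", "invoke", "src/main.cpp", 5),
]
conn.executemany("INSERT INTO refs VALUES (?, ?, ?, ?)", sample_rows)


def writes_under(prefix):
    """Capabilities 1 and 2: limit the saved query to files under a prefix."""
    cur = conn.execute(
        "SELECT file, line FROM writes_to_count "
        "WHERE file LIKE ? || '%' ORDER BY file, line",
        (prefix,),
    )
    return cur.fetchall()


print(writes_under("src/"))
```

The view plays the role of a stored "interesting" case; any scripting
language with an SQLite binding could reuse it.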

	Current Tools Fail

Microsoft's tool lacks all these features.  cscope has some of them,
but only for C.  (For example, cscope cannot search for a
destructor or anything with a scope operator.)  VS parses C++, but the
user cannot search for uses of e.g. operator<<.  

The free tools I've looked at don't really parse C++.  They
parse the nonlanguage "C/C++".  Consequently they cannot hope to
answer #3 above; they can't even distinguish between ::B and A::B.
They also lack any kind of scripting language, preventing #4 and
severely restricting the capability of #2.  

These problems are all answered by clang+SQL.  Or, might be, if clang
is up to the job.  

	Required Metadata

I'm sure the following is incomplete, and equally sure that it is more
comprehensive than what is available from any existing tool at any
price.  Is it covered by clang at present?  


For any token

1.  namespace
2.  enclosing class/struct
3.  const, static
4.  linkage
5.  public, protected, or private (or none)
6.  declare, define, or use
7.  translation unit (file) and line number

It should be possible to say in which lines of a file a given token
is visible.  
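
A sketch of one possible table for the per-token attributes above, again with
Python's sqlite3 module.  Every name here (`token`, its columns, the helper
`visible_at`) is an assumption made for illustration.  Modelling visibility
as a single line range per declaration is a deliberate simplification; real
C++ scope rules would need more than that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE token (                        -- hypothetical schema
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    namespace    TEXT,                      -- 1. NULL means global namespace
    enclosing    TEXT,                      -- 2. enclosing class/struct, if any
    is_const     INTEGER NOT NULL DEFAULT 0,  -- 3.
    is_static    INTEGER NOT NULL DEFAULT 0,  -- 3.
    linkage      TEXT,                      -- 4. e.g. 'external', 'internal'
    access       TEXT,                      -- 5. public/protected/private/NULL
    role         TEXT NOT NULL
                 CHECK (role IN ('declare', 'define', 'use')),  -- 6.
    file         TEXT NOT NULL,             -- 7.
    line         INTEGER NOT NULL,          -- 7.
    visible_from INTEGER,                   -- assumed: first line in scope
    visible_to   INTEGER                    -- assumed: last line in scope
);
""")

# Two declarations named B: one global (::B), one a member of A (A::B).
conn.executemany(
    "INSERT INTO token (name, enclosing, access, role, file, line,"
    " visible_from, visible_to) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("B", None, None, "declare", "a.cpp", 3, 3, 100),
        ("B", "A", "private", "declare", "a.cpp", 10, 10, 20),
    ],
)


def visible_at(file, line):
    """Return (enclosing, name) for declarations visible at file:line."""
    cur = conn.execute(
        "SELECT enclosing, name FROM token "
        "WHERE role = 'declare' AND file = ? "
        "AND ? BETWEEN visible_from AND visible_to "
        "ORDER BY id",
        (file, line),
    )
    return cur.fetchall()


print(visible_at("a.cpp", 15))
print(visible_at("a.cpp", 50))
```

Because the enclosing class is its own column, ::B and A::B remain distinct
rows even though they share a name.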

For types

1.  class, struct, or enum
2.  derived from
3.  derived how (public/protected/private)
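
If derivation is stored as one row per (derived, parent) pair, questions such
as "what inherits from X, directly or indirectly?" can be answered with a
recursive query.  The sketch below assumes hypothetical tables `type` and
`inheritance` and a helper `descendants`, with invented sample classes; it
also assumes an SQLite build with recursive common table expressions
(3.8.3 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE type (                         -- hypothetical schema
    name TEXT PRIMARY KEY,
    kind TEXT NOT NULL CHECK (kind IN ('class', 'struct', 'enum'))  -- 1.
);
CREATE TABLE inheritance (
    derived TEXT NOT NULL REFERENCES type(name),
    parent  TEXT NOT NULL REFERENCES type(name),                    -- 2.
    access  TEXT NOT NULL
            CHECK (access IN ('public', 'protected', 'private'))    -- 3.
);
""")

# Invented sample hierarchy: Square -> Polygon -> Shape.
conn.executemany(
    "INSERT INTO type VALUES (?, ?)",
    [("Shape", "class"), ("Polygon", "class"),
     ("Square", "struct"), ("Color", "enum")],
)
conn.executemany(
    "INSERT INTO inheritance VALUES (?, ?, ?)",
    [("Polygon", "Shape", "public"), ("Square", "Polygon", "private")],
)


def descendants(root):
    """Return (name, depth) for every type derived from root, at any depth."""
    cur = conn.execute(
        """
        WITH RECURSIVE d(name, depth) AS (
            SELECT derived, 1 FROM inheritance WHERE parent = ?
            UNION
            SELECT i.derived, d.depth + 1
            FROM inheritance AS i JOIN d ON i.parent = d.name
        )
        SELECT name, depth FROM d ORDER BY depth, name
        """,
        (root,),
    )
    return cur.fetchall()


print(descendants("Shape"))
```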

For typedefs, the above must be available for all components of the

For variables

1.  read, write, invoke, reference, and dereference
    (A variable may be invoked if it holds a pointer to a function.)
2.  type: class, struct, typedef, or builtin
3.  const, static, or automatic
4.  (overrides can be derived)
5.  for uses, discarded Koenig lookups

For functions

1.  for each parameter and return type, cf. "for variables", above
2.  invoke or reference
3.  (overrides can be derived)
4.  for invocations, discarded Koenig lookups  

For operators

1.  declare, define, reference, or invoke
2.  friendship (1 : many)
3.  for invocations, discarded Koenig lookups  

For the preprocessor

1.  define or use
2.  scope
3.  post-processing interpretation, as above


As I said, I would like to know if the above information is accessible
from the clang "kit" and what, if anything, has been undertaken in this
vicinity heretofore.  If clang can provide the information, the project
I have in mind -- of writing a tool to collect it and keep it in a
database -- is both useful and feasible.  

It's a big question, I know.  You can appreciate I'd want to know the
feasibility first, before diving in.  

Thank you for your time.  



P.S.  Prior to posting, I tried to read the mailing list archives.  I
must not be the first to notice they're almost impossible to read
because the text doesn't wrap in the browser.  
