[cfe-dev] CompilationDatabase support added to libclang and python binding

Mon Jun 11 04:06:22 PDT 2012

On Mon, Jun 11, 2012 at 10:32 AM, Arnaud de Grandmaison
<arnaud.allarddegrandmaison at parrot.com> wrote:
> Hi Manuel,
>
> I am done adding the CompilationDatabase functionality to libclang
> (first patch)
> and the python binding (second patch).
>
> With respect to our last email exchange :
>  - comments have been modified to be futureproof with respect to
> database actual format.
>  - the python binding is using exceptions for operations which have failed
>  - tests have been added for the libclang part as well as the python binding
>
> The tests are making an assumption I am not totally happy with : in case
> a file has several CompileCommands, the tests expect to get the
> CompileCommands in the same order they appear in the database.
>
> In c-index-test, I have added a 'dirname' function for use on Windows
> platforms, assuming that if 'basename' is not available, than 'dirname'
> would not. I do not know if this is necessary, and I have not tested it
> as I have no windows platform available.

Yay - new functionality! Overall a great patch. I can't wait to see
this integrated.

I do have a number of comments though.

I am lacking clout on the C front, so take that feedback with a grain
of salt. I'm a maintainer of the Python bindings though, so that must
mean something.

I've tried to fight this battle before and I'll say it again: I don't
like the existing naming convention for the libclang APIs.
Specifically, there isn't a consistent one. And, this patch continues
that trend (not your fault - you were basing it on existing code,
which is the right thing to do).

Let's look at the new API (in source order):

CXString clang_tooling_CompileCommand_getDirectory(CXCompileCommand);
unsigned clang_tooling_CompileCommand_getNumArgs(CXCompileCommand);
CXString clang_tooling_CompileCommand_getArg(CXCompileCommand, unsigned i);
void clang_tooling_disposeCompilationDb(CXCompilationDb);
CXCompilationDb clang_tooling_getCompilationDb_fromDirectory(const
char *buildDir);
void clang_tooling_disposeCompileCommands(CXCompileCommands);
unsigned clang_tooling_CompileCommands_getSize(CXCompileCommands);
CXCompileCommand
clang_tooling_CompileCommands_getCommand(CXCompileCommands, unsigned
i);
CXCompileCommands clang_tooling_getCompileCommands(CXCompilationDb
CDb, const char *completeFileName);

First, some nits. I would change the source order of the declarations
so related APIs are grouped together then sorted from highest to
lowest level. Start with all the functions operating on
CXCompilationDb. Then define everything operating on
CXCompileCommands. Finally CXCompileCommand. This will increase
readability, IMO. Also, always place the "constructors" and
"destructors" next to each other and reference the other in the
doxygen blocks. This increases the chances that the destructors are
called appropriately.

The use of "Db" rather than "DB" looks unpleasant to me. I think you
should capitalize both letters.

Everything is prefixed with "clang_tooling_." The "tooling" bit is new
to libclang. I personally like prefixing this way as it draws a clear
line between underlying modules. But, I know others feel differently
and may argue for its elimination. If we decide to keep it, perhaps a
new header, Tooling.h, would be the appropriate place to declare these
APIs since they are clearly separate from "index." This would extend
to the Python API as well.

I really like your general naming convention of
clang_tooling_<thing>_<action>. IMO, this is a proper naming
convention. Related functions are naturally grouped together and the
name communicates what they operate on and the action being performed.
Unfortunately, the rest of the libclang API isn't like this.
Furthermore, your patch isn't consistent in this style. e.g.
"clang_tooling_disposeCompilationDb" should be
"clang_tooling_CompilationDb_dispose" and
"clang_tooling_getCompilationDb_fromDirectory" should be
"clang_tooling_CompilationDb_getFromDirectory" or even
"clang_tooling_CompilationDb_fromDirectory," etc. Since the existing
API doesn't use my preferred naming convention, I guess this means you
don't have to adopt it. Whatever you end up doing, please make it
consistent because an inconsistent convention is worse than a
consistently "bad" one.

I wonder if a clang_tooling_CompileCommand_getArgs(CXCompileCommand,
CXString **, size_t *size) (or similar) would be helpful. This API
would allocate a new array of CXString or CXString* in one go instead
of requiring a function call for each argument. Or, you could have the
caller pre-allocate the array and the function would fill. Semantics.
I think the majority of users are interested in most or all arguments
and this API would save some function call overhead (which could be
noticeable in places like the Python bindings). I can make the same
argument for having a
clang_tooling_CompileCommands_getCommands(CXCompileCommands,
CXCompileCommand **, size_t *).

I really don't like the magic of compilation databases being
constructed from a directory with the auto-discovery of
"compile_commands.json." I know this is how the underlying API works,
so this isn't the fault of this patch. I think the underlying API and
libclang should be supplemented with the ability to construct the
database from an arbitrary file or possibly even a buffer with an
associated compilation database type (CXCompilationDBType_JSON).

While we're on the subject of limitations in the underlying API, the
lack of the ability to extract *all* compilation commands is a serious
downer. I can think of a number of use cases where one would crawl all
compilation databases and operate on all commands. Are those users
expected to implement their own parsers for the compilation database
and not go through the tooling/libclang API?

Whew, that was a lot. Again, I don't have clout in C land, so don't go
integrating my feedback before others have time to rebuke.

On to the Python!

Nit: Include space after comma between function arguments. This is
missing throughout.

Nit: Again, I like "CompilationDB" not "CompilationDb." I may like
"CompilationDatabase" even more.

Nit: I like wrapping the method summary on the same line as """. i.e.
don't waste lines.

> @property
> def getDirectory(self):

This produces code that looks like:

> dir = command.getDirectory

This looks weird. To most Python programmers, it looks like you are
retrieving the underlying method. Rename your properties so they are
nouns, not verbs. e.g. "directory," "commandLine."

For the CompileCommand arguments, I would make it a "arguments"
property or a "getArguments" method. To me, "command line" implies the
full string, not a list of strings.

On a related note, there is no need to create inner iterator classes.
Instead, just use yield, which does all the magic for you. e.g.

> def getArguments(self):
>     length = CompileCommand_getNumArgs(self.cl)
>     for i in xrange(length):
>         yield CompileCommand_getArg(self.cl, i)

Of course, you lose the ability to do efficient len() and subscript
access. Callers can gain this back by doing list(foo.getArguments()),
etc. But, that buffers everything, which could have performance
side-effects. That could be a problem for cursor iteration. I don't
think it is a problem for compilation databases. Although, it /might/
be if we ever expose an API to retrieve all CompileCommands. The good
news is you can swap in a custom iterator later and the API is
backwards-compatible (yield returns an iterator magically under the
hood). My vote is to optimize for fewer lines of code now. Complexity
can always be added later if needed for performance.

The current Python API is currently inconsistent with its use of
iterators in terms of using properties or methods. I /think/ I like
methods more. But, I'm fine either way as long as things are
documented.

In CompilationDb.fromDirectory, the ptr validation can be done in a
from_result function registered on the ctypes library. I would prefer
this, as that ensures all call sites have the same behavior (nevermind
there will only likely be 1 call site).

+    try:
+        cmds = cdb.getCompileCommands('file_do_not_exist.cpp')
+    except KeyError:
+        assert True
+    else:
+        assert False

You can do this?! I guess you do learn something new every day!

+    cmdline = ' '.join(cmds[0].getCommandLine)
+    assert cmdline == 'clang++ -o project.o -c
/home/john.doe/MyProject/project.cpp'

Please compare lists instead. Every time you compare derived entities,
you introduce a point for subtle failures.

OK. That's a lot of feedback. I think I'm done :)

Again, great patch and thank you for putting the time into it. Despite
the length of this reply, things are looking good. Most of my comments
are cosmetic in nature, so they should be trivial to address. Again,
expect conflicting feedback on my C comments.

Gregory