[cfe-dev] Python bindings overhaul

Fri Jan 27 14:41:07 PST 2012

I thought that the Python binding to libclang could use some love, so
I've been working on a number of enhancements to it.

You can find my work at [1]. Changes can be seen at [2]. Things are
still a work in progress and there are a number of outstanding issues
I'd like to address before I formally submit a patch. But, I think
things are ready for critique.

Major changes:

  * Enumerations split out into clang.cindex.enumerations module
  * Objects hold references to parent (i.e. worry-free memory management)
  * Support for Token API
  * More APIs supported across the board. Pretty much everything from
    libclang is supported except the high-level Indexing component.

Other improvements:

  * Python style improvements. pylint and pyflakes output is now
    tolerable and the code mostly follows Python best practices.
  * Improved documentation
  * Increased test coverage
  * Added test to ensure enumerations are up-to-date with Index.h (uses
    Token API)
  * Introduced decorator to cache/lazy load properties. This should cut
    down on C function calls.
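
A minimal sketch of what a cache/lazy-load property decorator can look
like. The names CachedProperty and FakeCursor are hypothetical and are
not taken from the patch; the "calls" counter stands in for a call into
libclang.

```python
class CachedProperty(object):
    """Non-data descriptor that computes a value once per instance."""

    def __init__(self, func):
        self.func = func
        self.__doc__ = func.__doc__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        value = self.func(instance)
        # Store the result in the instance dict under the same name.
        # Later lookups find it there and never reach this descriptor.
        instance.__dict__[self.func.__name__] = value
        return value


class FakeCursor(object):
    """Hypothetical stand-in for a binding object."""

    def __init__(self):
        self.calls = 0

    @CachedProperty
    def spelling(self):
        self.calls += 1  # stands in for an expensive C function call
        return "main"


cursor = FakeCursor()
first = cursor.spelling
second = cursor.spelling
```

Both reads return the same object, and the wrapped function runs only
once per instance.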

Despite the major changes, API compatibility is mostly preserved. I did
break some APIs, but I actively tried to break only things that I
thought were internal to the module. If external consumers were using
the things that changed, IMO they deserved to break because they weren't
using things in an intended manner. That was my intent, anyway. I haven't
thoroughly audited to ensure I stuck to this principle. If the lack of
changes to the existing tests is any indication, I think I did a good
job though.

That said, there are numerous API breaking changes I would like to make.
Here are some examples:

* Various APIs are using camel case instead of underscores for method
  names, breaking the Python convention.
* Diagnostic exposes iterable structures as properties. IMO properties
  should return the same object on every call. Currently, they return
  a different instance each time, which is confusing.
* Code completion has a different style from the rest of the module.

If someone gives me a green light, I'd love to have a go at it and
produce a consistent API.

Gregory Szorc
gregory.szorc at gmail.com

[1] https://github.com/indygreg/clang/tree/python_features/bindings/python
[2] https://github.com/indygreg/clang/compare/master...python_features

The remaining content in this message is essentially my developer notes
and explains why I did what I did. If you will be commenting, please
read it first.

------------------------------------------------------------------------

The patch is huge and touches pretty much the entire binding. Don't
attempt to grok the diff. Instead, just apply the patch and look at the
new state of the world.

I refactored enumeration definitions into a separate module for a few
reasons. First, it makes them centrally located. The cindex module is
already pretty large. Locating the enumerations within it was kinda
painful. Now that they are all in one module, they are easy to locate
and update, especially for core/C++ developers who just want to ensure
their new enumeration is defined without having to wade through a bunch
of Python. Second, having the enumerations in a separate file enforces a
separation of code and data. I think this is a good practice.

As part of the enumeration refactoring, I changed how they are
implemented. The old method of creating a sparse list isn't very
Pythonic. I changed things to use a dictionary indexed by numeric value.
Also, having a constructor have the side effect of registering an
enumeration didn't feel right to me. I feel the new way of having a
"register" static method is more proper. There is some setattr() magic
involved, but the class members are still registered, the pydoc output
is proper, and the API is preserved, so I think everyone should be happy.
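
A rough sketch of the pattern described above: a dictionary indexed by
numeric value, a "register" static method, and setattr() to expose each
member on the class. ExampleKind, from_id and the two registered values
are illustrative assumptions, not code from the patch.

```python
class ExampleKind(object):
    """Hypothetical enumeration class backed by a value -> member dict."""

    _by_value = {}

    def __init__(self, value, name):
        self.value = value
        self.name = name

    @staticmethod
    def register(value, name):
        """Create a member and record it; the constructor has no side effects."""
        if value in ExampleKind._by_value:
            raise ValueError('value %d already registered' % value)
        kind = ExampleKind(value, name)
        ExampleKind._by_value[value] = kind
        # Expose the member as a class attribute, e.g. ExampleKind.STRUCT_DECL.
        setattr(ExampleKind, name, kind)
        return kind

    @staticmethod
    def from_id(value):
        """Look up a member by the numeric value returned from C."""
        try:
            return ExampleKind._by_value[value]
        except KeyError:
            raise ValueError('unknown kind %d' % value)

    def __repr__(self):
        return 'ExampleKind.%s' % self.name


# Illustrative values only.
ExampleKind.register(1, 'UNEXPOSED_DECL')
ExampleKind.register(2, 'STRUCT_DECL')

looked_up = ExampleKind.from_id(2)
```

A dictionary lookup handles gaps in the numeric range without padding,
which is what the sparse list had to do.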

I refactored a number of the top-level classes so they no longer derive
from ctypes.Structure. Instead, the top-level data class derives from
__builtin__.object and contains an inner class which defines the CX*
structure. This does create an extra Python object for every instance,
but I feel the separation of logic and data makes more sense now. And,
there were FIXMEs in the source that implied this is the direction
someone wanted to take, so I went ahead and did it. Technically these
changes broke the API. However, I feel that anybody accessing things
that changed was using the module in a non-advertised manner and thus
breaking the API was acceptable. This was only likely being done in
places where the official bindings didn't fully support libclang. And,
since the patch pretty much fully supports libclang, they should be able
to use the new API rather easily.
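
To illustrate the layout, here is a small sketch of a plain object that
contains an inner class defining the C structure. ExampleLocation,
CXStruct and is_null are hypothetical names; the field layout is modeled
on CXSourceLocation, and is_null is a made-up helper, not libclang
behavior.

```python
import ctypes


class ExampleLocation(object):
    """Hypothetical top-level class: a plain object wrapping a C struct."""

    class CXStruct(ctypes.Structure):
        # Data only. All logic lives on the outer class.
        _fields_ = [
            ('ptr_data', ctypes.c_void_p * 2),
            ('int_data', ctypes.c_uint),
        ]

    def __init__(self, structure=None):
        if structure is None:
            # ctypes zero-initializes a freshly created structure.
            structure = ExampleLocation.CXStruct()
        self._struct = structure

    @property
    def struct(self):
        """The raw structure, for passing to C functions."""
        return self._struct

    def is_null(self):
        """Made-up helper: true when every field is zero/NULL."""
        s = self._struct
        return not s.ptr_data[0] and not s.ptr_data[1] and s.int_data == 0


null_location = ExampleLocation()

raw = ExampleLocation.CXStruct()
raw.int_data = 42
wrapped = ExampleLocation(raw)
```

The cost is the extra Python object per instance mentioned above; the
wrapper itself is no longer a ctypes.Structure.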

As part of the class refactoring, I vastly overhauled the __init__
methods. The new world provides many more, and more convenient, ways to
create objects through their constructors. It relies much more on
__init__ and much less on static class methods. See SourceLocation for
a good example. Again, I probably broke API compatibility with a few
constructors. And again, anything that broke was probably using the API
in an unintended manner.

As static object creation methods were deprecated in favor of __init__,
I changed some to emit a DeprecationWarning through warnings.warn. This
is easy enough to silence in clients, but it will require a code change.
This should be documented in the upgrade notes.
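
A short sketch of the deprecation approach, assuming a hypothetical
class ExamplePosition with an old static factory from_position. Neither
name nor signature is taken from the patch.

```python
import warnings


class ExamplePosition(object):
    """Hypothetical class whose constructor replaces a static factory."""

    def __init__(self, line, column):
        self.line = line
        self.column = column

    @staticmethod
    def from_position(line, column):
        """Old-style factory, kept for compatibility."""
        warnings.warn(
            'from_position() is deprecated; call ExamplePosition() directly',
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
        return ExamplePosition(line, column)


# Record warnings so the example can inspect what was emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    position = ExamplePosition.from_position(3, 7)
```

A client that cannot migrate yet can silence the warning with
warnings.filterwarnings('ignore', category=DeprecationWarning).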

I consolidated all the ctypes function prototype foo to a single
function. As part of this, I got rid of the extra separate module symbol
referencing the function. Now, consumers reference the functions
directly on the loaded library. I also reordered the function list
alphabetically. While the libclang Doxygen docs separate functions by
module, attempting to replicate this grouping in Python land was
cumbersome. For example, functions dealing primarily with cursors were
scattered over a number of chunks. The new world is sorted purely
alphabetically and removes all doubts about where to insert a new
function in the list.
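
A sketch of registering all prototypes from one alphabetically sorted
table in a single function. FUNCTION_TABLE, register_functions and the
stub classes are hypothetical; the three signatures follow Index.h. The
stub library lets the example run without libclang installed.

```python
import ctypes

# (name, argument types, return type), kept sorted by name.
FUNCTION_TABLE = [
    ('clang_createIndex', [ctypes.c_int, ctypes.c_int], ctypes.c_void_p),
    ('clang_defaultEditingTranslationUnitOptions', [], ctypes.c_uint),
    ('clang_disposeIndex', [ctypes.c_void_p], None),
]


def register_functions(library, table=FUNCTION_TABLE):
    """Set argtypes and restype on every listed function of the library."""
    for name, argtypes, restype in table:
        func = getattr(library, name)
        func.argtypes = argtypes
        func.restype = restype


class StubFunction(object):
    """Stand-in for a ctypes function pointer."""
    argtypes = None
    restype = ctypes.c_int


class StubLibrary(object):
    """Stand-in for a loaded library; creates functions on first access."""

    def __getattr__(self, name):
        func = StubFunction()
        setattr(self, name, func)  # cache it so later lookups return the same object
        return func


lib = StubLibrary()
register_functions(lib)
```

With the real library, the object returned by ctypes.CDLL would take the
place of StubLibrary, and callers would use lib.clang_createIndex
directly.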

Please spend extra time on the docstrings. I'm still not too
familiar with the C/C++ internals, so I might have made some factually
incorrect statements. I welcome enlightenment.

I tried to ensure that references to parent objects are created
everywhere they are necessary. While I'm confident things are better
than they were, I can't guarantee things are perfect. Please review the
reference tracking and ensure that all proper objects are tracked.
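
For reviewers, a small sketch of the kind of reference tracking meant
here: the child keeps a strong reference to its parent, so the parent
cannot be collected while the child is alive. ExampleTranslationUnit and
ExampleCursor are hypothetical stand-ins, not the classes in the patch.

```python
import gc
import weakref


class ExampleTranslationUnit(object):
    """Hypothetical parent object that owns the underlying C resource."""

    def __init__(self, name):
        self.name = name

    def cursor(self):
        return ExampleCursor(self)


class ExampleCursor(object):
    """Hypothetical child object whose data is only valid while the parent lives."""

    def __init__(self, parent):
        # Strong reference: keeps the parent alive as long as this cursor exists.
        self._parent = parent


tu = ExampleTranslationUnit('demo.c')
tu_ref = weakref.ref(tu)  # lets us observe the parent's lifetime
cursor = tu.cursor()

del tu
gc.collect()
alive_while_cursor_exists = tu_ref() is not None

del cursor
gc.collect()
alive_after_cursor_dropped = tu_ref() is not None
```

The parent outlives the caller's own reference to it and is released
only after the last child goes away.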

A lot of newly implemented functionality is currently lacking unit
tests. I haven't even verified all functions work, so it is possible
some of it is horribly broken. Unfortunately, some of the tests I just
don't know how to write, as they deal with concepts I'm not familiar with.