[cfe-dev] Announcing Crange

Tobias Grosser tobias at grosser.es
Fri May 9 07:36:13 PDT 2014


On 09/05/2014 14:08, Anurag wrote:
> Announcing Crange: https://github.com/crange/crange
>
> Summary
> -------
>
> Crange is a tool to index and cross-reference C/C++ source code. It
> can be used to generate a tags database that can help with:
>
> * Identifier definitions
> * Identifier declarations
> * References
> * Expressions
> * Operators
> * Symbols
> * Source range
>
> The source metadata collected by Crange can help with building tools
> to provide cross-referencing, syntax highlighting, code folding and
> deep source code search.
>
>
> Rationale
> ---------
>
> I was looking for tools that can extract and index identifiers present
> in C/C++ source code and can work with large code bases.
>
> Considering the amount of data Clang can generate while traversing
> very large C/C++ projects (like Linux), I decided against using a
> ctags/etags-style tags database. Crange uses an SQLite-based tags
> database to store identifiers and metadata, and uses SQLite's bulk
> insert capabilities wherever possible.
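
The bulk-insert approach can be sketched with Python's sqlite3 module. The table name and columns below are hypothetical stand-ins, not Crange's actual schema:

```python
import sqlite3

# In-memory database for illustration; crtags writes tags.db on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tags (name TEXT, kind TEXT, file TEXT, line INTEGER)"
)

# Hypothetical rows extracted from one translation unit.
rows = [
    ("device_create", "declaration", "include/linux/device.h", 1010),
    ("device_create", "reference", "drivers/base/core.c", 1764),
]

# executemany inside a single transaction is SQLite's bulk-insert path.
with conn:
    conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0])  # prints 2
```

Grouping many INSERTs into one transaction is what matters at this scale: SQLite commits once per transaction rather than once per row.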
>
> I've used Python's multiprocessing library to parallelize translation
> unit traversal and metadata extraction from identifiers. It's possible
> to control the number of jobs using the -j command line option.
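
The -j parallelism might look roughly like this multiprocessing.Pool sketch, where index_file is a hypothetical stand-in for the real per-file parser:

```python
from multiprocessing import Pool

def index_file(path):
    # Hypothetical stand-in: parse one translation unit and return
    # the tags extracted from it.
    return (path, len(path))

if __name__ == "__main__":
    files = ["fs/xfs/xfs_bmap_btree.c", "sound/soc/codecs/ak4641.h"]
    jobs = 2  # what the -j command line option would control
    pool = Pool(processes=jobs)
    try:
        results = pool.map(index_file, files)  # one task per file
    finally:
        pool.close()
        pool.join()
    print(results)
```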
>
>
> Usage example
> -------------
>
> Generating tags database for Linux 3.13.5
>
>    $ cd linux-3.13.5
>    $ crtags -v -j 32 .
>    Parsing fs/xfs/xfs_bmap_btree.c (count: 1)
>    Indexing fs/xfs/xfs_bmap_btree.c (nodes: 379, qsize: 0)
>    ...
>    Parsing sound/soc/codecs/ak4641.h (count: 34348)
>    Generating indexes
>
> This would create a new file named tags.db containing all the
> identified tags.
>
> Search all declarations for identifier named device_create
>
>    $ crange device_create
>
> Search all references for identifier named device_create
>
>    $ crange -r device_create
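>

At the SQL level, these two lookups presumably reduce to filtered SELECTs. A sketch against a hypothetical schema (not Crange's actual one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, kind TEXT, file TEXT, line INTEGER)")
conn.executemany("INSERT INTO tags VALUES (?, ?, ?, ?)", [
    ("device_create", "declaration", "include/linux/device.h", 1010),
    ("device_create", "reference", "drivers/base/core.c", 1764),
])

# Roughly what "crange device_create" would ask: declarations only.
decls = conn.execute(
    "SELECT file, line FROM tags WHERE name = ? AND kind = ?",
    ("device_create", "declaration"),
).fetchall()

# Roughly what "crange -r device_create" would ask: references.
refs = conn.execute(
    "SELECT file, line FROM tags WHERE name = ? AND kind = ?",
    ("device_create", "reference"),
).fetchall()

print(decls, refs)
```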
>
> Not all command line options are available yet (-b, -k, etc.), as
> the tool is still in development.
>
> Performance
> -----------
>
> Running crtags on the Linux kernel v3.13.5 sources (45K files,
> 614 MB) took a little less than 7 hours (415m10.974s) on a 32-CPU
> Xeon server with 16 GB of memory and 32 jobs. The generated tags.db
> file was 22 GB in size and contained 60,461,329 unique identifiers.
>
> Installation
> ------------
>
>    $ sudo python setup.py install
> or
>    $ sudo pip install crange

Yes, this looks interesting.

As Renato said, the run-time is critical. Some people may suggest 
implementing this in C/C++. However, starting with Python is probably a 
good choice to try this out and also to understand the performance 
implications.

Some ideas:

	- You may want to look at the compilation database support in
	  libclang to retrieve the set of files to process as well as
	  the corresponding command lines.
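
For reference, libclang exposes this through clang.cindex.CompilationDatabase, which reads a build directory's compile_commands.json (as emitted, e.g., by CMake). As a self-contained sketch of what that database holds, here is a minimal reader of the same JSON format:

```python
import json

def load_compile_commands(path):
    # compile_commands.json is a JSON array of entries, each naming a
    # translation unit plus the exact compiler invocation used for it.
    with open(path) as f:
        entries = json.load(f)
    return [(e["file"], e.get("command", ""), e["directory"])
            for e in entries]
```

Replaying the recorded command line when parsing each file gives an indexer the real include paths and defines instead of guessed ones.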

	- I wonder how quick querying the database is. If those
	  queries are fast (less than 50 ms) even for big databases,
	  this would be extremely interesting: you could, for
	  example, add missing includes in an editor as you type.
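
Whether such a latency budget holds is easy to measure. A sketch with sqlite3 and an index on a hypothetical name column:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT, kind TEXT, file TEXT, line INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?, ?, ?)",
        (("id%d" % i, "reference", "f.c", i) for i in range(100000)),
    )

# Without an index every name lookup scans the whole table; with one,
# point queries stay fast even as the database grows large.
conn.execute("CREATE INDEX idx_tags_name ON tags (name)")

start = time.time()
rows = conn.execute("SELECT * FROM tags WHERE name = ?", ("id99999",)).fetchall()
print("%d row(s) in %.2f ms" % (len(rows), (time.time() - start) * 1000.0))
```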


Ah, and it actually only works if I don't use this worker stuff, but 
apply this patch:

-        pool.map(worker, worker_params)
+
+        for p in worker_params:
+            worker(p)

Otherwise the process gets stuck (even on a single file) and I cannot 
abort it. Instead, I get:
^CProcess PoolWorker-1:
Traceback (most recent call last):
   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
     self.run()
   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python2.7/multiprocessing/pool.py", line 113, in worker
     result = (True, func(*args, **kwds))
   File "/home/grosser/Projects/crange/crange/bin/crtags", line 12, in dbkeeper
     ast = queue.get()
   File "<string>", line 2, in get
   File "/usr/lib/python2.7/multiprocessing/managers.py", line 759, in _callmethod
     kind, result = conn.recv()
KeyboardInterrupt
^CProcess PoolWorker-4:
Traceback (most recent call last):
   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
     self.run()
   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
     task = get()
   File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
     return recv()
KeyboardInterrupt
^CProcess PoolWorker-5:
Traceback (most recent call last):
   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
     self.run()
   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
     task = get()
   File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
     return recv()
KeyboardInterrupt
^CProcess PoolWorker-6:
Traceback (most recent call last):
   File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
     self.run()
   File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
     self._target(*self._args, **self._kwargs)
   File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
     task = get()
   File "/usr/lib/python2.7/multiprocessing/queues.py", line 376, in get
     return recv()
KeyboardInterrupt
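
Incidentally, pool.map being un-interruptible is a known Python 2 multiprocessing quirk: workers blocked on a queue swallow SIGINT. A commonly suggested workaround, sketched below with a hypothetical worker, is map_async with a timeout; it makes Ctrl-C work again, though it does not fix whatever deadlocks the queue itself:

```python
from multiprocessing import Pool

def worker(param):
    # Hypothetical stand-in for crtags' per-translation-unit worker.
    return param * 2

if __name__ == "__main__":
    pool = Pool(processes=4)
    try:
        # .get() with a timeout, unlike a bare pool.map(), can be
        # interrupted with Ctrl-C in Python 2.
        results = pool.map_async(worker, range(8)).get(timeout=3600)
    except KeyboardInterrupt:
        pool.terminate()
        raise
    else:
        pool.close()
    pool.join()
    print(results)
```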

Cheers,
Tobias
