[cfe-dev] [GSoC] "Improving Clang's AST source-fidelity" proposal

Wed Apr 6 07:11:19 PDT 2011

Hello everybody. This is my other proposal for the Google Summer of Code.
Since the student application deadline is very near, I've already submitted it
(and the other one about lambdas) to melange.
Comments and suggestions are really appreciated.

--
Title: Improving Clang's AST source-fidelity

1. Introduction:
Among the many strengths of clang, a most noticeable one is the excellent
support for client applications different from traditional compilers.
When compared to other systems, clang behaves spectacularly
better as far as source-code fidelity is concerned and is going to become
the reference choice for source-code based applications such as refactoring
tools, pretty printers and beautifiers, coding-style checkers, etc.

Most of this is a direct result of the accuracy with which the syntactic
structure of the parsed sources is represented in the clang AST:
different syntactic construct, even though semantically equivalent,
are provided with a precise AST representation that almost allows for
a faithful reconstruction of the original sources. Besides this, the AST
nodes encode accurate source location information to be used, e.g.,
when issuing diagnostic messages.

Even though clang does an excellent job, its AST representation
is not yet as accurate as it should be (i.e., not perfect).
Here is a list of current issues that can affect the precision of
source-code based applications using clang as front-end. This list of
issues has been obtained in strict cooperation with Roberto Bagnara,
Abramo Bagnara and Enea Zaffanella, who actively work with clang on
source-based applications and would be very happy to mentor this idea.

2. Issues:
*) Coherent encoding of expressions occurring in initializers.

The AST representation adopted by clang, besides encoding the syntactic
constructs explicitly occurring in the source code, also represents
those semantic constructs that are only implicitly occurring.
The classical examples are those of implicit type casts, default argument
expressions in C++ function calls, implicit declarations of special
member functions in C++ classes, etc. That is, the syntactic structures
get appropriately "dressed" by semantic info so that, for instance, type
information is always accurate.

For the special case of initialization lists, a somewhat different
approach is taken: clang makes a strong distinction between the syntactic
and the semantic forms of the initializer list. Even though the syntactic
form appropriately represents those initializer expressions that were
present in the source code, these may or may not be "dressed" with the
semantic info. On the other hand, the initializers in the semantic form
are always appropriately dressed, but here we typically see initializers
that were not actually present in the original sources.
The issue is discussed in the following thread:

http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html

We propose two changes to the handling of initializer lists.

The first change (which is of interest for source-code based applications)
is to modify the syntactic form of init list by requiring that all
expressions occurring in it are properly "dressed" by semantic info:
this will enhance the coherence of the AST representation for the targeted
applications.

The second change is an optimization of the representation of the
semantic form of initializer lists so as to increase node sharing and
hence greatly improve the memory footprint in special cases (again, see
http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html).
Besides improving AST node sharing, the idea is to also allow for
range designators inside the semantic form of init lists.
An example of code that currently behaves badly is the following:

int a[1000000] = { [0] = 1, [999999] = 1 };

---
*) Completing work on attributed types.

Currently, clang is sub-optimal when handling the following kinds
of declarations:

typedef __attribute__((mode(byte))) int small; // attribute disappears
small s1;
int __attribute__((mode(byte))) s2; // attribute is ignored

In the first case, the attribute is "consumed" during parsing, producing
a typedef referring to the semantic type 'signed char', with no indication
that, on the syntactic side, an attribute was used.
In the second case, the attribute is just ignored.

Recently, John McCall added new AST node AttributedType in the Type
hierarchy, in order to correctly model situations such as those above.

http://llvm.org/viewvc/llvm-project?view=rev&revision=122942

This new AST node is not yet integrated into the Parse/Sema routines.

---
*) AST representation for explicit instantiation of templates.

clang does not differentiate between the (syntactic) AST nodes declaring
an explicit instantiation of a template and the (semantic) instantiation
of the template. This can be observed by dumping the AST for the following
example:

template <typename T> struct s { };
extern template class s<double>;
extern template class s<int>;
extern template class s<int>;

obtaining something like:

template <typename T> struct s {};
struct s {};
struct s {};
class s;

Apart from minor pretty printing issues (such as the lack of the
"extern template" keywords and the lack of template argument lists),
we can see that the first instantiations for s<double> and s<int>
are provided with a class body, because they reuse the semantic
instantiation node. This also has the side effect, for instance,
of printing "struct" instead of "class" as the tag keyword.
Compare this with the last declaration, which is a re-declaration of s<int>
and is therefore provided with a new, syntactic node (in this case, the AST
structure is fine, we only need to improve the pretty printer).

---
*) Friend declarations

One of the problems with friend declarations, maybe also related to the
issue above, occurs when a template function that was declared to be a
friend of a class gets later instantiated, as in the following code:

template <class T>
int foo() { return 0; }

class S {
friend int foo<S>();
};

void f() {
foo<S>();
}

The instantiation of foo inside f() causes the instantiation of the very
same function declaration that is referred by the friend declaration
in class S. Hence, the friend declaration becomes a definition,
causing troubles to source-based code tools.
Rather, the instantiation should result in the production of a new,
non-syntactic function declaration (i.e., a re-declaration of the
function declared as friend).

Another issue with friends has to do with so-called FriendTemplateDecl
AST nodes. According to current documentation, this AST node is never
produced by Parse/Sema:

01929 /// template <typename T> class A {
01930 /// friend class MyVector<T>; // not a friend template
01931 /// template <typename U> friend class B; // not a friend template
01932 /// template <typename U> friend class Foo<T>::Nested; // friend template
01933 /// };
01934 /// NOTE: This class is not currently in use. All of the above
01935 /// will yield a FriendDecl, not a FriendTemplateDecl.

---
*) Missing TypeSourceInfo for exception specifiers in FunctionProtoTypeLoc.

As an example, the AST representation for the exception specifier
in the following function declaration

void foo() throw(int (*)[2+1]);

is mapped into the semantic type where the array size is computed:

void foo() throw(int (*)[3]);

---
*) Missing flag for the presence of (optional) 'template' keywords
in nested name specifiers (NestedNameSpecifierLoc).

When parsing the following chunk of code

namespace BAR {

template <class U> static int foo();

template <typename T>
void f() {
int (*g)();
g = &BAR::template foo<int>;
g = &BAR::template foo<T>;
}

} // namespace BAR

we obtain an AST representation with no info about the presence/absence
of the 'template' keyword.

---------
3. Why I'm interested in this project:
My main academic interest area is in the development of compilers, and for this reason
I've followed and used the LLVM and Clang project for the last two years.
Since I plan to graduate this year, the GSoC timeline fits quite well in my plans and
I would like this project to be my undergraduate thesis work.
This specific project has been proposed to me by Prof. Roberto Bagnara, from
University of Parma, which is very interested in this project and would be my
thesis co-relator.

4. How the project will be useful for LLVM:
Clang's AST code-fidelity is crucial for applications based on source code manipulation.
Having the most coherent and precise AST as possible is very important to reach Clang's
design goals.

5. Prior knowledge in compilers and LLVM:
As I said, in my academic life I've been interested primarily in compilers
implementation and design. For this reason, I'm following the Compilers course
at the University as part of my undergraduate degree in CS, and I've been interested
in LLVM and Clang projects for the last two years.
I've used the LLVM backend for a small personal project last year, and I've sent a couple
of trivial patches to Clang in 2009.

6. Academic, Industry and other experiences:
I'm an undergraduate student of the Computer Science course at the University of Udine, Italy
(http://www.uniud.it)

I've worked a lot with C++ in the last five years, including the development of two complex custom
C++ software systems for a quite big local industry, regarding industrial production and product
testing automation.

In 2009, I've successfully applied to the Google Summer of Code for the KDE project, implementing
KAuth, the authentication and authorization framework used by KDE desktops applications.

6. Contact information
e-mail: nicola.gigante at gmail.com
IRC nickname: gigabytes at FreeNode and OFTC
personal blog: http://www.gigabytes.it (In italian, sorry)

--

Bye,
Nicola

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110406/87604a81/attachment.html>