[cfe-dev] [GSoC] "Improving Clang's AST source-fidelity" proposal

Thu Apr 7 09:34:18 PDT 2011

Hi Nicola,

I think this is a good proposal.  My main concern about this project is the potential impact it will have on compile-time performance and memory footprint of the compiler.  I'd like to see in the proposal a plan on measuring regressions on these fronts (which are bound to happen by retaining more information in the ASTs) as well as what is deemed an acceptable level for performance to slide.  For example, if compile time regresses by 10%, I would strongly object to the changes.  Even 5% would probably be too high.

The other thing to keep in mind is that any AST changes will also require changes to the PCH format.  It's critical that PCH remain efficient, and that whatever changes we make to the AST does not adversely impact how much data is pulled from the PCH file (thus slowing down compile times).  For most of the topics you outline I don't think this is an issue, but this is definitely something to keep in mind.

I think the proposal covers a bunch of areas, and it is entirely possible that you won't have time to do them all.  If you were to tackle this project, my preference is that you work on one area to completion, and then proceed to the next one.

Cheers,
Ted

On Apr 6, 2011, at 7:11 AM, Nicola Gigante wrote:

> Hello everybody. This is my other proposal for the Google Summer of Code.
> Since the student application deadline is very near, I've already submitted it
> (and the other one about lambdas) to melange.
> Comments and suggestions are really appreciated.
> 
> --
> Title: Improving Clang's AST source-fidelity
> 
> 1. Introduction:
> Among the many strengths of clang, a most noticeable one is the excellent
> support for client applications different from traditional compilers.
> When compared to other systems, clang behaves spectacularly
> better as far as source-code fidelity is concerned and is going to become
> the reference choice for source-code based applications such as refactoring
> tools, pretty printers and beautifiers, coding-style checkers, etc.
> 
> Most of this is a direct result of the accuracy with which the syntactic
> structure of the parsed sources is represented in the clang AST:
> different syntactic construct, even though semantically equivalent,
> are provided with a precise AST representation that almost allows for
> a faithful reconstruction of the original sources. Besides this, the AST
> nodes encode accurate source location information to be used, e.g.,
> when issuing diagnostic messages.
> 
> Even though clang does an excellent job, its AST representation
> is not yet as accurate as it should be (i.e., not perfect).
> Here is a list of current issues that can affect the precision of
> source-code based applications using clang as front-end. This list of
> issues has been obtained in strict cooperation with Roberto Bagnara,
> Abramo Bagnara and Enea Zaffanella, who actively work with clang on
> source-based applications and would be very happy to mentor this idea.
> 
> 2. Issues:
> *) Coherent encoding of expressions occurring in initializers.
> 
> The AST representation adopted by clang, besides encoding the syntactic
> constructs explicitly occurring in the source code, also represents
> those semantic constructs that are only implicitly occurring.
> The classical examples are those of implicit type casts, default argument
> expressions in C++ function calls, implicit declarations of special
> member functions in C++ classes, etc. That is, the syntactic structures
> get appropriately "dressed" by semantic info so that, for instance, type
> information is always accurate.
> 
> For the special case of initialization lists, a somewhat different
> approach is taken: clang makes a strong distinction between the syntactic
> and the semantic forms of the initializer list. Even though the syntactic
> form appropriately represents those initializer expressions that were
> present in the source code, these may or may not be "dressed" with the
> semantic info. On the other hand, the initializers in the semantic form
> are always appropriately dressed, but here we typically see initializers
> that were not actually present in the original sources.
> The issue is discussed in the following thread:
> 
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html
> 
> We propose two changes to the handling of initializer lists.
> 
> The first change (which is of interest for source-code based applications)
> is to modify the syntactic form of init list by requiring that all
> expressions occurring in it are properly "dressed" by semantic info:
> this will enhance the coherence of the AST representation for the targeted
> applications.
> 
> The second change is an optimization of the representation of the
> semantic form of initializer lists so as to increase node sharing and
> hence greatly improve the memory footprint in special cases (again, see
> http://lists.cs.uiuc.edu/pipermail/cfe-dev/2011-January/013037.html).
> Besides improving AST node sharing, the idea is to also allow for
> range designators inside the semantic form of init lists.
> An example of code that currently behaves badly is the following:
> 
> int a[1000000] = { [0] = 1, [999999] = 1 };
> 
> ---
> *) Completing work on attributed types.
> 
> Currently, clang is sub-optimal when handling the following kinds
> of declarations:
> 
> typedef __attribute__((mode(byte))) int small; // attribute disappears
> small s1;
> int __attribute__((mode(byte))) s2; // attribute is ignored
> 
> In the first case, the attribute is "consumed" during parsing, producing
> a typedef referring to the semantic type 'signed char', with no indication
> that, on the syntactic side, an attribute was used.
> In the second case, the attribute is just ignored.
> 
> Recently, John McCall added new AST node AttributedType in the Type
> hierarchy, in order to correctly model situations such as those above.
> 
> http://llvm.org/viewvc/llvm-project?view=rev&revision=122942
> 
> This new AST node is not yet integrated into the Parse/Sema routines.
> 
> ---
> *) AST representation for explicit instantiation of templates.
> 
> clang does not differentiate between the (syntactic) AST nodes declaring
> an explicit instantiation of a template and the (semantic) instantiation
> of the template. This can be observed by dumping the AST for the following
> example:
> 
> template <typename T> struct s { };
> extern template class s<double>;
> extern template class s<int>;
> extern template class s<int>;
> 
> obtaining something like:
> 
> template <typename T> struct s {};
> struct s {};
> struct s {};
> class s;
> 
> Apart from minor pretty printing issues (such as the lack of the
> "extern template" keywords and the lack of template argument lists),
> we can see that the first instantiations for s<double> and s<int>
> are provided with a class body, because they reuse the semantic
> instantiation node. This also has the side effect, for instance,
> of printing "struct" instead of "class" as the tag keyword.
> Compare this with the last declaration, which is a re-declaration of s<int>
> and is therefore provided with a new, syntactic node (in this case, the AST
> structure is fine, we only need to improve the pretty printer).
> 
> ---
> *) Friend declarations
> 
> One of the problems with friend declarations, maybe also related to the
> issue above, occurs when a template function that was declared to be a
> friend of a class gets later instantiated, as in the following code:
> 
> template <class T>
> int foo() { return 0; }
> 
> class S {
> friend int foo<S>();
> };
> 
> void f() {
> foo<S>();
> }
> 
> The instantiation of foo inside f() causes the instantiation of the very
> same function declaration that is referred by the friend declaration
> in class S. Hence, the friend declaration becomes a definition,
> causing troubles to source-based code tools.
> Rather, the instantiation should result in the production of a new,
> non-syntactic function declaration (i.e., a re-declaration of the
> function declared as friend).
> 
> Another issue with friends has to do with so-called FriendTemplateDecl
> AST nodes. According to current documentation, this AST node is never
> produced by Parse/Sema:
> 
> 01929 /// template <typename T> class A {
> 01930 /// friend class MyVector<T>; // not a friend template
> 01931 /// template <typename U> friend class B; // not a friend template
> 01932 /// template <typename U> friend class Foo<T>::Nested; // friend template
> 01933 /// };
> 01934 /// NOTE: This class is not currently in use. All of the above
> 01935 /// will yield a FriendDecl, not a FriendTemplateDecl.
> 
> ---
> *) Missing TypeSourceInfo for exception specifiers in FunctionProtoTypeLoc.
> 
> As an example, the AST representation for the exception specifier
> in the following function declaration
> 
> void foo() throw(int (*)[2+1]);
> 
> is mapped into the semantic type where the array size is computed:
> 
> void foo() throw(int (*)[3]);
> 
> ---
> *) Missing flag for the presence of (optional) 'template' keywords
> in nested name specifiers (NestedNameSpecifierLoc).
> 
> When parsing the following chunk of code
> 
> namespace BAR {
> 
> template <class U> static int foo();
> 
> template <typename T>
> void f() {
> int (*g)();
> g = &BAR::template foo<int>;
> g = &BAR::template foo<T>;
> }
> 
> } // namespace BAR
> 
> we obtain an AST representation with no info about the presence/absence
> of the 'template' keyword.
> 
> ---------
> 3. Why I'm interested in this project:
> My main academic interest area is in the development of compilers, and for this reason
> I've followed and used the LLVM and Clang project for the last two years.
> Since I plan to graduate this year, the GSoC timeline fits quite well in my plans and
> I would like this project to be my undergraduate thesis work.
> This specific project has been proposed to me by Prof. Roberto Bagnara, from
> University of Parma, which is very interested in this project and would be my
> thesis co-relator.
> 
> 4. How the project will be useful for LLVM:
> Clang's AST code-fidelity is crucial for applications based on source code manipulation.
> Having the most coherent and precise AST as possible is very important to reach Clang's
> design goals.
> 
> 5. Prior knowledge in compilers and LLVM:
> As I said, in my academic life I've been interested primarily in compilers
> implementation and design. For this reason, I'm following the Compilers course
> at the University as part of my undergraduate degree in CS, and I've been interested
> in LLVM and Clang projects for the last two years.
> I've used the LLVM backend for a small personal project last year, and I've sent a couple
> of trivial patches to Clang in 2009.
> 
> 6. Academic, Industry and other experiences:
> I'm an undergraduate student of the Computer Science course at the University of Udine, Italy
> (http://www.uniud.it)
> 
> I've worked a lot with C++ in the last five years, including the development of two complex custom
> C++ software systems for a quite big local industry, regarding industrial production and product
> testing automation.
> 
> In 2009, I've successfully applied to the Google Summer of Code for the KDE project, implementing
> KAuth, the authentication and authorization framework used by KDE desktops applications.
> 
> 6. Contact information
> e-mail: nicola.gigante at gmail.com
> IRC nickname: gigabytes at FreeNode and OFTC
> personal blog: http://www.gigabytes.it (In italian, sorry)
> 
> --
> 
> Bye,
> Nicola
> 
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev