[cfe-dev] Cleaning up the representation of Decls in the AST

Fri Aug 22 14:08:47 PDT 2008

Over the last couple of weeks Daniel Dunbar and I have had a series of
conversations about various serious issues with the current design and
implementation of Decls in Clang.  We've come up with some proposed
changes that we think will resolve most of the problems that we know
about, and I thought I would sketch out these ideas here so that we
can openly discuss these ideas and debate their value.

I'm going to start by outlining some of the problems we are trying to
solve and then discuss our solution from both a design and  
implementation
level.

We're hoping to start implementing these changes very soon (next week).

Any comments or feedback you have are appreciated!

PROBLEMS WE'RE TRYING TO ADDRESS

(1) AMBIGUOUS OWNERSHIP AND LOSS OF INFORMATION

There are many cases where it is not clear what is the owning
reference to a Decl.  This poses problems for both AST serialization
(which is needed for PCH support), but is also needed for refactoring
clients that wish to modify the AST.  Similarly, the ASTs don't retain
information about the source code that would be vital for accurately
representing key elements of the original source code (necessary for
refactoring), or even being able to pretty-print valid ASTs to code
that can compile.

For example, consider the following code:

[EXAMPLE 1]

void f() {
   typedef struct x { int y; } foo;
}
struct x { int y; };
void g() {
   typedef struct x foo;
}

clang -ast-print currently pretty-prints this as:

void f() {
   typedef struct x foo;
}
...
void g() {
   typedef struct x foo;
}

Currently both f() and g() appear the same during printing because
their is no reference from either the TypedefDecl or the DeclStmt to
the RecordDecl for 'struct x'.  Only the RecordType for 'struct x'
refers to the RecordDecl, which means in this case the RecordType
actually *owns* the RecordDecl.  This is horrible, not only because a
Type object should not own Decls, but that a RecordDecl can sometimes
be owned by a DeclStmt or be a top-level declaration:

[EXAMPLE 2]

struct x { int y; }; // GOOD: The TranslationUnit owns the RecordDecl.
struct z { int y; } z; // BAD: The RecordType owns the RecordDecl.

The lack of clear ownership semantics, along with the loss of
information that it implies, makes it very difficult to always
faithfully pretty-print code that actually compiles, or compiles with
the same semantics:

[EXAMPLE 3]

void f() {
   struct { int x; } a, b, *c;
}

clang -ast-print:

void f() {
   struct <anonymous> a;
   struct <anonymous> b;
   struct <anonymous> *c;
}

clang -ast-dump:

void f()
(CompoundStmt 0xc070f0 <<stdin>:3:12, line:5:3>
   (DeclStmt 0xc070d0 <line:4:5>
     0xc06ff0 "struct <anonymous> a"
     0xc07050 "struct <anonymous> b"
     0xc070a0 "struct <anonymous> *c")

As before, in this case the RecordType owns the RecordDecl.  Not only
that, but -ast-print pretty-prints the variable declaration in a
completely broken way.  While --ast-print can be enhanced to do a
better job in this case, fundamentally the Decl for the anonymous
struct needs to referenced from the AST nodes themselves.

(2) NO DISTINCTION BETWEEN A 'DECL' AND A 'GROUP OF DECLS'

The way we represent "groups of Decls" is both not uniform and poorly
abstracted.

For multi-variable declarations (the most common case of "groups of
Decls"), we represent them as a single ScopedDecl* that points to the
first ScopedDecl* in a chain.  The problem is that many clients assume
that when they are handed a single Decl* they are operating on a
single Decl, not a group of Decls.  This continues to bite us over and
over.  Here are two examples.

The first is the AST pretty-printer, which only pretty-prints the
first Decl in a multi-variable declaration that appears at the
top-level:

$ clang -ast-print
int a, b, c;
^d
typedef char *__builtin_va_list;
Read top-level variable decl: 'a'

Clearly this can be fixed, but basically this client was written to
assume that when it was handed a Decl* it was just reasoning about a
single Decl.  Clients have to do downcasts and isa<> interrogations of
Decls to determine if a Decl* is referring to a single Decl or a group
of Decls.  In the particular case of the pretty-printer, I counted
about four or five locations in StmtPrinter.cpp where
DeclStmt::getDecl() is called (which returns a chain of ScopedDecls)
and only the first Decl is processed.

A second example of where this use to bite us is codegen support in
Clang.  Until recently, when we saw "int a, b, c;" we only did the
codegen for 'a'.  This was fixed, but obviously future clients will
continue to make the same mistake over and over, and their failure
modes might not be as dramatic to immediately notice they are getting
it wrong.

Clearly a disambiguation between Decls and 'groups of Decls' would be
useful, not only for clients, but also for cleanliness within the ASTs
themselves.  This would help resolve numerous ownership issues as
well.  For example, if we had DeclStmt::getDeclGroup() instead of
DeclStmt::getDecl() (or something like that) that returned a group of
Decls then this kind of programming error wouldn't be possible because
the type system with prohibit it.  Clients would explicitly reason
about 'groups of Decls'.  This would prohibit many errors, but also
would simplify code in many cases.  Specifically, in the cases where a
single Decl* is passed to a client the client can assume they are
dealing with a single Decl.

(3) ScopedDecl CHAIN: CONFLATED IMPLEMENTATION AND LACK OF ABSTRACTION

The decl chain in ScopedDecl is currently used for different purposes
in different contexts, which muddles up the ASTs and makes it often
very unclear what is going on.

For example, an EnumDecl represents its enum constants by chaining
them together using a ScopedDecl chain.  This same chain is used to
represent multi-variable declarations as well as chaining together
NamespaceDecls.  In each of these cases the ownership semantics with
regards to the chain are slightly different, and conceptually they
represent very different things.

For C++ namespaces, we can still chain them together (as we do with
FunctionDecls), but not used the ScopedDecl chain because it is
semantically something very different.

Further, most clients that do successfully walk the ScopedDecl chain
walk the chain directly.  This is bad because it fuses us to a
particular implementation instead of having clients reason about an
abstraction.  Instead we should use an iterator interface to walk a
group of decls.

In generally, we believe that the ScopedDecl chain should just be
removed.  It is used in logically inconsistent ways and almost every
client that has used it has gotten things wrong at least once.

PROPOSED SOLUTION

We propose the following solution to address these issues:

- EnumDecl and NamespaceDecl no longer used the getNextDeclarator()
  chain in ScopedDecl. Both can use a chain (as they do now), or some
  alternate representation. However, the implementation chosen will be
  private to each class and an iterator interface should suffice to
  allow clients to ignore the details of the implementation.

- Introduce the notion of "DeclGroup" that represents a group of
  (comma-separated) Decls.

Essentially, DeclGroup would have the following grammar:

DeclGroup -> TypeDecl Decl+
           -> Decl

DeclGroup would *own* the Decls it references.  The "TypeDecl"  
represents an optional extra type declaration associated with the  
DeclGroup.

For example, to represent:

struct {} x, y;

The DeclGroup consists of:

(1) A RecordDecl (the "TypeDecl") for the anonymous struct.
(2) Two VarDecls for x and y.

This also works for typedefs:

typedef struct x {} foo, bar;

becomes a DeclGroup consisting of:

(1) A RecordDecl for 'struct x' (the "TypeDecl").
(2) Two TypedefDecls for 'foo' and 'bar'.

Note that this completely resolves the ownership issues with anonymous
structs and similar constructs.  It also means that Types no longer
own Decls (a good thing).

For the common case where a DeclGroup contains a single Decl (and no
extra "TypeDecl" is needed), DeclGroup would have an optimized
implementation so that using DeclGroup has no additional space
overhead compared to our current implementation. Our idea is to
essentially implement DeclGroup as a smart pointer, and for this
common case it just contains the pointer to the solitary Decl (no
additional overhead).

Moreover, by using a smart-pointer DeclGroup instead of a chain of  
ScopedDecls, we actually will save the space used by the  
NextDeclarator field in ScopedDecl (since it's NULL most of the time).

- DeclStmt uses DeclGroup instead of a ScopedDecl*.

- *All* top-level Decls would be replaced by DeclGroups.

Conceptually this unifies how DeclStmt and the top-level represent
groups of Decls.

ParseAST would be modified to call
ASTConsumer::HandleTopLevelDeclGroup instead of HandleTopLevelDecl. We
can still have HandleTopLevelDecl, and have the default implementation
of HandleTopLevelDeclGroup just call HandleTopLevelDecl for each of
its constitute Decls.

- RecordDecls would represent their fields as a group of DeclGroups.
  This enables us to accurately represent:

struct x {
   struct y { int z; } a, b;
   int c;
};

A field_iterator interface can provide the necessary abstraction to
clients that just want to iterate over the FieldDecls of a struct.

- The arguments of a function would be represented as a list of
  DeclGroups.  This is because the following is legal (note the
  anonymous struct):

int foo(int x, struct { int a; } y);

Similarly, an iterator interface provides clients with a clean way to
iterate over the ParmVarDecls in a function prototype.

- Similar changes everywhere else where we have groups of
  (comma-separated) Decls that can contain both type and variable
  declarations.

PROS AND IMPLEMENTATION

We believe these changes could be made very incrementally to the
parse/Sema/ASTs/ASTConsumers.

The main benefits we see from these changes are:

- Cleaner API: Clients always know when they are reasoning about Decls
  or groups of Decls.

- Source-Fidelity: We capture much more of the source information in
  the ASTs.  This is great for refactoring clients.

- Object-Ownership: Most of the known object ownership issues in the
  ASTs go away, making it easier for clients to manipulate
  (e.g. refactoring) as well as unblocking full AST serialization
  (needed for PCH).

- Abstraction: Given the proper iterator interfaces, most clients will
  not care about DeclGroups.  They can just use the interfaces to
  iterate over Decls.  The default implementation of
  ASTConsumer::HandleTopLevelDeclGroup would just call
  ASTConsumer::HandleTopLevelDecl for each Decl* in a top-level
  DeclGroup.  Clients that care about top-level DeclGroups can
  override HandleTopLevelDeclGroup.

- ASTs better reflect the grammar of C: We believe these changes would
  make it clearer in many cases what source constructs a particular
  AST node can represent.  This is invaluable for clients.

Thoughts?