[cfe-dev] new design of RecursiveASTVisitor

Fri Jun 4 16:35:37 PDT 2010

Hi,

While working on improving the RecursiveASTVisitor class, some of us saw an
opportunity for a different design that's cleaner and easier to understand
and use.  Here's the proposal.  Comments and suggestions welcome!
Clang AST Visitor Design

This document discusses the design of a customizable visitor class for the
AST generated by the Clang C++ parser.

 Background

Clang comes with two AST visitor classes: ASTVisitor and
RecursiveASTVisitor. The ideas behind the two are similar. Both are
implemented using the CRTP pattern. The former is considered an internal
class and not for use in tools developed on top of Clang. The latter is
supposed to be a public API.

At this time, neither class is complete (i.e. they'll miss some AST nodes)
and there are known bugs (e.g. some AST nodes may be visited more than
once). We are working on improving RecursiveASTVisitor. Our experience shows
that its design has much space for improvement. Therefore it would be a
worthwhile exercise to see what we can come up with if we do it from
scratch, without being constrained or influenced by the current
RecursiveASTVisitor design.

 Concepts and Terminology

Muddy concepts lead to poor, confusing designs. Before we explore the design
space, it's important to clarify some key concepts.

   - An *AST* is a tree-shaped data structure. A node in an AST can have 0
   or many other nodes as its children.
   - There a *three categories* of AST nodes in Clang: statements,
   declarations/definitions, and types. They are derived from class Stmt, Decl,
   and TypeLoc, respectively. In general, a node may have children of any or
   all of the three categories. The class hierarchies for Stmt, Decl, and
   TypeLoc can be found here <http://clang.llvm.org/doxygen/hierarchy.html>.

   - The AST visitor has *three* distinct tasks:
      1. traverse the entire AST (i.e. go to every node),
      2. at each node, walk up the class hierarchy, starting with the
      dynamic type of the node,
      3. for a given (node, type) combination (where 'type' is some base
      class of the dynamic type of 'node'), call a user-overridable function to
      actually "visit" the node.

  Design Goals

   - easy-to-understand API
   - performs a
depth-first<http://www.google.com/url?sa=D&q=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FTree_traversal%23Depth-first_Traversal>,
   preorder traversal on the AST (no child of a node can be visited until all
   work on the node has been done)
   - complete coverage: each node in the AST must be visited
   - no duplication: a node cannot be visited more than once
   - common use cases should be easy
   - fast

  Our Design

The distinction between the AST visitor's three tasks (see the "Concepts and
Terminology" section above) is important for understanding how the AST
visitor works and using it effectively. Therefore we use naming conventions
to highlight the role of a method of the AST visitor:

   - TraverseFoo(Foo* x) does task #1. It traveses (i.e. recursively visits)
   the sub-tree rooted at x, where Foo is *x's dynamic type (it's the
   caller's responsibility to ensure the correct type).
   - Traverse(Decl* x) simply dispatches (i.e. forwards) to TraverseFoo(Foo*
   x) where Foo is the dynamic type of *x. Traverse(Stmt* x) and
Traverse(TypeLoc*
   x) work similarly.
   - WalkUpFromFoo(Foo* x) (better name welcome) does task #2. It does not
   try to visit any child of x. Instead, it first calls WalkUpFromBar(x)where
   Bar is the direct parent class of Foo (unless Foo has no parent), and
   then calls VisitFoo(x) (see the next bullet).
   - VisitFoo(Foo* x) does task #3.

 Here's some pesudo code to show what these methods' implementation looks
like:

// A function returns false if it wants the traversal to abort (note that
// the current RecursiveASTVisitor class returns true for abortion - we
// feel that 'false' is much more intuitive).
// Therefore, each caller needs to check the return code and propagate
// it up appropriately.  Such plumbing code is omitted here for clarify.

// The entry point for visiting an AST rooted at x.
bool Traverse(Decl* x) {

  switch(x->getKind()) {

  case kFoo:
    TraverseFoo((Foo*)x);

    break;
  case kBar:
    TraverseBar((Bar*)x);

    break;
  ...
  }
}

bool TraverseFoo(Foo* x) {

  WalkUpFromFoo(x);
  for each child node n of x {

    Traverse(n)
  }
}

bool WalkUpFromFoo(Foo* x) {

  WalkUpFromBar(x);  // Bar is Foo's parent.

  VisitFoo(x);
}

bool WalkUpFromDecl(Decl* x) {

  // Decl has no parent, so no more walking up.
  VisitDecl(x);
}

// All Visit*() methods do nothing by default.
bool VisitFoo(Foo* x) { return true; }

 Usually, a user shouldn't override a Traverse*() or WalkUpFrom*() function
- they are the "skeleton" of the AST visitor and do the mundane plumbing
that a normal user is not interested in. We recommend that only people truly
understand how the traversal works can override them.

A user is expected to override the Visit*() methods as needed. The default
implementation of them does nothing.

Returning a bool to indicate whether the traversal needs to be aborted
requires much tedious, error-prone plumbing in the implementation.
Exceptions are the most natural solution here, but unfortunately they are
banned from Clang code.

One advantage of this design over the existing RecursiveASTVisitor is that
when a user overrides VisitFoo(), he only needs to do what he's interested
in, without worrying about calling VisitFoo() or any other methods in the
base AST visitor class (so less chance to shoot his own feet).

With this design, all Visit*() methods on an AST ndoe are called before any
Visit*() method is called on a child node of it.  So the trace of the
Visit*() calls is easy to understand and reason about.

One thing people might not like about the design is that it uses more
methods. I actually think it's a good thing that we have separate families
of methods for separate roles. RecursiveASTVisitor's approach of lumping all
logic (traversal, walking up the type hierarchy, and the custom visiting
logic) in the same Visit*() method makes the implementation more tricky and
error-prone - that's probably why there was so much confusion when we were
trying to improve RecursiveASTVisitor.

To ensure performance, we would use CRTP as the existing AST visitors do.
That means two things:

   1. No need for virtual functions.
   2. The entire code is templated s.t. the compiler can easily inline
   trivial functions and do cross-function optimization.

  Use Cases

Let's see how some typical use cases can be implemented using our AST
visitor.

 Dump the AST

Only 3 methods need to be overridden:

VisitDecl(Decl* x) {

  print information about x;
}

VisitStmt(Stmt* x) {

  print information about x;
}

VisitTypeLoc(TypeLoc* x) {

  print information about x;
}

 Search for an AST Node of Type Foo

The tool would override VisitFoo(Foo* x) to return false if x is the node
it's looking for.

 Find the Parents of an AST Node

The tool would maintain a stack of the nodes that lead to the current node.
It can build the stack by overriding exactly 3 methods (plumbing of the
return code omitted):

Traverse(Decl* d) {

  Push(d);
  Base::Traverse(d);

  Pop();
}

Traverse(Stmt* s) {

  Push(s);
  Base::Traverse(s);

  Pop();
}

Traverse(TypeLoc* t) {

  Push(t);
  Base::Traverse(t);

  Pop();
}

-- 
Zhanyong
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20100604/78563818/attachment.html>