[cfe-dev] AST Representation of Conversions

Tue Jul 28 12:03:36 PDT 2009

Hi,

One thing I started once and then abandoned, and which Argiris
recently contacted me about off-list, is the AST representation of
conversions and casts. Currently, we have absolutely minimal
information: CastExpr (the base of all conversions) stores only the
operand. ImplicitCastExpr stores only whether the result is an lvalue.
ExplicitCastExpr (the base of all explicit casts) stores the target type
as written.
None of these store any of the information Sema has worked very hard to
acquire.
What kind of cast is it? Bitcast? Truncation? Extension? C++ has a large
variety of things that a conversion, especially a C-style cast, can do:
convert with constructor; convert with conversion operator; do a
hierarchy cast, potentially to a virtual base, which could mean adding
an offset to the pointer or dereferencing a pointer; do a raw bitcast
(reinterpret_cast is good at that); do an integer or floating point
extension/truncation; and even weirder things (member pointer casts,
explicit cast of the address of an overloaded function).
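To make the variety concrete, here is a minimal sketch in which the same C-style cast syntax ends up doing several of the things listed above. All type and function names (A, B, D, Meters, hierarchyCast, and so on) are made up for illustration and are not part of Clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types, used only to illustrate the cases listed above.
struct A { int a = 1; };
struct B { int b = 2; };
struct D : A, B { int d = 3; };  // B is a second, non-virtual base of D

struct Meters {
    double value;
    explicit Meters(double v) : value(v) {}    // conversion via constructor
    operator double() const { return value; }  // conversion via conversion operator
};

// The same cast syntax, five different operations:

// Static hierarchy cast; may have to adjust the pointer value.
inline B* hierarchyCast(D* d) { return (B*)d; }

// Pointer-to-integer cast; behaves like reinterpret_cast.
inline std::uintptr_t rawBits(D* d) { return (std::uintptr_t)d; }

// Integer truncation.
inline short truncateInt(int i) { return (short)i; }

// Calls Meters(double), even though the constructor is explicit.
inline Meters viaConstructor(double x) { return (Meters)x; }

// Calls Meters::operator double().
inline double viaOperator(const Meters& m) { return (double)m; }
```

Nothing in the syntax tells these apart; only the types involved do, which is the information Sema has already computed by the time the cast node is built.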

Obviously we need to save some information about the cast in the AST.
The question is what, and where.

Sema needs more detailed information about conversions than anyone else,
because it has to order them for function overloading. It needs to know
when an lvalue-to-rvalue conversion is performed, when a qualifier
conversion is performed, and precisely what conversions are done
in-between. Doug Gregor has implemented this, and it works.
CodeGen needs far less information, but can still use a lot. In the C
case, it currently recreates the necessary information by inspecting the
types again, but this approach is not tenable in the C++ case.
(Currently, an attempt to codegen a C++-specific conversion will
probably crash.) CodeGen needs to distinguish:
- a raw bitcast (reinterpret_cast of pointers and pointer/integer pairs,
reinterpret_cast of lvalues to references)
- a floating point truncation (double -> float)
- a floating point extension (float -> double)
- an integer truncation (int -> short)
- an integer extension (short -> int)
- a static hierarchy cast without virtual bases (add an offset to the
pointer)
- a static hierarchy cast with virtual bases (fetch the pointer to the
virtual base, and then add an offset)
- a dynamic hierarchy cast (emit calls to the support library)
- a user-defined conversion via constructor (call that constructor)
- a user-defined conversion via conversion operator (call that operator)
- a static hierarchy cast of a member object pointer (adjust the value
of that pointer)
- a static hierarchy cast of a member function pointer (I have no idea
how that works)
- function and array decay
- GCC aggregate casts in various forms
- vector and extvector casts
- Objective-C casts
I think that's everything. In short, CodeGen also cares about pretty
much everything.
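As a rough sketch, the list above could be written down as a single enumeration that CodeGen switches on. The enum ConversionKind, its enumerators, and the helpers requiresCall and loweringNote are all hypothetical names chosen for this illustration; they are not taken from Clang, and the notes only repeat what the list says.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical enumeration mirroring the list above; names are illustrative.
enum class ConversionKind {
    BitCast,
    FloatTruncation,
    FloatExtension,
    IntegerTruncation,
    IntegerExtension,
    StaticHierarchy,
    StaticHierarchyVirtualBase,
    DynamicHierarchy,
    ConstructorConversion,
    ConversionOperator,
    MemberObjectPointerHierarchy,
    MemberFunctionPointerHierarchy,
    FunctionOrArrayDecay,
    GccAggregateCast,
    VectorCast,
    ObjCCast
};

// True for the kinds that the list describes as being lowered to a call.
inline bool requiresCall(ConversionKind k) {
    switch (k) {
    case ConversionKind::DynamicHierarchy:
    case ConversionKind::ConstructorConversion:
    case ConversionKind::ConversionOperator:
        return true;
    default:
        return false;
    }
}

// Returns the lowering described in the list, where one is given.
inline const char* loweringNote(ConversionKind k) {
    switch (k) {
    case ConversionKind::StaticHierarchy:
        return "add an offset to the pointer";
    case ConversionKind::StaticHierarchyVirtualBase:
        return "fetch the pointer to the virtual base, then add an offset";
    case ConversionKind::DynamicHierarchy:
        return "emit calls to the support library";
    case ConversionKind::ConstructorConversion:
        return "call the constructor";
    case ConversionKind::ConversionOperator:
        return "call the conversion operator";
    case ConversionKind::MemberObjectPointerHierarchy:
        return "adjust the value of the member pointer";
    default:
        return "not spelled out in the list";
    }
}
```

A flat tag like this only names the final operation. It does not record the individual steps that Sema tracks for overload ranking, so it is a picture of what CodeGen has to distinguish rather than a proposal for what to store.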

I don't know what other clients would need. The Index library definitely
wants to know about implicitly called functions (conversion operators
and constructors). The static analyzer would probably want the same
information as CodeGen. Other static code introspection tools probably
want all of this information too.

Essentially, I think, we will have to enhance or wrap
ImplicitConversionSequence from SemaOverload.h to also be able to
represent conversions that are only explicitly possible. Then we put it
into the AST library and give CastExpr one of those.
The problem with this approach is its weight.
ImplicitConversionSequence is a heavy object (40 bytes on 32-bit without
considering alignment, 80 bytes on 64-bit if alignment works the way I
think it does), and every single ImplicitCastExpr (think of all the
"usual arithmetic conversions" in C) would bear this weight, as would
casts that don't need this information, like const_cast (a no-op for
CodeGen), dynamic_cast (always runtime calls) and reinterpret_cast
(always a bitcast).
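The per-node cost can be illustrated with two stand-in structs. This is only a sketch: LightCastNode and HeavyCastNode are hypothetical and do not mirror the real Clang layout, and the 40-byte array is a placeholder sized to the 32-bit figure quoted above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins; neither struct mirrors the real Clang layout.
struct LightCastNode {
    void* operand;  // the sub-expression being converted
    bool isLvalue;  // the single flag an implicit cast stores today
};

struct HeavyCastNode {
    void* operand;
    bool isLvalue;
    unsigned char sequence[40];  // placeholder for an embedded conversion sequence
};

// Extra bytes paid by each node that embeds the placeholder payload.
constexpr std::size_t kExtraPerNode =
    sizeof(HeavyCastNode) - sizeof(LightCastNode);

// Additional memory if every one of nodeCount cast nodes carried the payload.
constexpr std::size_t extraBytes(std::size_t nodeCount) {
    return nodeCount * kExtraPerNode;
}
```

On common 32-bit and 64-bit ABIs kExtraPerNode should come out at 40, so extraBytes scales directly with the number of implicit casts in a translation unit, whether or not a given node ever needs the sequence.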
One option would be to rearrange the hierarchy, but then it would reflect
the implementation instead of the logical grouping. Currently the
hierarchy makes sense to programmers:
http://clang.llvm.org/doxygen/classclang_1_1CastExpr.html
If we were to rearrange it to fit the needs of data storage, CastExpr
would be the direct base of CXXConstCastExpr, CXXDynamicCastExpr,
CXXReinterpretCastExpr and ComplexCastExpr. ComplexCastExpr would hold
the conversion sequence and be the base of CXXStaticCastExpr,
CXXFunctionalCastExpr, CStyleCastExpr and ImplicitCastExpr. Not pretty.
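Spelled out as code, the storage-driven hierarchy described above would look roughly like this. The classes are empty stand-ins inside a made-up namespace sketch, and ConversionSequence is a placeholder for whatever would be stored; only the inheritance structure is taken from the description.

```cpp
#include <type_traits>

namespace sketch {  // simplified stand-ins, not the real Clang classes

struct Expr { virtual ~Expr() = default; };
struct ConversionSequence {};  // placeholder for the stored sequence

struct CastExpr : Expr { Expr* operand = nullptr; };

// Casts that need no conversion sequence derive from CastExpr directly.
struct CXXConstCastExpr       : CastExpr {};
struct CXXDynamicCastExpr     : CastExpr {};
struct CXXReinterpretCastExpr : CastExpr {};

// Everything that needs the sequence goes through ComplexCastExpr.
struct ComplexCastExpr : CastExpr { ConversionSequence sequence; };
struct CXXStaticCastExpr     : ComplexCastExpr {};
struct CXXFunctionalCastExpr : ComplexCastExpr {};
struct CStyleCastExpr        : ComplexCastExpr {};
struct ImplicitCastExpr      : ComplexCastExpr {};

}  // namespace sketch

// Implicit casts and static_cast now share a base that const_cast does not.
static_assert(std::is_base_of<sketch::ComplexCastExpr,
                              sketch::ImplicitCastExpr>::value,
              "ImplicitCastExpr carries the sequence");
static_assert(!std::is_base_of<sketch::ComplexCastExpr,
                               sketch::CXXConstCastExpr>::value,
              "CXXConstCastExpr does not carry the sequence");
```

The sketch shows the loss: there is no longer any class grouping the explicit casts, so code that wants "any cast the user wrote" has to enumerate the leaf classes itself.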

Does anyone else have suggestions on how to solve this problem?

Sebastian