[cfe-dev] [RFC] ConstExprPreter — clang constexpr interpreter

Wed Jul 3 10:48:53 PDT 2019

TL;DR: Fast interpreter for C++ constexpr to replace the existing tree evaluator

I am a PhD student at the University of Cambridge currently interning at Apple, 
working with JF Bastien and Michael Spencer on improving constexpr performance. 
Constexpr evaluation is presently slow and difficult since it relies on a 
monolithic AST walker. We propose replacing the tree evaluator with a bytecode 
interpreter to improve compilation times. The tree evaluator also poses 
significant maintenance and scalability problems, which we intend to ameliorate 
using the interpreter. While being generally faster, the interpreter does 
present two critical issues: a slightly increased memory footprint and added 
complexity to the compiler. This tradeoff is justified, as efficient constexpr 
evaluation could prove to be valuable as the language evolves.

We would like to integrate this interpreter into clang. This RFC details the 
benefits of an interpreter and describes an initial implementation, along with 
a roadmap for replacing almost all of the tree evaluator with the interpreter. 
Even without optimizations, the performance of the interpreter matches that of 
the tree evaluator, thus the short-term focus is on feature completeness, not 
evaluation speed, as reflected by known inefficiencies in the current 
implementation. We would highly appreciate comments related to integration into 
clang and our roadmap towards replacing the evaluator, as well as feedback on 
the initial patch. This project serves mostly as a prototype in order to 
determine what kind of bytecode and compiler is required to efficiently evaluate 
code and emit useful diagnostic messages, paving the way for a future fast, 
potentially JIT-ed, interpreter.

What?

The ConstExprPreter is an interpreter for C++ constexpr embedded into the clang 
frontend: instead of evaluating constant expressions by walking the AST, the 
constexpr interpreter compiles C++ to safe bytecode and executes the bytecode 
in accordance with the constexpr semantics, emitting all appropriate 
diagnostics. It aims to replace the existing AST walker, which is less efficient 
and does not scale well in complexity as the constexpr subset of the C++ 
language is expected to increase in the future.

Why?

The present constexpr evaluator is a 12.000 LOC monolith defined in 
clang/lib/AST/ExprConstant.cpp and poses a performance and maintenance problem. 
The tree interpreter limits the complexity of constexpr evaluated in a module by 
bounding recursion depth (-fconstexpr-depth=) and bounding the number of 
execution steps (-fconstexpr-steps). This severely limits the complexity of 
expression which can be evaluated in a given time budget. Furthermore, because 
of complexity, the implementation of certain potential future features on top 
of the evaluator, such as exception handling, pose serious difficulties. An 
efficient constexpr interpreter is expected to be faster and easily extensible, 
ameliorating some of the limitations of the tree evaluator, especially regarding 
performance. By improving the evaluation speed of constexpr, we expect to enable 
C++ users to replace instances of automatically generated code with complex 
constexpr, simplifying and improving the reliability of builds.

Proposed Roadmap

* Commit the initial patch which embeds a simple bytecode compiler and 
  interpreter into clang, alongside the existing constexpr evaluator. This 
  interpreter only supports a subset of constexpr and is disabled by default.

* Add features to the interpreter, reaching a point where it supports all the 
  features of the existing evaluator.

* Turn the interpreter on by default. 

* Move the entry point of the interpreter from function calls to arbitrary 
  toplevel expressions. In order to avoid performance regressions, a small 
  subset of features will be evaluated  without entering the bytecode compiler 
  and the VM: for example, frequent integer constant definitions, such as 
  constexpr int kMask = 1 << sizeof(T). This strategy requires keeping parts of 
  the existing Int evaluator, but allows the removal of the LValue, Complex, 
  etc. evaluators, significantly reducing the complexity of ExprConstant.cpp.

* Remove most of the toplevel evaluator, minus the parts required to interpret 
  simple expressions. Roles will be reversed: if the evaluator encounters an 
  unsupported feature, it falls back to the interpreter.

* Remove the flags enabling/forcing the interpreter.

Initial Implementation

The initial contribution is available in D64146. This only contains the 
minimal interpreter required to compile a basic example. Further patches have 
been developed, implementing pointers, which will be submitted for review 
afterwards.

The implementation of the interpreter resides in lib/AST/ExprVM, in the 
clang::vm namespace. The sole entry point is from ExprConstant.cpp, in the 
HandleFunctionCall method. When this method is invoked in order to evaluate a 
function application, the vm::Context attached to each module is retrieved from 
ASTContext and an attempt is made to compile and interpret the function with 
the given parameters. If this fails, the tree interpreter is used as a fallback, 
unless the command line flags explicitly ask for a diagnostic pointing to 
unimplemented features. If the interpreter succeeds, the result is converted to 
a format compatible with the tree evaluator.

The bytecode compiler simply walks the AST of each function generating a 
vm::Function, linking them together into a vm::Program. Certain peephole 
optimizations are performed in the compiler in order to optimize 
local/parameter accesses and remove redundant instructions, avoiding the use of 
pointers whenever possible. The compiler heavily relies on type information: 
the Classify method in vm::Context is used to approximate a type from the AST 
with a base type supported by the interpreter. The types and the necessary 
utilities are defined in Type.cpp. Presently, only 32-bit integers are 
supported, however 8, 16 and 64 bit integers will be added and a fallback onto 
APSInt is planned for 128-bits and beyond.

Internally, the compiler relies on a type classification–PrimType–to decide what 
opcodes to emit for a particular operation. The Context::Classify method relies 
on target information to decide what internal types to map C/C++ types to, 
returning llvm::Optional<PrimType>. In the future, the classification is 
expected to be complete, removing a large number of the nullopt-checks from the 
compiler.

The opcodes supported by the VM are defined in Opcodes.inc, a header used to 
generate most of the interpreter, as well as the disassembler. The VM is stack 
based, since such bytecode is fast to generate and the high upfront cost of 
register allocation is avoided. In order to accurately emit diagnostics, the VM 
needs to cooperate with the tree interpreter—this is achieved by isolating 
diagnostics into the vm::State class, inherited by both EvalInfo and 
InterpState. Stack frames now use vm::Frame as their base class, inherited by 
InterpFrame and CallStackFrame. This allows the stack frame to be traced through 
both the VM and the tree walker effectively. The present focus is on 
correctness — a path is kept open to optimize the interpreter and improve 
instruction dispatch and memory layout, however this is not the current 
priority. The interpreter could also be specialized into two variants: one that 
only detects problems, falling back to a slower version which tracks 
significantly more metadata for informative diagnostics.

The interpreter needs to detect pointer accesses that would result in Undefined 
Behavior: dereferences and arithmetic on pointers which point to invalid memory 
locations. To achieve this, each allocation site has a unique descriptor, 
containing the metadata required to emit a diagnostic message. Each allocation 
creates a block tracking the descriptor, along with all live pointers to that 
block. Whenever a block goes out of scope, all the pointers are invalidated: 
instead of pointing to the block, they now point to the descriptor. If such a 
pointer is used, a diagnostic is generated and execution stops. This scheme 
adds an overhead to pointer arithmetic, as the intrusive linked list of 
pointers to a block needs to be maintained. If pointers correctly track the 
lifetime of stack objects, no additional cost is paid at deallocation sites as 
there are no pointers to invalidate. In the future, we might investigate 
different schemes, such as the one in asan (shadow memory + epoch counters), or 
garbage collection to keep invalid blocks alive as long as there are pointers 
to them.

Usage

The proposed patch adds two new flags to clang:

-fexperimental-constexpr-interpreter:

 Enables the interpreter, but falls back to the tree evaluator when encountering 
 language features which are not supported.

-fexperimental-constexpr-force-interpreter:

 Forces the use of the interpreter, generating an error diagnostic whenever an 
 unsupported feature is encountered.

The behaviour of the compiler and the format of emitted diagnostics remains 
unchanged, unless a bug in clang’s diagnostics is identified. In such a case, 
we emit what we consider to be the correct diagnostic.

Performance

Since the present evaluator is not optimised for performance, a fair comparison 
is difficult to obtain, but we expect the interpreter to outperform the tree 
evaluator on most repetitive structures. Benchmarks on 1.000.000 iterations of 
the sum_pow function in test/ExprVM/loop.cpp show a ~10× improvement in the 
running time of the evaluator. Further tests are required in order to compare 
the performance of the interpreter to the tree evaluator when evaluating 
smaller, non-repetitive expressions. Once a reasonable subset of constexpr is 
implemented in the VM, performance benchmarks on open-source constexpr code 
will help us compare the cost of compilation and interpretation to that of 
direct evaluation. Since the interpreter is significantly faster, the currently 
tight limits on the number of statements and the number of steps can be relaxed.

Memory Usage

The constexpr interpreter requires a significant amount of metadata in order to 
detect cases of undefined behavior and correctly report diagnostics. The 
existing evaluator imposes a massive overhead, since all integers are 
represented as APInt and all pointers keep a significant amount of metadata. 
This overhead is lowered in the bytecode interpreter by better exploiting type 
information: integer constants are fixed width and require no additional 
information, while the size of pointer metadata is significantly reduced - 3x 
increase in pointer size and a 16-byte overhead per allocation (the interpreter 
tracks actual pointers for fast dereferencing, while the evaluator maintains an 
lvalue and a path into that lvalue for structure fields, incurring a massive 
overhead). Since the present implementation focuses on maintainability and 
feature completeness, this can be further reduced in the future.

The compiled bytecode, quite dense due to the type-specialized opcodes, is 
maintained in memory as long as the AST is live, which currently happens to be 
live throughout all compilation phases. This adds to the total memory used by 
the compiler. In the future, mitigations might be required. Given that the 
ground truth—the AST—is present in memory, compiled functions could be 
discarded and recompiled on demand, reducing peak memory usage.

Complexity

The interpreter duplicates the functionality of the existing evaluator, 
presently adding significant complexity to the frontend. Unlike the monolithic 
ExprConstant.cpp implementation, the constexpr interpreter is significantly more 
modular, spreading the complexity across the compiler and the interpreter. It 
will be possible to test the compiler separately from the interpreter, allowing 
for easier maintenance. We expect the frontend to become simpler and more 
maintainable after the AST walker is removed.

The full implementation of the interpreter is expected to involve significant 
engineering effort. While development is in progress, an implementation of 
reduction rules is required in both the tree walker and interpreter, adding 
redundancy.

Extensibility

The interpreter should evolve alongside the language in the future, allowing for 
new features included in future standards to be supported. We refrain from 
performing any optimizations that would hinder the implementation of additional 
features, such as constexpr support for alloca and exception handling.

Thanks for reading!

Nandor Licker