[LLVMdev] [Valgrind-developers] [GSoC 2014] Using LLVM as a code-generation backend for Valgrind

Julian Seward jseward at acm.org
Wed Feb 26 07:16:58 PST 2014


On 02/26/2014 12:23 PM, Kirill Batuzov wrote:

I tend to agree with Kirill.  It would be great to make Valgrind/Memcheck
faster, and there are certainly ways to do that, but using LLVM is not
one of them.

> Second, in DBT you translate code in small portions like basic blocks
> or extended basic blocks. They have a very simple structure. There are
> no loops, and there is no redundancy from translating a high-level
> language to a low-level one. There is nothing sophisticated
> optimisations can do better than very simple ones.

Yes.  One of the problems of the "Let's use LLVM and it'll all go much
faster" concept is that it lacks a careful analysis of what makes Valgrind
(and QEMU, probably) run slowly in the first place.

As Kirill says, the short blocks of code that V generates make it
impossible for LLVM to do sophisticated loop optimisations etc.
Given what Valgrind's JIT has to work with -- straight line pieces
of code -- it generally does a not-bad job of instruction selection
and register allocation, and I wouldn't expect that substituting LLVM's
implementation thereof would make much of a difference.

What would make Valgrind faster is:

(1) improve the caching of guest registers in host registers across
    basic block boundaries.  Currently all guest registers cached in
    host registers are flushed back into memory at block boundaries,
    and no host register holds any live value across the boundary.
    This is simple but very suboptimal, creating large amounts of
    memory traffic.
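
    As an illustration of the store traffic at stake, here is a toy cost
    model in plain C (this is not Valgrind code; the function names and
    the trace sizes in the assertions are invented for illustration):

```c
/* Hypothetical cost model: count guest-register stores for a trace of
   n_blocks blocks, each caching regs registers.  Not Valgrind code. */

/* Current scheme: every guest register cached in a host register is
   flushed back to the in-memory guest state at every block boundary. */
static long stores_flush_each_block(long n_blocks, long regs)
{
    return n_blocks * regs;
}

/* Improved scheme (sketch): cached registers stay live across block
   boundaries, and only the dirty ones are written back once, when the
   trace exits. */
static long stores_cache_across_blocks(long n_blocks, long regs_dirty)
{
    (void)n_blocks;        /* no per-block flush */
    return regs_dirty;     /* single write-back at trace exit */
}
```

    Even for a modest trace the difference is two orders of magnitude
    in stores, which is consistent with the memory-traffic concern.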

(2) improve the way that the guest program counter is represented.
    Currently it is updated before every memory access, so that if an
    unwind is required, it is possible.  But this again causes lots of
    excess memory traffic.  This is closely related to (1).
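
    A sketch in C of the store pattern this implies (the guest-state
    layout, offsets and PC values below are invented, not actual VEX
    output):

```c
/* In-memory guest state; GUEST_PC is a hypothetical offset of the
   guest program counter within it. */
static unsigned long guest_state[64];
#define GUEST_PC 16

/* Eager scheme (current): the guest PC is stored before every memory
   access so that an unwind at that access sees the right value. */
long run_block_eager(const long *a, const long *b)
{
    guest_state[GUEST_PC] = 0x1000;  /* store PC before 1st access */
    long x = *a;
    guest_state[GUEST_PC] = 0x1004;  /* store PC before 2nd access */
    long y = *b;
    return x + y;
}

/* Lazy alternative (sketch): keep the PC in a register and only
   materialise it when something actually needs to unwind, or once at
   the block end. */
long run_block_lazy(const long *a, const long *b)
{
    long x = *a;                     /* no PC store */
    long y = *b;
    guest_state[GUEST_PC] = 0x1008;  /* single store at block end */
    return x + y;
}
```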

(3) add some level of control-flow if-then-else support to the IR, so
    that the fast-case paths for the memcheck helper functions
    (helperc_LOADV64le etc) can be generated inline.
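
    A rough sketch of what such an inline fast path could look like,
    written as C rather than IR (the shadow encoding here -- one V-byte
    per data byte, 0x00 meaning "fully defined" -- and all names except
    helperc_LOADV64le are simplified stand-ins, not Memcheck's real
    layout):

```c
#include <stdint.h>
#include <string.h>

/* Toy shadow memory: one V-byte per data byte, 0 = defined. */
static uint8_t vbits[256];

/* Stub standing in for the out-of-line helper (helperc_LOADV64le). */
static uint64_t slow_LOADV64le(uint64_t addr)
{
    (void)addr;
    return 0xFFFFFFFFFFFFFFFFull;    /* "some bytes undefined" */
}

/* With if-then-else in the IR, the JIT could emit this test inline
   instead of unconditionally calling the helper. */
static uint64_t check_LOADV64le(uint64_t addr)
{
    if ((addr & 7) == 0) {           /* aligned: the common case */
        uint64_t v;
        memcpy(&v, vbits + addr, 8); /* load 8 V-bytes at once */
        if (v == 0)                  /* all eight bytes defined */
            return 0;                /* fast case, stays inline */
    }
    return slow_LOADV64le(addr);     /* rare case goes out of line */
}
```

    The point is that the common aligned-and-defined case never leaves
    the generated code, avoiding a call, spill and return per load.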

(4) Redesign Memcheck's shadow memory implementation to use a 1 level
    map rather than 2 levels as at present.  Or something more
    TLB-like.
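
    To make the lookup-cost difference concrete, here is a scaled-down
    model using a toy 16-bit address space (the sizes and names are
    illustrative only; Memcheck's real primary/secondary layout
    differs):

```c
#include <stdint.h>

/* Toy 16-bit address space so both layouts fit in static storage. */
#define ADDR_BITS 16
#define SEC_BITS   8
#define N_SECS (1 << (ADDR_BITS - SEC_BITS))

static uint8_t  secondaries[N_SECS][1 << SEC_BITS];
static uint8_t *primary_map[N_SECS];
static uint8_t  shadow_flat[1 << ADDR_BITS];

static void init_shadow(void)
{
    for (int i = 0; i < N_SECS; i++)
        primary_map[i] = secondaries[i];
}

/* Two-level lookup (current scheme): two dependent loads per access. */
static uint8_t shadow_2level(uint16_t addr)
{
    uint8_t *sm = primary_map[addr >> SEC_BITS];  /* load 1: PM entry */
    return sm[addr & ((1u << SEC_BITS) - 1)];     /* load 2: V-byte   */
}

/* One-level lookup (sketch): a single load from a flat shadow. */
static uint8_t shadow_1level(uint16_t addr)
{
    return shadow_flat[addr];
}

/* Keep both layouts in sync for the demo. */
static void set_shadow(uint16_t addr, uint8_t v)
{
    primary_map[addr >> SEC_BITS][addr & ((1u << SEC_BITS) - 1)] = v;
    shadow_flat[addr] = v;
}
```

    The one-level scheme trades address-space reservation for removing
    a dependent load from every shadow access; a TLB-like cache of
    recently used secondaries would sit somewhere between the two.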

I suspect that the combination of (1) and (2) causes processor write
buffers to fill up and start stalling, although I don't have numbers
to prove that.  What _is_ very obvious from profiling Memcheck using
Cachegrind is that the generated code contains a much higher proportion
of memory references than "normal integer code".  In particular it
contains perhaps 4 times as many stores as "normal integer code",
which can't be a good thing.

(3) is a big exercise -- much work -- but potentially very beneficial.
(4) is also important if only because we need a multithreaded
implementation of Memcheck.  (1) and (2) are smaller projects and would
constitute a refinement of the existing code generation framework.

> In conclusion I second what has already been said: this project sounds
> like fun to do, but do not expect many practical results from it.

The above projects (1) .. (4) would also be fun :-) and might generate more
immediate speedups for Valgrind.

J



