[llvm-dev] [RFC] A value-tracking LiveDebugValues implementation

Thu Jun 18 08:35:19 PDT 2020

Hi debuginfo-cabal,

tl;dr: Let's please consider using a new implementation of LiveDebugValues
that produces richer information, might be slightly faster, but mostly will
support the instruction referencing and value tracking paradigm from my RFC [0]
rather than the location tracking that LiveDebugValues does today.

In that RFC, the main motivator is to treat variable locations a bit more like
SSA, having variable location metadata refer to the instructions that define a
value, and track the value moving around registers from there. This
prototype [1] is an implementation of the "track" mechanism, where we produce
value numbers for every 'def' in a function, and track them being moved around.
Variable locations are then resolved and extended by following where the value
numbers go.

This matches, to an extent, the "Register Data Flow" / RDF graph analysis
that recently moved from the Hexagon backend to CodeGen. I had a look, and
it appears RDF computes much richer information than LiveDebugValues needs
(specifically, all uses of values) and allocates for each value number, which
is quite a bit of overhead. I decided it'd be simpler to make a new
LiveDebugValues implementation, as we've had performance problems with
LiveDebugValues in the past, and we don't need to track additional metadata
for each value, only number them uniquely.

There are few independent reasons to adopt this new implementation, and I'm
self-conscious about dropping a code-bomb on people. My questions for this
email are:
 * Does the new implementation (described below) make sense, and
 * If so, how would I go about landing this?

We could wait until there's a complete suite of patches for the instruction
referencing work (I've got a working prototype, but not production-quality).
On the other hand, this new LiveDebugValues implementation is probably ready
for review now, works independently, and is very much the hardest part of
instruction referencing, so it'd be nice to get it under-way earlier. My
preferred outcome would be adding it as a parallel implementation to the
current LiveDebugValues, enabled by a runtime switch, then start to land
the rest of the instruction referencing work around it. This reduces the
scope for further bit-rot, and as Vedant mentioned [2] this would all need
to sit under a runtime-switch for a transition period anyway.

# Overview

Today's LiveDebugValues (explained in the file docu-comment [3]) concerns
itself with variable /locations/ only, not the values that they contain. The
implementation is very good at its job: it detects when a location is
invalidated (i.e. clobbered), control flow joins are a simple matter of "Do
the predecessors agree on the location?", and not much more work is needed.

Switching to tracking variable values is more complicated. Following the
original proposal, for this pseudo-MIR code sequence:

    $rax = somedef
    DBG_INSTR_REF reference-to-somedef
    $rbx = COPY killed $rax
    ; Variable location is $rax, then moves to $rbx

We need to track that:
 a) $rax contains a value defined by the "somedef" instruction,
 b) The DBG_INSTR_REF needs a location for the value somedef defines.
 c) That value from "somedef" is copied to rbx,

The new LiveDebugValues implementation doesn't have DBG_INSTR_REF yet, so
I've written it to work with DBG_VALUEs too, to prove that it works just as
well as the current implementation. It deals with code sequences such as:

    $rax = somedef
    DBG_VALUE $rax
    $rbx = COPY killed $rax
    ; Variable location is $rax, then moves to $rbx

In which the same tracking has to occur, but the DBG_VALUE "reads" a value
number from $rax. As the variable locations are determined by the value number
that the variable has, it doesn't matter whether that was specified by a
DBG_VALUE or DBG_INSTR_REF.  The tracking is sufficiently good to produce an
identical binary to the output of current LiveDebugValues (details below).

There are some unrelated benefits to consider: because the new implementation
tracks all locations that a value may be in, we can handle the following
problem, that the current implementation cannot:

  entry:
    $rax = somedef
    DBG_VALUE $rax
    SPILL_TO_STACK %somestackslot, $rax
  loop:
    [... some code...]
    $rax = RESTORE_FROM_STACK %somestackslot
    do_things_with_rax
    brcc loop
  exit:
    ret

Here, there's a variable spilt to the stack, that's temporarily restored
in the loop. At the end of the "loop" block, the value has two locations:
$rax and the stack slot. The current LiveDebugValues always follows the
restore from the stack: it decides the variable is on the stack at the end
of "entry", and in $rax at the end of "loop". Because these locations don't
agree on entry to "loop", the location in "loop" is dropped completely.
Having LiveDebugValues ignore or prefer the restore just changes what kind of
circumstances locations get dropped in: the underlying problem is that the
value can be in multiple places, but we can track only one of those places.

The new implementation has enough information to solve this, as it tracks all
values in all machine locations. Unfortunately it doesn't make much difference
to the location statistics for clang RelWithDebInfo builds. An additional
benefit is that we would be better placed to produce saner locations lists [4].

In terms of performance, the new implementation takes on average just as long
as the current implementation. A full stage2 build of clang may be 0.1% slower,
but it's in the noise on my machine. All the algorithms are much more localised
than the current implementation (see below), and I reckon there's still scope
for further optimisation as it's easier to make decisions such as "Don't try to
join these variable locations in this block, they're immediately re-stated by
DBG_VALUEs".

# Implementation

The new implementation strictly does more work than the old one: it tracks every
value in the function, not just those that are associated with a variable
location. It also has to detect places where control flow merges and produces
PHI values, where the old implementation could just check that the live-out
variable location in predecessors agreed. To address this, I've separated the
implementation into two smaller problems, that can be solved almost
independently and with much greater locality.

Both of the problems below are sort-of-like SSA construction (they have a set
of defs, and a large set of uses), but instead of calculating new PHI
placements, we have to infer where PHIs _were_ placed.

Problem 1: Machine value numbering. Every def in the function is given a
number, which is then condensed into a transfer function for each block,
specifying which registers (or spill locs) contain which values. It's
genuinely just a map of machine location to value. Using the same sort of
dataflow model that LiveDebugValues currently uses, values are propagated to
compute the live-ins and live-out values of each block, with one significant
modification: we assign a new value number when control flow merges different
values, essentially re-computing PHI nodes. This leads to some ungraceful
code: see "Ugly PHI stuff" below. This problem is kept localised by only
considering the registers and spill slots that are accessed, making the
problem size proportionate to the product of the max working set of the
function and the number of blocks. The number of variables used is irrelevant.

Problem 2: Variable value assignment. By using the solution for the first
problem, of "what value number does each machine location contain" at each
program point, every DBG_VALUE can be interpreted as assigning a value to a
variable, not a location. This gets turned into a transfer function too, and
we can then perform another dataflow analysis, propagating what _value_ a
variable has in a block, and computing the set of live-in variable values for
each block. A specific machine location can be picked after this problem is
solved, possibly from several options if multiple locations have the same
value. This isn't a completely clean abstraction: predecessors with different
variable values that are in the same location should be propagated, but the
new implementation has to find and validate a PHI value number that represents
those merged values (see "More ugly PHI stuff" below).

I'm keeping problem 2 localised by propagating variable values for one lexical
scope at a time, and only over the blocks that are in that scope. As well as
localising by turning the main problem into a large number of small problems,
this reduces the amount of work too. The current implementation filters out
variable locations that are out of lexical scope every time a block is
processed, which now no longer needs to happen. Variables in the outermost
scopes won't be re-processed when LiveDebugValues goes around an inner-scope
loop several times; likewise variables in inner scopes won't get re-analysed
when an outer scope variable needs another pass through the function. This also
means that all lexical scopes that cover a single block (or zero) require no
dataflow analysis, which should attenuate the effects of heavy inlining a bit.

The solution to these two problems (block live-in variable values and block
live-in machine location values) can then be combined in a final step through
the function, producing DBG_VALUEs for live-ins and DBG_VALUEs for any
location transfers within a block.

# Testing and validation

The new LiveDebugValues implementation produces an identical AMD64 clang-3.4
binary to the old implementation, based on d481e59863a (Feb 2020). Well,
almost -- it proved too difficult to completely emulate the old version,
instead with the following changes [5] applied to d481e59863a, the two
implementations produce an identical binary:
 * Applied D67500 [6],
 * Ignore identity copies,
 * Force any newly created DBG_VALUEs to be inserted according to the order
   in which DebugVariables appear in the input function. So that they're
   consistent between the two implementations.
 * Disabling all tracking of stack spills.

Ignoring the final point, I think this would demonstrate that the two
implementations are equivalent. Stack spills are tricky: because value
transfers within blocks are left until the final step of the new
implementation, it's hard to work out in advance when old LiveDebugValues
would have transferred a location. It was easier to just compare binaries
where spills hadn't been followed.

IMO, the non-spill (the vast majority) locations being identical should be
sufficient to show that the overall algorithm works, and dealing with spills
is a minor extension.

The tests pass, except for the ones covered by the limitations section, and
a few DBG_VALUE-change-order issues.

# Limitations and unimplemented features

There are a couple of things that I haven't gotten around to implementing
yet, for which the tests fail:

Overlapping fragments: this is trivial to do, but at the back of my job queue,

Entry values: much more interesting. As Djordje has pointed out elsewhere,
LiveDebugValues has been assuming that every DBG_VALUE assigns a new value
to a variable [7], which means if a DBG_VALUE re-states a variable location
when it doesn't have to, we risk dropping an entry value. This will be much
easier in this new model (I haven't had the time to do it yet), because the
machine value numbering means that each entry value has a specific number,
and we can easily identify when it's no longer available. This should just
be a matter of the in-block transfers tracking code identifying that a
variable should have the entry value; it not being any machine location
currently; and issuing an appropriate DBG_VALUE.

# Ugly PHI stuff

Above, I said that this problem is a lot closer to SSA construction than other
classic compiler problems. However, it's actually even weirder: instead of
having a set of uses and defs and computing the locations of PHIs: every
instruction with a DebugLoc in lexical scope is a use, and the PHIs are already
placed just not recorded. Thus the problems above are more about recovering
the locations of PHIs after they're eliminated.

This adds a lot of complexity to the lattice structure used for each variable
location compared to the current LiveDebugValues: there used to only ever be
one candidate location for each variable at every control flow merge, and over
the course of dataflow analysis this was found to be either true or false.
However because we're iteratively discovering new PHI locations as we explore,
the new implementation's lattice consists of:
 * A single candidate def (\top),
 * A PHI value at every control flow merge on the path between between the def
   and the current block,
 * A PHI value at the current block (\bot).
Each of which potentially needs one iteration of dataflow to see whether the
join should produce that value.

This only truly needs exploring at loop heads, and the maximum number of
iterations required is the depth of loops in the function, which is the same as
the old LiveDebugValues implementation. However because all the PHI values
have to be discovered, that number of iterations is required more often than
the old implementation.

I wrote more about the lattice in the docu-comment in the new implementation
[8]. I can produce a worked example too, but this email has gotten a bit
wordy already.

# More ugly PHI stuff

The section above holds true for the variable values analysis too, where
instead of defs we have variable assignments at DBG_VALUEs. Rather than
discovering where PHIs happen and propagating that information, variable
value analysis has to do the same to discover what machine-value PHI would be
needed to merge a variable location from all predecessors, and then examine
whether such a PHI is actually available, and has the right inputs. (This is
annoyingly fiddly).

# Summary

This new implementation goes fast enough, is accurate enough, and appears
to be sound. That isn't a reason to replace the current implementation; but
eventually the instruction reference work would make use of this, so if
possible I'd like to start review on it, if that's acceptable to people.

Seeing how you've just read a lot of words, a motivating reminder: this path
can lead, slowly, to there being no debug instructions for variable locations!
And as described in my RFC sitrep-email, it can be more accurate than
location only tracking too.

[0] http://lists.llvm.org/pipermail/llvm-dev/2020-February/139440.html
[1] Branch is here:
https://github.com/jmorse/llvm-project/commits/ldvars-separation , tip
is 11450def5bc592321ece2a64ea2a8cabbc614c39 at time of sending. I
could open a PR containing the changes, but github mucks them up; best
to just look at the bare file [8].
[2] http://lists.llvm.org/pipermail/llvm-dev/2020-February/139469.html
[3] https://github.com/llvm/llvm-project/blob/3626eba11f2367da88ccb0280a34731243278034/llvm/lib/CodeGen/LiveDebugValues.cpp#L9
[4] https://bugs.llvm.org/show_bug.cgi?id=42834
[5] https://github.com/jmorse/llvm-project/commit/720876fec2a5113d52977d00e0e1f7d526b4c1f0
four commits from here
[6] https://reviews.llvm.org/D67500 , I probably should push to land this
[7] It's not a bad view, and definitely something I've been preaching,
[8] https://github.com/jmorse/llvm-project/blob/11450def5bc592321ece2a64ea2a8cabbc614c39/llvm/lib/CodeGen/LiveDebugValues.cpp#L60

--
Thanks for reading this essay,
Jeremy