[llvm-dev] [RFC] DebugInfo: A different way of specifying variable locations post-isel

Tue Feb 25 08:46:18 PST 2020

> On Feb 24, 2020, at 9:31 AM, Jeremy Morse <jeremy.morse.llvm at gmail.com> wrote:
> 
> Hi debuginfo cabal,
> 
> tl;dr: I'd like to know what people think about an alternative to
> DBG_VALUE instructions describing variable locations in registers,
> virtual or real. Before instruction selection in LLVM-IR we identify
> the _values_ of variables [0] by the instruction that computes the
> value; I believe we should be able to do the same post-isel, and it
> would avoid having to analyse register locations across regalloc and
> numerous optimisations. Or written another way: why don't we track the
> value of variables through backend codegen, and then determine a
> register location very late?
> 
> This is just an idea with no solid proposal of work. IMO this would
> reduce the amount of code and complexity involved in preserving
> variable locations. It would also help eliminate debug instructions in
> a far flung future.
> 
> Background:
> 
> In optimised LLVM-IR, we specify a variable location like so:
> 
>  %2 = someinst %1, %0
>  call @llvm.dbg.value(metadata i32 %2, ...)
> 
> A dbg.value intrinsic call specifies two things about a variable:
> * The SSA-register / otherwise that is the value of the variable, and,
> * The position in the instruction stream where that SSA-register
> becomes the variable location.
> 
> I'm using the term "machine location" and "program location"
> throughout this email to mean the two items above, respectively. This
> representation is good for LLVM-IR: the SSA-register machine location
> entirely and uniquely identifies a computation, the value of which
> should appear as the value of the variable in a debugger.
> 
> Post-isel, the same sequence is represented by:
> 
>  %2 = some-machine-inst %1, %0
>  DBG_VALUE %2, ...
> 
> Which to a large extent means the same thing. However, there are some
> subtle differences that manifest as the function proceeds through the
> codegen pipeline:
> * The specified virtual register (%2) doesn't always contain the
> value produced by "some-machine-inst". Once we leave SSA-form, there
> can be multiple def's of the vreg after PHI-elimination / register
> coalescing.
> * The vreg does not uniquely identify the value produced by
> "some-machine-inst": COPY instructions introduced during SelectionDAG
> / PHI-elimination / other passes place the value into multiple vregs,
> that can have different liveness ranges.
> 
> The problem:
> 
> Those two differences between dbg.value intrinsics and DBG_VALUE
> instructions introduce some annoying artifacts that make handling
> DBG_VALUEs harder than dbg.values:
> * Identical DBG_VALUEs at different program locations can result in
> different variable values being presented (because their vreg operand
> might refer to a different def),
> * There can be multiple ways to represent a dbg.value in DBG_VALUEs
> (as you have a choice of vregs from COPY instructions), some with
> different lifetimes.
> 
> Both of which make the movement and preservation of DBG_VALUEs much
> more context-dependent than the LLVM-IR equivalent. It's a lot easier
> to cause an incorrect value to appear in a debugger at this stage of
> compilation, or limit the range over which we preserve a variable
> location.
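
A toy illustration of the first artifact above, not LLVM code: a straight-line block is modelled as a list of tuples, and every name here (resolve_dbg_values, the "def"/"dbg" tags, the value ids) is hypothetical. It only shows that two textually identical DBG_VALUEs can observe different defs once a vreg has more than one def.

```python
# Toy model of a machine basic block after leaving SSA form.
# ("def", vreg, value_id): some instruction writes vreg; value_id names the computation.
# ("dbg", vreg, variable): a DBG_VALUE saying "variable lives in vreg from here on".
# All names are hypothetical and exist only for this sketch.

def resolve_dbg_values(block):
    """For each DBG_VALUE, in order, report which def its vreg holds at that point."""
    current = {}   # vreg -> value_id of the most recent def
    seen = []
    for kind, vreg, payload in block:
        if kind == "def":
            current[vreg] = payload
        elif kind == "dbg":
            seen.append((payload, current.get(vreg)))
    return seen

block = [
    ("def", "%2", "some-machine-inst"),
    ("dbg", "%2", "x"),
    ("def", "%2", "other-inst"),   # second def of %2, possible once out of SSA form
    ("dbg", "%2", "x"),            # textually identical to the first DBG_VALUE
]
result = resolve_dbg_values(block)
```

Both DBG_VALUEs name %2, yet they resolve to different computations, which is why moving one of them past a def silently changes what the debugger would show.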
> 
> There are currently three instruction scheduling passes in LLVM
> (machine-scheduler, postra scheduler, SelectionDAG does some too)
> which don't have any principled approach to preserving the correctness
> of variable locations, and are vulnerable to the artifacts above. The
> first two just glue DBG_VALUEs to the preceding machine instruction
> and move them around together (vulnerable to assignment reordering and
> referring to the wrong {v,}reg def), the latter can re-order
> assignments but also finds it hard to select the longest-living vreg,
> which I wrote up in [1]. Correctly scheduling DBG_VALUEs to always:
> * refer to the correct vreg def,
> * with the longest lifetime,
> * without re-ordering assignments,
> is sufficiently hard that no-one has attempted it to my knowledge, and
> I believe it would be really difficult to get right. Additionally, if
> we were to generate DBG_VALUE $noreg instructions when rescheduling
> (to terminate earlier variable locations), and then a subsequent
> scheduling pass undoes that rescheduling (or some part of it), we will
> lose or shorten variable locations for no reason.
> 
> Finally, being forced to always specify both the machine location and
> the program location at the same time (in a single DBG_VALUE)
> introduces unnecessary burdens. In MachineSink, when we sink between
> blocks an instruction that defines a vreg, we chose to sink DBG_VALUE
> instructions referring to that vreg too to avoid losing the variable
> location. This unnecessarily risks re-ordering assignments, and in
> some circumstances [2] you would have to examine all the instructions
> in the function to work out whether sinking a DBG_VALUE would be
> legal. In SimpleRegisterCoalescing, when we merge two vregs,
> DBG_VALUEs can only refer to the surviving vreg -- and at the
> DBG_VALUE's location that vreg might not contain the right def. There
> may be other machine locations where the correct value is available
> (it may even be rematerialized later), but searching for it is hard;
> right now we just drop variable location information in these cases.
> 

Makes sense so far.

> A solution:
> 
> [To be clear, I haven't tried to implement this idea yet as I wanted feedback.]
> 
> I'd like to suggest that we can represent variable locations in the
> codegen backend / MIR with three things:
> * The instruction that defines the value of the variable,
> * The operand of that instruction into which the value is written,
> * The position in the instruction stream where the assignment of this
> value to the variable occurs

What about constants and memory locations?

> That's effectively modifying a machine location from being a {v,}reg,
> into being a "defining instruction" and operand. This is closer to the
> LLVM-IR form of a machine location, where the SSA Value and its
> computation are synonymous. Exactly how this is represented in-memory
> and in-printed-MIR I haven't thought a lot about; probably by
> attaching metadata to instructions and having DBG_VALUE use a metadata
> operand rather than referring to a vreg. Specifying machine locations
> like this would have the following benefits:
> * Both DBG_VALUEs and defining instructions are independent and can
> be moved within the function without loss of information, and without
> needing to consider so much context,

What is the difference between attaching the DBG_VALUE to the instruction and moving the DBG_VALUE together with the preceding non-debug instruction?

What do you do with code like this:

int a = x;
int b = 23;
...
b = a;

mov rax, %x
DBG_VALUE rax, "a"
DBG_VALUE 23, "b"
... 
DBG_VALUE rax, "b"

where the "defining instruction" is far away from the DBG_VALUE?

-- adrian

> * Likewise, vregs can be rewritten / merged / deleted without the
> need to update any debug metadata. Only instruction deletion /
> morphing would need some sort of change,
> * We would never need to refer to COPYs, avoiding artificial liveness
> limitations,
> * Debug use before defs would become tolerable (see below), and
> possibly even be a good way of describing locations after
> optimisations.
> 
> This would not eliminate the risk of re-ordering variable assignments.
> 
> The three instruction scheduling passes would become significantly
> easier to deal with: they would only have to replace DBG_VALUE
> instructions in the correct order, not worry about their operands.
> Various debug facilities in SimpleRegisterCoalescing, MachineSink, and
> large amounts of LiveDebugVariables would become redundant, as we
> wouldn't need to maintain a register location through optimisations.
> 
> Finally, this design could be extended to not having any instructions
> in the instruction stream. Once machine locations aren't described
> within a MachineOperand, the most important thing a DBG_VALUE
> signifies is a position in the instruction stream, which could be
> performed in some other way (i.e., more metadata) in the future.
> 
> How then do we translate this new kind of machine location into
> DWARF/CodeView variable locations, which need to know which register
> to look in? The answer is: LiveDebugValues [3]. We already perform a
> dataflow analysis in LiveDebugValues of where values "go" after
> they're defined: we can use that to take "defining instructions" and
> determine register / stack locations. We would need to track values
> from the defining instruction up to the DBG_VALUE where that value
> becomes a variable location, after which it's no different from the
> LiveDebugValues analysis that we perform today. LiveDebugValues'
> ability to track values through stack spills and restores would become
> a critical feature (it isn't today), as we would no longer generate
> stack locations during register allocation.
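
A minimal sketch of the kind of tracking described above, under heavy assumptions: a single straight-line block, no control flow and no dataflow join, so this is not how LiveDebugValues is implemented. The opcode names, the tuple layout and track_value are all hypothetical. It follows one value from its defining instruction through a spill, a clobber of the original register, a restore and a copy.

```python
# Each instruction is (opcode, dst, src). "copy", "spill" and "restore" all move
# the contents of src into dst; "def" writes an unrelated new value into dst.
# Opcodes, register names and slot names are hypothetical.
MOVES = {"copy", "spill", "restore"}

def track_value(block, def_index):
    """Return, after each instruction, the sorted list of locations that hold
    the value produced by block[def_index]."""
    locations = set()
    history = []
    for i, (opcode, dst, src) in enumerate(block):
        holds_value = src in locations   # checked before the write to dst
        locations.discard(dst)           # any write clobbers the old contents of dst
        if i == def_index or (opcode in MOVES and holds_value):
            locations.add(dst)
        history.append(sorted(locations))
    return history

block = [
    ("def", "$rax", None),          # 0: the defining instruction
    ("spill", "slot0", "$rax"),     # 1: value now also lives on the stack
    ("def", "$rax", None),          # 2: unrelated def clobbers $rax
    ("restore", "$rbx", "slot0"),   # 3: value reloaded into a different register
    ("copy", "$rcx", "$rbx"),       # 4: and copied again
]
history = track_value(block, def_index=0)
```

After instruction 2 the only surviving location is the stack slot, which is the case where following spills and restores decides whether the variable location survives at all.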
> 
> I reckon debug-use-before-def's can be tolerated in this
> representation, and even be well defined and useful, reducing the work
> needed to be done earlier in the compiler. Under the model described
> above, we can specify a program location before the corresponding
> machine location containing the variable value becomes available.
> Consider this code:
> 
>  DBG_VALUE output-of-this-inst ---
>  someinst1                        |
>  someinst2                        |
>  $rax = ADD32ri $rax, 0     <-----
> 
> Where the line from DBG_VALUE to ADD32ri represents some
> as-yet-undetermined way of identifying the ADD32ri instruction from
> the DBG_VALUE. We can interpret such a code sequence as the variable
> having no location across someinst1 and someinst2, which are not
> dominated by the defining instruction, then a location of $rax after
> the ADD32ri. Essentially:
> * For an instruction dominated by a DBG_VALUE but not by the defining
> instruction, the variable location is empty / undef / $noreg,
> * For an instruction dominated by both, the variable location is
> defined as it is today.
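
One way to read the two rules above, as a sketch only: in a single straight-line block, "dominated by" reduces to "comes after", so the rules can be checked by walking the block once. The tuple layout, the labels and variable_location_after_each are hypothetical, and the sketch ignores later clobbers of the register (that part is the "as it is today" analysis).

```python
# ("dbg", variable, label): DBG_VALUE naming the instruction labelled `label`
#                           as the one that defines the variable's value.
# ("inst", label, dst):     a real instruction; dst is the register it writes, or None.
# All names are hypothetical and exist only for this sketch.

def variable_location_after_each(block, variable):
    """Return (label, location) for every real instruction, where location is the
    variable's register after that instruction runs, or None if unavailable."""
    pending = None   # label named by the most recent DBG_VALUE for `variable`
    defined = {}     # label -> register written, for instructions already passed
    out = []
    for entry in block:
        if entry[0] == "dbg":
            _, var, label = entry
            if var == variable:
                pending = label
        else:
            _, label, dst = entry
            defined[label] = dst
            # Location exists only once both the DBG_VALUE and the defining
            # instruction have been passed; otherwise it is empty (None).
            out.append((label, defined.get(pending) if pending is not None else None))
    return out

block = [
    ("dbg", "x", "add"),          # debug use before def
    ("inst", "someinst1", None),
    ("inst", "someinst2", None),
    ("inst", "add", "$rax"),      # the defining instruction
]
locs = variable_location_after_each(block, "x")
```

The variable has no location across someinst1 and someinst2 and becomes $rax after the defining instruction, without any explicit DBG_VALUE $noreg in the input.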
> 
> This should work across control flow, and doesn't necessitate the
> creation of DBG_VALUE $noreg's to explicitly describe unavailable
> locations when instructions move. In theory, if we were to accept
> debug use-before-defs in LLVM-IR, this would reduce analysis and mean
> fewer dbg.value(undef,...)'s would need to be created earlier in the
> compiler.
> 
> Limitations
> 
> The largest problem with this idea is that not all variable values are
> defined by instructions: PHIs are values that are defined by control
> flow. To deal with this pre-regalloc, we could move LiveDebugVariables
> to run before phi-elimination. My understanding is that the register
> allocation phase of LLVM starts there and ends after virtregrewriter,
> and it'd be legitimate to say "we do special things for these passes".
> After regalloc however, there would need to be some way of specifying
> a block and a register, where entry to the block defines a variable
> value in that register. This isn't pretty; but IMO is the
> representation closest to the truth. Passes like tail duplication and
> branchfolder might need to perform debuginfo maintenance when they
> altered blocks -- however I believe these circumstances are rare, as
> few control flow changes happen after regalloc. It (IMO) would be
> worth it given the other benefits.
> 
> I also haven't considered the impact of this on -O0: one would imagine
> it would be easier to deal with than optimised builds though.
> 
> Discussion
> 
> I feel like this would be a better way of representing variable
> locations in the codegen backend; my fear is that this is a lot of
> work, and I don't know what appetite there is for change amongst other
> interested parties. Thus I'd be interested in any kind of feedback as
> to whether a) this is a good idea, b) whether this category of change
> is what people want, and c) whether this is seen as being achievable.
> 
> Being able to introduce this change incrementally presents some
> challenges: while the way of representing variable locations described
> above is more expressive than the current way, converting between one
> and the other requires running the LiveDebugValues analysis, which
> makes moving transparently between the two hard to do. Moving
> backwards through the backend, from emission towards the start might
> be doable though.
> 
> This introduces some additional complexity into a pass
> (LiveDebugValues) that's been difficult to understand and reason about
> in the past. In my opinion, given that we have to perform this
> dataflow analysis at the end of compilation to propagate variable
> locations anyway, it would be worthwhile to harness it to remove the
> need for complexity elsewhere. Some of the problems I've described
> above need their own dataflow analyses to be both sound and complete:
> IMO it would be better to record the bare minimum of facts and then
> interpret them at the end of compilation.
> 
> Happily there are "only" 130 tests that input or output MIR in
> llvm/test/DebugInfo, so this doesn't involve rewriting *every* single
> test that there is.
> 
> [0] You could consider an SSA register a "location" too, my point is
> that it's both a value and a location.
> [1] https://bugs.llvm.org/show_bug.cgi?id=41583
> [2] https://bugs.llvm.org/show_bug.cgi?id=44117
> [3] You knew it was coming!
> 
> --
> Thanks,
> Jeremy