[llvm-dev] [RFC] Moving llvm.dbg.value out of the instruction stream

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Tue Oct 23 10:02:15 PDT 2018


On Mon, Oct 22, 2018 at 6:43 PM Reid Kleckner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> One of the major drawbacks of our current variable tracking debug info
> design is that it puts lots of llvm.dbg.value instructions in the
> instruction stream. After the debug info BoF at the 2018 dev meeting (last
> week), I started to seriously consider a design where we track variable
> locations on the side of the instruction stream. People seemed to agree
> this idea was good enough to at least explore, so I put together some
> exploratory data and ideas.
>
> Our current design tracks variable values with llvm.dbg.value
> instructions, which live in the instruction stream. This is crummy, because
> passes have to know to filter them. Today we have filtering iterators, but
> most passes have not been audited to use them, and we regularly have bugs
> where enabling debug info changes what the optimizer does. If we move
> variable tracking out of the instruction stream, we can fix this entire
> class of bugs by design.
>
> Second, the current design slows down basic block iteration. My intuition
> says that, after inlining, we have *tons* of dbg.value instructions
> clogging up our basic blocks, slowing down our standard optimizer pipeline.
> Linked list iteration isn't fast on modern architectures, so if dbg.values
> are an appreciable fraction of the elements, moving them out of the list will
> speed up compilation and make -g less of a compile-time bear.
>
> One thing this change will *not* address is the kind of use-before-def
> bugs we get today where dbg info unaware passes sink or hoist SSA
> instructions across the dbg.value that uses them. It has been suggested
> that we should attach the dbg.value directly to the instruction producing
> the value, but I think this is a mistake, because it only handles the
> common case, and not the special case where the value is produced far from
> the point of the assignment. Even if we have the special case attachment
> for value-producing instructions, we need new APIs to help passes decide
> how to update the debug info when sinking or hoisting across blocks.
>
>
> Data & Motivation
> -----------------
>
> I used my compile-time test case of the day for this: SemaChecking.cpp. I
> re-compiled opt with the attached patch, which gets the total number of
> Instructions after CGP, and the number of dbg intrinsics. I compiled
> SemaChecking.cpp to bitcode with `clang -cc1 -debug-info-kind=limited
> -gcodeview -mllvm -instcombine-lower-dbg-declare=0` to match what Chromium
> uses. I think these numbers are general, though. Here are the numbers I got
> from my patch:
>
> 160402 codegenprepare - Total number of llvm.dbg.* instructions
> 296080 codegenprepare - Total number of instructions
>
> So, 54% of all instructions are dbg.(value|declare) instructions today!
> Moving them out of the main BB ilist would speed up BB iteration by 2x.
> That's not necessarily where we spend our time. I happen to already know
> that this test case does a lot of BB scanning (PR38829), so I know it will
> help this test case, but it may not matter if we make local dominance not
> scan.
>
> In any case, here is the time to compile to bitcode for this example:
> -O2 -g: 89.6s
> -O2: 53.6s
> -O2 -gmlt: 52.6s
>

Not urgent, but before someone dives into this, it might be worth doing more
detailed measurements (removing some of the noise) - such as disabling debug
info emission, and/or just running the O3 pipeline in opt specifically,
without debug info generation in Clang or emission in the LLVM backend. That
would give a more accurate measurement of the cost of these intrinsics in the
ways you're suggesting. (Conveniently, this is a bit easier to measure than
the bitcast stuff - because you can compare with/without, as you have here -
whereas the bitcasts are hard to measure without actually removing them.)


> -gmlt should be slower than no debug info, so clearly my system is noisy,
> but I think the impact of dbg.value is still clearly observable.
>
> The next thing I noticed is that dbg.values tend to appear in packs. My
> theory is that most of this is from multiple levels of inlining of helpers
> that forward parameters. Here is a histogram of the length of dbg intrinsic
> runs:
>
> dbg intrinsic run length |  freq
>                        1 | 13086
>                        2 |  6587
>                        3 |  3927
>                        4 |  1605
>                        5 |  1643
>                        6 |  1414
>                        7 |  2165
>                        8 |  1708
>                        9 |   226
>                       10 |    36
>                       11 |    26
>                       12 |    14
>                       13 |  1750
>                       14 |    70
>                       15 |   224
>                       16 |     3
>                       17 |   154
>                       18 |     5
>                       19 |     1
>                       20 |     2
>                       22 |   209
>                       23 |   688
>                       24 |   705
>                       28 |     4
>                       29 |     1
>                       36 |     1
>                       40 |     2
>                       71 |     1
>
> My understanding of the histogram above is that 52% of dbg.values are in
> runs of length 8 or greater. You can see some spikes here at 13 and 22-24,
> which I think supports my theory about inlining above.
>
> This example is just a standard -O2 large TU in clang. I can only imagine
> that the dbg.value runs get longer in ThinLTO, where the inliner runs more.
>
>
> Design
> -------
>
> dbg.value is meant to associate an SSA value with a label. My basic idea
> is that it's better to attach a label to the instruction it precedes rather
> than creating a separate instruction. However, a label should stay behind
> if the instruction it precedes moves. For example,
> Instruction::moveAfter/moveBefore should implicitly reattach their variable
> tracking info to the next instruction. While working on
> https://reviews.llvm.org/D51664, I realized it would be pretty easy to
> use the ilist callbacks currently used for symbol table updating to
> implement this.
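>
> For illustration, here is a rough sketch of that reattachment rule in plain
> C++ against LLVM's IR classes; `takeAttachedDbgValues` is a hypothetical
> helper, not an existing LLVM API:
>
>   #include "llvm/IR/Instruction.h"
>   using namespace llvm;
>
>   // Hypothetical helper: hand any variable-location records attached to
>   // From over to To, so they keep describing the point just before To.
>   void takeAttachedDbgValues(Instruction &To, Instruction &From);
>
>   // What moveBefore/moveAfter (and erase) would do implicitly: the
>   // records stay behind at the old position instead of travelling with I.
>   void leaveDbgValuesBehind(Instruction &I) {
>     if (Instruction *Next = I.getNextNode())
>       takeAttachedDbgValues(*Next, I);
>     // If I is the block terminator there is no next instruction here; see
>     // the note about invoke and successor blocks below.
>   }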
>
> I think making the attachments describe the values of variables before the
> instruction is good, because in the general case every block has at least
> one instruction, which typically doesn't produce a value: the terminator.
> The only terminator that produces a value is invoke, and we can attach any
> variable tracking info for it to the first (non-phi?) instruction in the
> normal successor.
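>
> As a small, concrete illustration of that rule (these are real IR APIs, but
> the function itself is just a sketch):
>
>   #include "llvm/IR/BasicBlock.h"
>   #include "llvm/IR/Instructions.h"
>   using namespace llvm;
>
>   // With "attachments describe variable values before the instruction",
>   // a record for an invoke's result cannot precede anything in the
>   // invoke's own block (the invoke is the terminator), so it would hang
>   // off the first non-PHI instruction of the normal destination.
>   Instruction *attachmentPointForInvokeResult(InvokeInst &II) {
>     return II.getNormalDest()->getFirstNonPHI();
>   }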
>
> I think, overall, this is just an internal representation shift. We may
> want to change our bitcode and assembly language to better align with our
> internal representation, but it should be functionally equivalent.
>
>
> IR syntax
> ----------
>
> This is kind of important, since it affects how developers think about
> these and keep them up to date. Consider the following C source:
>
>   int foo(void);
>   int bar(void) {
>     int v = foo();
>     v = foo();
>     return v;
>   }
>
> I'm proposing moving from something like this:
>
>   %v1 = call i32 @foo()
>   call void @llvm.dbg.value(metadata i32 %v1, metadata !123, metadata !DIExpression()), !dbg !456
>   %v2 = call i32 @foo()
>   call void @llvm.dbg.value(metadata i32 %v2, metadata !123, metadata !DIExpression()), !dbg !456
>   ret i32 %v2
>
> To something like this:
>
>   %v1 = call i32 @foo()
>   !dbgvalue i32 %v1, variable !123, loc !456 [, expr ...]
>   %v2 = call i32 @foo()
>   !dbgvalue i32 %v2, variable !123, loc !456 [, expr ...]
>   ret i32 %v2
>
> 'expr' would be an optional DIExpression, expressed more compactly if
> possible, perhaps using our inline !DIExpression parsing for now.
>
> This also avoids the confusion that today the dbgloc attached to a
> dbg.value is not actually used to generate line table entries; it is only
> used to distinguish variables created from different inlined call sites.
>
>
> Data structures
> ----------------
>
> What we're really trying to represent is two separate sequences. For
> simplicity and minimal change, I initially propose that we create a
> secondary linked list of some new DbgValue data structures. They would
> relate kind of like this:
>
>   inst1 -> inst2 -> inst3 -> inst4
>     \        \        \        \
>     dv1 ->   dv2 ->   dv3 ->   dv4
>
> If we were to delete inst2, dv2 would be reattached to the following
> instruction:
>
>   inst1 -> inst3 ->          inst4
>     \        \                 \
>     dv1 ->   dv2 ->   dv3 ->   dv4
>
> What's interesting is that once dv2 and dv3 are grouped together, it's
> actually a bug if we ever generate code between them. I can imagine more
> compact representations than a linked list, but we want to be able to
> efficiently join two sequences of dbgvalues when deleting or moving a
> single instruction, and linked lists achieve that. The list would be owned
> by the BasicBlock, since if a whole block is unreachable, there's nowhere
> to put the value tracking info.
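>
> To make the splicing concrete, here is a tiny self-contained toy model in
> plain C++ (illustrative names, not LLVM's actual classes; for brevity each
> toy instruction holds its own run of records, whereas the proposal above
> has one side list owned by the BasicBlock):
>
>   #include <iterator>
>   #include <list>
>
>   struct DbgValue {
>     // variable, expression, debug location and value operand go here
>   };
>
>   struct ToyInst {
>     // Records describing variable values *before* this instruction.
>     std::list<DbgValue> AttachedDbgValues;
>   };
>
>   struct ToyBlock {
>     std::list<ToyInst> Insts;
>
>     // Erase an instruction; its attached records stay behind by being
>     // spliced onto the front of the following instruction's run in O(1).
>     // In the diagram above, erasing inst2 leaves inst3 with the run
>     // dv2, dv3.
>     void erase(std::list<ToyInst>::iterator It) {
>       auto Next = std::next(It);
>       if (Next != Insts.end())
>         Next->AttachedDbgValues.splice(Next->AttachedDbgValues.begin(),
>                                        It->AttachedDbgValues);
>       // else: no later instruction; in the real design the block itself
>       // would have to keep (or drop) the leftover records.
>       Insts.erase(It);
>     }
>   };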
>
>
> Actually doing it
> -----------------
>
> LLVM currently has several unfinished migrations going on and a fair
> amount of technical debt. It's not clear that this project is the top
> priority for me or anyone else at this moment, so even if people like this
> idea, don't assume it will be implemented soon. I think similar efficiency
> reasoning applies to the pointee-type removal API migration that David
> Blaikie started.
>

Yeah, pretty much - though it's nice that we can measure the likely impact
of this ahead of time, unlike with the bitcast work. Easier to assess the
importance, etc.

(Well, I guess it's a bit different - the cost we can measure here isn't
such a concern for pointer bitcasts - but there's also a concern that covers
both: that they hinder transformations, rather than just slowing them down,
because passes may not correctly ignore them.)


> Casts are very similar to dbg.values in this way.
>
> However, I wanted to write it all down before letting it fade from memory.
> Apologies for the length; I didn't have time to make a shorter RFC. =P
>

For sure - worth having it down - thanks for that!


> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>