[LLVMbugs] [Bug 18270] New: Add support for DWARF path discriminators

Tue Dec 17 16:38:27 PST 2013

http://llvm.org/bugs/show_bug.cgi?id=18270

            Bug ID: 18270
           Summary: Add support for DWARF path discriminators
           Product: new-bugs
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: missing-feature
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: dnovillo at google.com
                CC: dblaikie at gmail.com, echristo at gmail.com,
                    llvmbugs at cs.uiuc.edu
    Classification: Unclassified

This feature in DWARF is mainly used for sample-based PGO
(http://wiki.dwarfstd.org/index.php?title=Path_Discriminators). When the
profiler samples an instruction, it determines the associated source LOC for
that instruction using debug information.

This allows the compiler to map that source LOC back to the IR, compute
block/edge frequencies and use them to guide the optimizer. This mapping is,
naturally, imprecise, but in some cases it can be actively detrimental to the
optimizer.

For instance, take the following program:

$ cat -n a.cc
     1  long bar(long x, long y) {
     2    while (x < y * 30000)
     3      x++;
     4    return x + y;
     5  }
     6
     7  int main() { return bar(0, 100000) ? 0 : -1; }

When executed under a runtime profiler, the instruction-to-SLOC mapping
provides the following profile:

$ cat a.prof 
_Z3barll:3146580:0
2: 87405
3: 87405

so, line 2 (the while header) executes with frequency 87,405. The body (line 3)
executes with frequency 87,405. And main() does not have enough samples to even
register. That seems straightforward enough.

The problem begins when this is converted into IR. We get the following code
for function bar():

define i64 @_Z3barll(i64 %x, i64 %y) #0 {
entry:
  %x.addr = alloca i64, align 8
  %y.addr = alloca i64, align 8
  store i64 %x, i64* %x.addr, align 8
  store i64 %y, i64* %y.addr, align 8
  br label %while.cond, !dbg !11

while.cond:                                       ; preds = %while.body, %entry
  %0 = load i64* %x.addr, align 8, !dbg !11
  %1 = load i64* %y.addr, align 8, !dbg !11
  %mul = mul nsw i64 %1, 30000, !dbg !11
  %cmp = icmp slt i64 %0, %mul, !dbg !11
  br i1 %cmp, label %while.body, label %while.end, !dbg !11

while.body:                                       ; preds = %while.cond
  %2 = load i64* %x.addr, align 8, !dbg !12
  %inc = add nsw i64 %2, 1, !dbg !12
  store i64 %inc, i64* %x.addr, align 8, !dbg !12
  br label %while.cond, !dbg !12

while.end:                                        ; preds = %while.cond
  %3 = load i64* %x.addr, align 8, !dbg !13
  %4 = load i64* %y.addr, align 8, !dbg !13
  %add = add nsw i64 %3, %4, !dbg !13
  ret i64 %add, !dbg !13
} 

Notice that block 'entry' has a jump to block 'while.cond'. This branch
instruction has exactly the same debug info as the instructions in block
'while.cond'.

                     E1
              +-------------+
              |             |
              v             |
entry -> while.cond -> while.body     while.end
              |                            ^
              |             E2             |
              +----------------------------+

The assignment of block weights from instruction weights is done such that a
block's weight is defined to be
the maximum of all sample counts in the instructions for that basic block.
Since block 'entry' has the branch 'br label %while.cond', its weight is
computed as 87,405.

During propagation, we find that block 'entry' dominates block 'while.end'. By
the property of dominance, the weight of 'while.end' will be the same as the
weight for 'entry'.

This gives us:
weight[entry]: 87405
weight[while.cond]: 87405
weight[while.body]: 87405
weight[while.end]: 87405

After computing block weights, we propagate those weights into edge weights.
Since while.end has a single incoming edge, and the weight of the block is
known, we conclude that the weight of that edge (E2) must be 87,405. Similarly,
the weight of the edge entry->while.cond must be 87,405.

This leaves edge E1 as the only unassigned edge. Since the weight of all
incoming edges cannot add up to more than the weight of the block, we conclude
that the weight of edge E1 must be 0.

This is the opposite of what we should've concluded. The real conclusion should
have been:

weight(E2) = 0
weight(E1) = 87405

This causes us to mispredict the backedge, leading into bad block placement.

=========================================================================================

This can be avoided with the use of path discriminators
(http://wiki.dwarfstd.org/index.php?title=Path_Discriminators).

Notice that the problem starts with the mis-computation of 'entry's weight. If
the compiler
emits path discriminators, the branch instruction at the end of entry would be
emitted with a different discriminator as the instructions in block while.cond
(essentially, they are on the same line but on different basic blocks).

In this case, we could have something like:

$ cat a.prof 
_Z3barll:3146580:0
2: 0
2.1: 87405
3: 87405

All the instructions in block 'entry' belonging to line 2 would get no
discriminator, all the instructions in block 'while.cond' belonging to line 2
would get discriminator '1'.

This way, the compiler can now correctly compute 'entry's and 'while.end's
weight as 0. Leaving edge E2 with weight 0 and E1 with weight 87405.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20131218/e09f90de/attachment.html>