[PATCH] D31900: [AddDiscriminators] Assign discriminators to memset/memcpy/memmove intrinsic calls.

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 11 08:22:32 PDT 2017


andreadb added a comment.

In https://reviews.llvm.org/D31900#723596, @danielcdh wrote:

> Thanks for explaining. My understanding is that memcpy will introduce a new basic block, which will share the same discriminator with other basic blocks without this patch. As a result, that basic block is incorrectly annotated, so the block placement is suboptimal. If my understanding is correct, could you include a test for that in the unittest?


SROA would expand the memcpy call into a straight-line sequence of loads and stores, since the size of each new alloca slice is known statically.
So, no new basic block is created during the process; each new load/store pair lives in the same basic block as the original memcpy call.
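The rewrite amounts to something like the following (a minimal Python sketch of the idea, not SROA's actual implementation):

```python
def split_memcpy(total_size, slices):
    # Sketch of SROA's rewrite: one memcpy covering the whole alloca is
    # replaced by one load/store pair per slice, all emitted into the
    # same basic block as the original call.
    assert total_size == sum(end - begin for begin, end in slices)
    return [("load/store", begin, end) for begin, end in slices]

# The { i64, i32 } alloca from the test: slices [0,8) and [8,12).
pairs = split_memcpy(12, [(0, 8), (8, 12)])
print(pairs)  # [('load/store', 0, 8), ('load/store', 8, 12)]
```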

We can see this by running the sroa pass on test memcpy-discriminator.ll.

SROA partitions the alloca `g_b.coerce` into two smaller allocas (one for slice [0,8) and another for slice [8,12)).

  Slices of alloca:   %g_b.coerce = alloca { i64, i32 }, align 4
    [0,12) slice #0 (splittable)
      used by:   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 12, i32 4, i1 false), !dbg !10
    [0,8) slice #1 (splittable)
      used by:   %3 = load i64, i64* %2, align 4, !dbg !8
    [8,12) slice #2 (splittable)
      used by:   %5 = load i32, i32* %4, align 4, !dbg !8

Each user of the original alloca is rewritten. The memcpy call is one of the users, so it is split into two load/store pairs (one pair per slice).

Later on SROA uses mem2reg to promote the newly created alloca slices to registers. In this example, the promotion is successful, and loads %3 and %5 are made redundant by the loads materialized as a result of the memcpy expansion.

This is the IR at the end of SROA:

  define i32 @foo(i32 %cond) !dbg !5 {
  entry:
    %tobool = icmp ne i32 %cond, 0, !dbg !7
    br i1 %tobool, label %cond.true, label %cond.end, !dbg !7
  
  cond.true:                                        ; preds = %entry
    %g_b.coerce.sroa.0.0.copyload = load i64, i64* bitcast (%struct.B* @g_b to i64*), align 4, !dbg !8    <<< NO DISCRIMINATOR
    %g_b.coerce.sroa.2.0.copyload = load i32, i32* getelementptr inbounds (%struct.B, %struct.B* @g_b, i64 0, i32 1), align 4, !dbg !8 <<< NO DISCRIMINATOR
    %call = call i32 @bar(i64 %g_b.coerce.sroa.0.0.copyload, i32 %g_b.coerce.sroa.2.0.copyload, i32 33), !dbg !9 
    br label %cond.end, !dbg !11
  
  cond.end:                                         ; preds = %cond.true, %entry
    %cond1 = phi i32 [ %call, %cond.true ], [ 42, %entry ], !dbg !12
    ret i32 %cond1, !dbg !14
  }

Where:

  !7 = !DILocation(line: 16, column: 16, scope: !5)
  !8 = !DILocation(line: 16, column: 23, scope: !5)
  !9 = !DILocation(line: 16, column: 23, scope: !10)
  !10 = !DILexicalBlockFile(scope: !5, file: !1, discriminator: 2)
  !14 = !DILocation(line: 17, column: 3, scope: !5)

As you can see, the two loads are in the same basic block as the original memcpy call. However, both loads now reference location !8, which has no discriminator, so they are indistinguishable from other code on line 16.

> The reason I'm careful on adding new discriminators is that, as we are encoding discriminators, we have limited bits to represent the base discriminator. We only want to set a new discriminator when it can improve profile quality, which is valid if we can have a test showing how this can improve profile quality.

I understand that we don't want to waste discriminators, and I fully agree that we should avoid consuming discriminators wherever possible.
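To make the trade-off concrete, here is a toy Python sketch of why discriminator values are a scarce resource (the bit width below is a made-up number for illustration, not LLVM's actual encoding):

```python
# Toy discriminator allocator.  Assumption for illustration only: the base
# discriminator is packed into a fixed-width bit field, so the supply of
# distinguishable locations per source line is finite.
DISCRIMINATOR_BITS = 3                      # hypothetical width
MAX_DISCRIMINATOR = (1 << DISCRIMINATOR_BITS) - 1

def next_discriminator(line, used):
    """Hand out the next unused discriminator value for a source line."""
    d = used.get(line, 0) + 1
    if d > MAX_DISCRIMINATOR:
        raise RuntimeError("out of discriminators for line %d" % line)
    used[line] = d
    return d

used = {}
print(next_discriminator(16, used))  # 1
print(next_discriminator(16, used))  # 2
```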

I guess I could add a simple unit test that shows how the branch weight metadata is incorrect for the terminator of the %entry block.

For example, we could have a profile like this:

  foo:10000:10000
   16: 10000
   16.1: 0
   17: 10000

SampleProfile would compute a wrong block weight of 10000 for block %cond.true because the two loads reference line 16 without a discriminator. That weight would then be propagated to the edge from %entry to %cond.true. So we would end up with an incorrect edge weight of

  br i1 %tobool, label %cond.true, label %cond.end, !dbg !7 !prof !36
  !36 = !{!"branch_weights", i32 10002, i32 1}
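The mis-attribution can be sketched in a few lines of Python (a simplification of SampleProfile's heuristic, assuming a block's weight is the maximum sample count over the debug locations of its instructions; the "16.2" key for the call to bar follows from its discriminator-2 location in the metadata above):

```python
# Profile for foo, keyed by "line" or "line.discriminator".
profile = {"16": 10000, "16.1": 0, "17": 10000}

def block_weight(locations, profile):
    # Simplified heuristic: a block's weight is the maximum sample count
    # over the (line, discriminator) locations of its instructions.
    return max(profile.get(loc, 0) for loc in locations)

# Without the patch, both expanded loads keep plain line 16 (no
# discriminator); the call to bar sits at 16.2 (discriminator 2).
print(block_weight(["16", "16", "16.2"], profile))      # 10000 -> hot

# With the patch, the memcpy (and hence its expanded loads) carries a
# discriminator, so the loads map to the cold 16.1 entry instead.
print(block_weight(["16.1", "16.1", "16.2"], profile))  # 0 -> cold
```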

Without this patch, this would result in a sub-optimal placement (see x86 assembly below):

  foo:                                    # @foo
          testl   %edi, %edi
          je      .LBB0_1
  # BB#2:                                 # %cond.true
          movq    g_b(%rip), %rdi
          movl    g_b+8(%rip), %esi
          movl    $33, %edx
          jmp     bar                     # TAILCALL
  .LBB0_1:                                # %cond.end
          movl    $42, %eax
          retq

With this patch, the cold block %cond.true is correctly pushed to the bottom of the function.

  foo:                                    # @foo
          testl   %edi, %edi
          jne     .LBB0_2
  # BB#1:                                 # %cond.end
          movl    $42, %eax
          retq
  .LBB0_2:                                # %cond.true
          movq    g_b(%rip), %rdi
          movl    g_b+8(%rip), %esi
          movl    $33, %edx
          jmp     bar                     # TAILCALL

That being said, I am under the impression that memcpy-discriminator.ll is already enough to show the problem with discriminators, and I am not sure a unit test would add extra value in this case.

-Andrea


https://reviews.llvm.org/D31900
