[llvm-dev] Align loops by 32 to use DSB more efficiently in x86

Maxim Kazantsev via llvm-dev llvm-dev at lists.llvm.org
Wed Jan 27 21:44:00 PST 2021


Hello everyone,

I wanted to discuss the loop alignment choice in X86 codegen. Currently, LLVM unconditionally aligns all loops by 16 bytes, and in some cases this does not interact well with certain processor mechanisms, in particular with the DSB (Decoded Stream Buffer, the decoded uop cache). The effect I'm observing now has been discussed before; it is at least mentioned in these slides: https://llvm.org/devmtg/2016-11/Slides/Ansari-Code-Alignment.pdf, but it doesn't seem that any decision has been made on it since then.

Motivation:
The motivating piece of code that demonstrated significant score swings is the following example:

define i32 @test(i32* %p, i64 %len, i32 %x) {
entry:
  br label %loop

loop:                                             ; preds = %backedge, %entry
  %iv = phi i64 [ %iv.next, %backedge ], [ %len, %entry ]
  %iv.next = add nsw i64 %iv, -1
  %cond_1 = icmp eq i64 %iv, 0
  br i1 %cond_1, label %exit, label %backedge

backedge:                                         ; preds = %loop
  %addr = getelementptr inbounds i32, i32* %p, i64 %iv.next
  %loaded = load atomic i32, i32* %addr unordered, align 4
  %cond_2 = icmp eq i32 %loaded, %x
  br i1 %cond_2, label %failure, label %loop

exit:                                             ; preds = %loop
  ret i32 -1

failure:                                          ; preds = %backedge
  unreachable
}

Basically, this code searches for element x in an array of values. Here is the llc result for this loop with mtriple=x86_64-apple-macosx:

  .p2align 4, 0x90
LBB0_1: ## %loop
  ## =>This Inner Loop Header: Depth=1
  subq $1, %rax
  jb LBB0_4
## %bb.2: ## %backedge
  ## in Loop: Header=BB0_1 Depth=1
  cmpl %edx, -4(%rdi,%rsi,4)
  movq %rax, %rsi
  jne LBB0_1

(Note: the last movq is redundant; it is inserted by LSR, likely due to a cost model bug, filed as https://bugs.llvm.org/show_bug.cgi?id=48355. Regardless of that, the situation remains the same.)
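(For reproduction: assuming the IR above is saved as test.ll, something like the following should regenerate this output. The second command uses the existing experimental option to force 32-byte loop alignment for comparison; I believe it is spelled x86-experimental-pref-loop-alignment and takes the log2 of the desired alignment, so =5 means 32 bytes.)

  llc -mtriple=x86_64-apple-macosx test.ll -o -
  llc -mtriple=x86_64-apple-macosx -x86-experimental-pref-loop-alignment=5 test.ll -o -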

And here is the assembly observed at run time on the x64 platform:

97.34%  ↗  0x30026d50: 83ea01                   subl    $1, %edx
         │  0x30026d53: 0f820b060000             jb    1547                            ; 0x30027364
  0.04%  │  0x30026d59: 89d3                     movl    %edx, %ebx
         │  0x30026d5b: 394c9810                 cmpl    %ecx, 16(%rax,%rbx,4)
         ╰  0x30026d5f: 75ef                     jne    -17                            ; 0x30026d50


Important notes here:

  *   Loop is aligned by 16 bytes;
  *   Loop size is 17 bytes;

Depending on the particular machine, this loop may also end up aligned by 32 bytes. The score difference for this example is dramatic: 32 ops/ms when the loop is aligned by 16 (and not by 32), against 51 ops/ms when it is aligned by 32. This means the workload is bound by decoding, while the rest of the execution works fine (insns/cycle grows from 2.5 to 3.7 when aligned by 32).

Alignment of this particular loop depends on how the code for the preceding IR is generated, and in our case it varies:

  *   Host to host;
  *   Build to build;
  *   Run to run (observed at least once, might be a JIT effect?).

Here are some performance counters collected on this test:


Align 16:

    17,502,234,018      cycles                                (81.80%)
        26,147,856      idq.all_dsb_cycles_4_uops             (81.80%)
    17,304,494,357      idq.all_dsb_cycles_any_uops           (81.80%)
    17,302,169,702      idq.dsb_cycles                        (81.80%)
    34,681,358,154      idq.dsb_uops                          (81.80%)
        23,714,772      idq.all_mite_cycles_4_uops            (81.82%)
        87,658,024      idq.all_mite_cycles_any_uops          (81.83%)
        66,740,027      idq.mite_cycles                       (81.84%)
       163,425,243      idq.mite_uops                         (81.84%)
         2,674,281      dsb2mite_switches.penalty_cycles      (72.74%)
         5,537,882      idq.ms_switches                       (72.72%)

       5.001125917 seconds time elapsed

Align 32:

    17,686,896,278      cycles                                (81.79%)
    13,101,510,457      idq.all_dsb_cycles_4_uops             (81.81%)
    13,113,604,250      idq.all_dsb_cycles_any_uops           (81.83%)
    13,112,858,568      idq.dsb_cycles                        (81.83%)
    52,468,480,758      idq.dsb_uops                          (81.83%)
        23,659,985      idq.all_mite_cycles_4_uops            (81.83%)
        64,453,229      idq.all_mite_cycles_any_uops          (81.83%)
        44,043,473      idq.mite_cycles                       (81.83%)
       123,098,484      idq.mite_uops                         (81.83%)
         2,485,361      dsb2mite_switches.penalty_cycles      (72.71%)
         5,676,762      idq.ms_switches                       (72.70%)

       5.051353273 seconds time elapsed
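
(For reference, counters like these can be collected with a perf stat invocation along the following lines; the event list is exactly the set shown above and assumes an Intel CPU that exposes the DSB/MITE events, with <benchmark command> standing in for the actual run. The percentages in parentheses are the fraction of time each event was actually measured, since perf has to multiplex this many counters.)

  perf stat \
    -e cycles -e idq.all_dsb_cycles_4_uops -e idq.all_dsb_cycles_any_uops \
    -e idq.dsb_cycles -e idq.dsb_uops \
    -e idq.all_mite_cycles_4_uops -e idq.all_mite_cycles_any_uops \
    -e idq.mite_cycles -e idq.mite_uops \
    -e dsb2mite_switches.penalty_cycles -e idq.ms_switches \
    -- <benchmark command>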

There is a dramatic difference in the counter “idq.all_dsb_cycles_4_uops”. The apparent reason is that the DSB works with 32-byte aligned instruction windows. If the loop crosses the boundary of such a window, the DSB cannot work at its maximum possible efficiency.
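
To make the window arithmetic concrete using the addresses from the snippet above: the loop body occupies 17 bytes starting at 0x30026d50, and 0x30026d50 mod 32 = 16, so every iteration straddles the two 32-byte windows beginning at 0x30026d40 and 0x30026d60. Aligned by 32, the same 17 bytes would fit entirely inside a single window.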

We are only observing this in one workload (out of the Java benchmark set we run regularly), but given how simple this example is, I see no reason why it wouldn't be a generic, widespread issue. Obviously, we want such loops aligned by 32 to benefit from the DSB.

Proposal:
We could have a heuristic that tries to align such loops by 32 bytes. It may be more or less complex. The simplest solution would be “let’s align all innermost loops by 32” (we already have an experimental option for that). The more accurate solution, a rough sketch of which is given after the list below, would be:
Align loops by 32 if:

  *   They are innermost;
  *   The size of the loop mod 32 is between 16 and 31 (only in this case does alignment by 32 strictly reduce the number of 32-byte window crossings by 1);
  *   (Optional) The loop is small, e.g. less than 32 bytes;
  *   (Optional) We could make even sharper checks, trying to ensure that all other conditions for maximum DSB utilization are met (may be very complex!).
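
A minimal sketch of what the more accurate check could look like (a standalone C++ helper, not actual LLVM API; the names are illustrative, and obtaining the loop size in bytes at the right point in the pipeline is the hard part left out here):

  // Proposed heuristic: request 32-byte alignment only for innermost loops
  // whose size is such that the extra alignment actually removes a window
  // crossing.
  static bool shouldAlignLoopBy32(bool IsInnermost, unsigned LoopSizeInBytes) {
    if (!IsInnermost)
      return false;
    unsigned Rem = LoopSizeInBytes % 32;
    // Only when size mod 32 lands in [16, 31] does moving from 16-byte to
    // 32-byte alignment strictly reduce the number of 32-byte window
    // crossings by one.
    if (Rem < 16)
      return false;
    // Optional sharper restriction from the list above: tiny loops only.
    // if (LoopSizeInBytes >= 32) return false;
    return true;
  }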

Possible cons we are aware of:

  *   Code size. Insertion of NOPs for alignment potentially increases code size.
     *   Although there is clearly some growth, I think it is negligible: we measured this on a huge Java code corpus, and the impact is far less than 1% and not distinguishable from noise in JIT runs.
  *   Code size growth may affect performance of containing loops (e.g. they will stop fitting into some code window/cache line/icache).
     *   This, though theoretically possible, is very unlikely to be a real problem. Outer loops only matter if the inner loops are small relative to them, which means the inner loops will likely be unrolled and fully eliminated, and their parents will become innermost loops themselves.
     *   If it turns out to be a real problem, we can refrain from applying the 32-byte alignment to loops with parents.
  *   This may interact with some other processor mechanisms in a negative way. My knowledge here is not deep enough to say which ones and how.
     *   We did not see a negative performance impact on our benchmarks. That does not mean there is none, but we may hope such cases are uncommon.

I suggest thinking about this and choosing how restrictively we want to align loops by 32, given this data.

Regards,
Max

