[PATCH] D46493: [DagCombiner] Not all 'andn''s work with immediates.

Sat May 5 05:22:21 PDT 2018

andreadb added a comment.

> I'd say this regression is an improvement, since IPC increased in that case?

As a rule of thumb when using llvm-mca, it's best to always remove return statements from the assembly code sequence.
llvm-mca should have warned you about the presence of a return statement in the input sequence:

  warning: found a return instruction in the input assembly sequence.
  note: program counter updates are ignored.
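
If you want to automate that step, here is a minimal sketch (not part of llvm-mca) that drops return instructions from an AT&T-syntax snippet before it is passed to the tool. The helper name `strip_returns` and the regular expression are my own assumptions; it only looks at the mnemonic at the start of each line, so a label that is literally named `ret` would also be dropped.

```python
import re

# Assumption: AT&T syntax, one instruction per line. Matches ret, retq,
# retl and retw at the start of a line (after optional whitespace).
_RET_RE = re.compile(r"^\s*ret[qlw]?\b")


def strip_returns(asm_text):
    """Return asm_text without the lines that hold a return instruction."""
    kept = [line for line in asm_text.splitlines() if not _RET_RE.match(line)]
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    snippet = "shrq $63, %rdi\nxorl $1, %edi\nmovl %edi, %eax\nretq\n"
    print(strip_returns(snippet), end="")
```

The filtered text can then be fed to llvm-mca as usual.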

To get the correct resource pressure distribution in example icmp-opt.txt, you should remove the `retq`.
As a result, you should see this:

  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]
   -      -     0.75   0.75   0.75   0.75    -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]       Instructions:
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         shrq    $63, %rdi
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         xorl    $1, %edi
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         movl    %edi, %eax

And for the other sequence:

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]       Instructions:
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         xorl    %eax, %eax
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         testq   %rdi, %rdi
   -      -     0.25   0.25   0.25   0.25    -      -      -      -      -      -         setns   %al

In terms of resource pressure, the two code sequences are equally good.

The `shrq+xor+mov` sequence is worse in terms of IPC because the data dependency on %edi limits the ILP when multiple iterations of the loop are executed.
If you run multiple iterations and print the timeline view, you can see that the "average wait time" in the scheduler's queue is quite high for the shrq instruction.
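
To make the dependency argument concrete, here is a toy model. It is not how llvm-mca works; it is only a sketch under loud assumptions: every instruction has a latency of one cycle, at most `width` instructions are dispatched per cycle, execution units are unlimited, renaming is perfect (only read-after-write dependencies count), sub-registers alias their full register, `xorl %eax, %eax` is a dependency-breaking zero idiom, and the partial-register merge of `setns %al` is ignored. The names `simulate`, `SHR_XOR_MOV` and `XOR_TEST_SETNS` are hypothetical.

```python
def simulate(body, iterations, width=4, latency=1):
    """Return the IPC of a toy core running `body` back to back.

    `body` is a list of (reads, writes) pairs of register names.
    """
    ready = {}        # register -> cycle at which its latest value is available
    last_finish = 0   # cycle at which the last result is produced
    issued = 0        # number of instructions dispatched so far
    for _ in range(iterations):
        for reads, writes in body:
            dispatch = issued // width  # at most `width` dispatches per cycle
            start = max([dispatch] + [ready.get(r, 0) for r in reads])
            finish = start + latency
            for reg in writes:
                ready[reg] = finish
            last_finish = max(last_finish, finish)
            issued += 1
    return issued / last_finish


# shrq $63, %rdi ; xorl $1, %edi ; movl %edi, %eax
# %rdi is read and written in every iteration: a loop-carried chain.
SHR_XOR_MOV = [(["rdi"], ["rdi"]), (["rdi"], ["rdi"]), (["rdi"], ["rax"])]

# xorl %eax, %eax ; testq %rdi, %rdi ; setns %al
# %rdi is only read, so iterations are independent of each other.
XOR_TEST_SETNS = [([], ["rax"]), (["rdi"], ["flags"]), (["flags"], ["rax"])]

if __name__ == "__main__":
    for name, body in (("shrq+xor+mov", SHR_XOR_MOV),
                       ("xor+test+setns", XOR_TEST_SETNS)):
        print(name, round(simulate(body, 100), 2))
```

In this model the first sequence can never exceed 1.5 IPC, because each iteration adds two cycles to the chain through %rdi, while the second sequence is bounded only by the assumed dispatch width. The absolute numbers are artifacts of the assumptions and should not be compared with what llvm-mca reports; only the direction of the difference is meaningful.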

You can see similar behavior in test `pos_sel_constants`. The only problem I see is the slow LEA instruction (which is not treated specially by the scheduling model; at the moment it uses the same resources as a normal LEA).
Assuming that these instructions are executed in a loop, the new variant suffers less from long data dependencies between iterations.

Repository:
  rL LLVM

https://reviews.llvm.org/D46493