[PATCH] D36858: [x86] Teach the cmov converter to aggressively convert cmovs with memory operands into control flow.

Thu Aug 17 19:02:41 PDT 2017

chandlerc created this revision.
Herald added subscribers: mcrosier, sanjoy.

We have periodically seen performance problems with cmov where one
operand comes from memory. On modern x86 processors with strong branch
predictors and speculative execution, this is usually handled much
better by a branch than by a cmov. We routinely see the cmov stall until
the load completes rather than continuing, and if there are subsequent
branches, they cannot be speculated in turn.

Also, in many (even simple) cases, macro fusion causes the control flow
version to be fewer uops.
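As a rough illustration of the pattern described above, here is a minimal
C++ sketch of the two shapes. The Record layout and all function names are
invented for this example and are only one plausible reading of the
listings in this message; which form a given compiler actually emits
depends on its version, flags, and heuristics.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical record, invented for illustration: 16 bytes where the last
// byte is a tag and the first 8 bytes hold a payload.
struct Record {
    unsigned char bytes[16];
};

// Read the 8-byte payload at offset 0 (memcpy keeps this well defined).
inline std::uint64_t payload(const Record& r) {
    std::uint64_t v;
    std::memcpy(&v, r.bytes, sizeof v);
    return v;
}

// Select shape: the load is performed unconditionally and the result is
// then chosen by the comparison. A compiler may lower this to a load
// feeding a cmov, so the selected value depends on the load completing.
inline std::uint64_t select_form(const Record& r) {
    std::uint64_t loaded = payload(r);
    return r.bytes[15] > 0x0f ? loaded : 0;
}

// Branch shape: the load only happens on the taken path, so a correctly
// predicted branch lets later work proceed speculatively.
inline std::uint64_t branch_form(const Record& r) {
    if (r.bytes[15] > 0x0f)
        return payload(r);
    return 0;
}
```

Both functions compute the same value; the only intended difference is
whether the load sits on the data-dependency chain of the result or behind
a predicted branch.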

Consider the IACA output for the initial sequence of code in a very hot
function in one of our internal benchmarks that motivates this change,
and notice the reduction in micro-ops.

Before, SNB:

  Throughput Analysis Report
  --------------------------
  Block Throughput: 2.20 Cycles       Throughput Bottleneck: Port1

  | Num Of |              Ports pressure in cycles               |    |
  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
  ---------------------------------------------------------------------
  |   1    |           | 1.0 |           |           |     |     | CP | mov rcx, rdi
  |   0*   |           |     |           |           |     |     |    | xor edi, edi
  |   2^   | 0.1       | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.4 | CP | cmp byte ptr [rsi+0xf], 0xf
  |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |    | mov rax, qword ptr [rsi]
  |   3    | 1.8       | 0.6 |           |           |     | 0.6 | CP | cmovbe rax, rdi
  |   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |    | cmp byte ptr [rcx+0xf], 0x10
  |   0F   |           |     |           |           |     |     |    | jb 0xf
  Total Num Of Uops: 9

After, SNB:

  Throughput Analysis Report
  --------------------------
  Block Throughput: 2.00 Cycles       Throughput Bottleneck: Port5

  | Num Of |              Ports pressure in cycles               |    |
  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
  ---------------------------------------------------------------------
  |   1    | 0.5       | 0.5 |           |           |     |     |    | mov rax, rdi
  |   0*   |           |     |           |           |     |     |    | xor edi, edi
  |   2^   | 0.5       | 0.5 | 1.0   1.0 |           |     |     |    | cmp byte ptr [rsi+0xf], 0xf
  |   1    | 0.5       | 0.5 |           |           |     |     |    | mov ecx, 0x0
  |   1    |           |     |           |           |     | 1.0 | CP | jnbe 0x39
  |   2^   |           |     |           | 1.0   1.0 |     | 1.0 | CP | cmp byte ptr [rax+0xf], 0x10
  |   0F   |           |     |           |           |     |     |    | jnb 0x3c
  Total Num Of Uops: 7

On Haswell, the difference even shows up in the block throughput.

Before, HSW:

  Throughput Analysis Report
  --------------------------
  Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

  | Num Of |                    Ports pressure in cycles                     |    |
  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
  ---------------------------------------------------------------------------------
  |   0*   |           |     |           |           |     |     |     |     |    | mov rcx, rdi
  |   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
  |   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
  |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rax, qword ptr [rsi]
  |   3    | 1.0       | 1.0 |           |           |     |     | 1.0 |     |    | cmovbe rax, rdi
  |   2^   | 0.5       |     | 0.5   0.5 | 0.5   0.5 |     |     | 0.5 |     |    | cmp byte ptr [rcx+0xf], 0x10
  |   0F   |           |     |           |           |     |     |     |     |    | jb 0xf
  Total Num Of Uops: 8

After, HSW:

  Throughput Analysis Report
  --------------------------
  Block Throughput: 1.50 Cycles       Throughput Bottleneck: FrontEnd

  | Num Of |                    Ports pressure in cycles                     |    |
  |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
  ---------------------------------------------------------------------------------
  |   0*   |           |     |           |           |     |     |     |     |    | mov rax, rdi
  |   0*   |           |     |           |           |     |     |     |     |    | xor edi, edi
  |   2^   |           |     | 1.0   1.0 |           |     | 1.0 |     |     |    | cmp byte ptr [rsi+0xf], 0xf
  |   1    |           | 1.0 |           |           |     |     |     |     |    | mov ecx, 0x0
  |   1    |           |     |           |           |     |     | 1.0 |     |    | jnbe 0x39
  |   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     |    | cmp byte ptr [rax+0xf], 0x10
  |   0F   |           |     |           |           |     |     |     |     |    | jnb 0x3c
  Total Num Of Uops: 6

Note that this cannot be usefully restricted to inner loops. Much of the
hot code we see hitting this is not in an inner loop or not in a loop at
all. The optimization still remains effective and indeed critical for
some of our code.

I have run a suite of internal benchmarks with this change and saw no
significant regressions and a few very significant improvements. I'm
still working on collecting data for SPEC and the LLVM test suite. I
will update when I have it.

I am also still working on dedicated testing of this functionality, but
I've built a very large amount of code with the patch and had no issues.

Depends on https://reviews.llvm.org/D36783.

https://reviews.llvm.org/D36858

Files:
  lib/Target/X86/X86CmovConversion.cpp
  test/CodeGen/X86/cmov.ll
  test/CodeGen/X86/pr15981.ll
