[PATCH] D66535: [X86][BtVer2] Fix latency and throughput of XCHG and XADD.

Wed Aug 21 07:19:38 PDT 2019

andreadb created this revision.
andreadb added reviewers: RKSimon, craig.topper.
Herald added subscribers: jfb, gbedwell.

On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes.
I observed a maximum throughput of 2 IPC for XCHG. XCHG only seems to consume 1cy of JALU01 (based on my experiments, an ADD can execute in parallel with an independent XCHG).

The byte exchange has worse latency/throughput (1 extra latency cycle; 1 extra decoded uOP; throughput is capped at 1 IPC).

  xchgb %cl, %dl           # Latency: 2cy    -  uOPs: 3  -  2 ALU
  xchgw %cx, %dx           # Latency: 1cy    -  uOPs: 2  -  1 ALU
  xchgl %ecx, %edx         # Latency: 1cy    -  uOPs: 2  -  1 ALU
  xchgq %rcx, %rdx         # Latency: 1cy    -  uOPs: 2  -  1 ALU
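
One rough way to read the ALU column in the listing above: if ALU pipe occupancy were the only limiter, the throughput ceiling would be the number of ALU pipes divided by the ALU cycles each instruction consumes. The Python sketch below is purely illustrative and not part of the patch. It assumes JALU01 is a group of two pipes, and the names `NUM_ALU_PIPES` and `alu_bound_ipc` are made up for this example.

```python
from fractions import Fraction

# Assumption: JALU01 groups two integer ALU pipes (JALU0 and JALU1).
NUM_ALU_PIPES = 2

# ALU pipe-cycles per instruction, copied from the listing above.
ALU_CYCLES = {
    "xchgb %cl, %dl": 2,
    "xchgw %cx, %dx": 1,
    "xchgl %ecx, %edx": 1,
    "xchgq %rcx, %rdx": 1,
}


def alu_bound_ipc(alu_cycles, num_pipes=NUM_ALU_PIPES):
    """Upper bound on IPC if only ALU pipe occupancy limits throughput."""
    return Fraction(num_pipes, alu_cycles)


for insn, cycles in ALU_CYCLES.items():
    print(f"{insn:<20} ALU-bound IPC <= {alu_bound_ipc(cycles)}")
```

Under that assumption the bound comes out at 2 IPC for the 16/32/64-bit forms and 1 IPC for the byte form, which is consistent with the observed numbers.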

The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy.
The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed.

  xchgb %cl, (%rsp)        # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
  xchgw %cx, (%rsp)        # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
  xchgl %ecx, (%rsp)       # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy
  xchgq %rcx, (%rsp)       # Latency: 16cy   -  uOPs: 3 -- ECX available in 11cy

The exchanged in/out register operand becomes available 11cy after the start of execution. Added test xchg.s to verify that the register write is seen as committed after 11cy (and not 16cy).
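
To make the 11cy vs 16cy distinction concrete, here is a small Python sketch (illustrative only, not taken from the patch or from llvm-mca). It computes the earliest cycle at which an instruction that reads the exchanged register could issue, depending on which latency the model attaches to the register write. The function name `earliest_consumer_issue` and the constants are hypothetical, and the model ignores every other constraint, such as resource availability.

```python
# Cycle counts for the atomic reg-mem XCHG, as reported above.
INSTRUCTION_LATENCY = 16     # full atomic exchange, including the unlocking store
REGISTER_WRITE_LATENCY = 11  # in/out register operand becomes available


def earliest_consumer_issue(producer_issue_cycle, write_latency):
    """Earliest cycle at which a dependent reader of the register can issue.

    Simplified model: the only constraint is the producer's write latency.
    """
    return producer_issue_cycle + write_latency


# A consumer of the register, with the XCHG issued at cycle 0.
with_register_latency = earliest_consumer_issue(0, REGISTER_WRITE_LATENCY)
with_full_latency = earliest_consumer_issue(0, INSTRUCTION_LATENCY)

print("consumer issue cycle (register write latency):", with_register_latency)
print("consumer issue cycle (full instruction latency):", with_full_latency)
print("cycles saved:", with_full_latency - with_register_latency)
```

In this simplified view, modelling the register write with the full 16cy latency would delay a dependent instruction by 5 cycles compared with the measured behaviour.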

Reg-reg XADD instructions have the same latency/throughput as the byte exchange (register-register variant).

  xaddb %cl, %dl           # latency: 2cy    -  uOPs: 3  -  2 ALU
  xaddw %cx, %dx           # latency: 2cy    -  uOPs: 3  -  2 ALU
  xaddl %ecx, %edx         # latency: 2cy    -  uOPs: 3  -  2 ALU
  xaddq %rcx, %rdx         # latency: 2cy    -  uOPs: 3  -  2 ALU

The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency').

  xaddb %cl, (%rsp)        # latency: 11cy   -  uOPs: 4  -  2 ALU, 1 Ld, 1 St
  xaddw %cx, (%rsp)        # latency: 11cy   -  uOPs: 4  -  2 ALU, 1 Ld, 1 St
  xaddl %ecx, (%rsp)       # latency: 11cy   -  uOPs: 4  -  2 ALU, 1 Ld, 1 St
  xaddq %rcx, (%rsp)       # latency: 11cy   -  uOPs: 4  -  2 ALU, 1 Ld, 1 St

The atomic XADD variants execute in 16cy. The in/out register operand is available 11cy after the start of execution.

  lock xaddb %cl, (%rsp)   # latency: 16cy - uOPs: 4  - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
  lock xaddw %cx, (%rsp)   # latency: 16cy - uOPs: 4  - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
  lock xaddl %ecx, (%rsp)  # latency: 16cy - uOPs: 4  - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
  lock xaddq %rcx, (%rsp)  # latency: 16cy - uOPs: 4  - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy

Added test xadd.s to verify those latencies as well as read-advance values.

Please let me know if okay to commit.

https://reviews.llvm.org/D66535

Files:
  lib/Target/X86/X86ScheduleBtVer2.td
  test/tools/llvm-mca/X86/BtVer2/resources-x86_64.s
  test/tools/llvm-mca/X86/BtVer2/xadd.s
  test/tools/llvm-mca/X86/BtVer2/xchg.s

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D66535.216386.patch
Type: text/x-patch
Size: 29781 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20190821/10654af0/attachment.bin>