[PATCH] D66535: [X86][BtVer2] Fix latency and throughput of XCHG and XADD.
Andrea Di Biagio via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 21 07:19:38 PDT 2019
andreadb created this revision.
andreadb added reviewers: RKSimon, craig.topper.
Herald added subscribers: jfb, gbedwell.
On Jaguar, XCHG has a latency of 1cy and decodes to 2 macro-opcodes.
The maximum throughput I was able to observe for XCHG is 2 IPC. XCHG only seems to consume 1cy of JALU01 (based on my experiments, an ADD can execute in parallel with an independent XCHG).
The byte exchange has worse latency/throughput (1 extra cycle of latency; 1 extra decoded uOP; throughput is capped at 1 IPC).
xchgb %cl, %dl # Latency: 2cy - uOPs: 3 - 2 ALU
xchgw %cx, %dx # Latency: 1cy - uOPs: 2 - 1 ALU
xchgl %ecx, %edx # Latency: 1cy - uOPs: 2 - 1 ALU
xchgq %rcx, %rdx # Latency: 1cy - uOPs: 2 - 1 ALU
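As an illustration of the parallel-issue observation above, here is a minimal llvm-mca input sketch. The register choices and the invocation shown in the comment are assumptions for illustration only; this is not one of the tests in the patch.

```asm
# Hypothetical llvm-mca input. One possible invocation (assumed):
#   llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=100 -timeline
# The ADD touches registers that the XCHG neither reads nor writes, so the
# two instructions have no data dependency on each other.
xchgl %ecx, %edx
addl  %esi, %edi
```

If XCHG really occupies a single JALU pipe for 1cy, the timeline view should show the ADD executing in the same cycle as the XCHG of the same iteration.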
The reg-mem forms of XCHG are atomic operations with an observed latency of 16cy.
The resource usage is similar to the XCHGrr variants. The biggest difference is obviously the bus-locking, which prevents the LS from issuing other memory uOPs in parallel until the unlocking store uOP is executed.
xchgb %cl, (%rsp) # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgw %cx, (%rsp) # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgl %ecx, (%rsp) # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
xchgq %rcx, (%rsp) # Latency: 16cy - uOPs: 3 -- ECX available in 11cy
The exchanged in/out register operand becomes available 11cy after the start of execution. Added test xchg.s to verify that the register write is reported as committed after 11cy (and not 16cy).
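The 11cy vs 16cy distinction can be observed with a dependent consumer of the exchanged register. The following is a minimal sketch, not the actual contents of xchg.s; the invocation in the comment is an assumption.

```asm
# Hypothetical llvm-mca input (not the xchg.s added by this patch).
# Assumed invocation:
#   llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=1 -timeline
xchgl %ecx, (%rsp)   # atomic exchange: 16cy overall, ECX written after 11cy
addl  %ecx, %ecx     # reads ECX, so it cannot start before the XCHG writes it
```

With the latencies described above, the ADD would be expected to start executing roughly 11cy after the XCHG, rather than waiting for the full 16cy.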
Reg-reg XADD instructions have the same latency/throughput as the byte exchange (register-register variant).
xaddb %cl, %dl # latency: 2cy - uOPs: 3 - 2 ALU
xaddw %cx, %dx # latency: 2cy - uOPs: 3 - 2 ALU
xaddl %ecx, %edx # latency: 2cy - uOPs: 3 - 2 ALU
xaddq %rcx, %rdx # latency: 2cy - uOPs: 3 - 2 ALU
The non-atomic RM variants have a latency of 11cy, and decode to 4 macro-opcodes. They still consume 2 ALU pipes, and the exchange in/out register operand becomes available in 3cy (it matches the 'load-to-use latency').
xaddb %cl, (%rsp) # latency: 11cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St
xaddw %cx, (%rsp) # latency: 11cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St
xaddl %ecx, (%rsp) # latency: 11cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St
xaddq %rcx, (%rsp) # latency: 11cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St
The atomic XADD variants execute in 16cy. The in/out register operand is available 11cy after the start of execution.
lock xaddb %cl, (%rsp) # latency: 16cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddw %cx, (%rsp) # latency: 16cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddl %ecx, (%rsp) # latency: 16cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
lock xaddq %rcx, (%rsp) # latency: 16cy - uOPs: 4 - 2 ALU, 1 Ld, 1 St -- ECX available in 11cy
Added test xadd.s to verify those latencies as well as read-advance values.
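For readers unfamiliar with read-advance, the idea can be sketched with a multi-cycle producer feeding the register operand of a memory XADD. This is an illustrative input only, not the actual xadd.s; the choice of IMUL as the producer and the invocation in the comment are assumptions.

```asm
# Hypothetical llvm-mca input (not the xadd.s added by this patch).
# Assumed invocation:
#   llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=1 -timeline
imull %esi, %ecx     # multi-cycle producer of ECX
xaddl %ecx, (%rsp)   # ECX is only needed once the load part of XADD has completed
```

If the register operand has a read-advance matching the load-to-use latency, the XADD can start while the IMUL is still in flight, and the timeline should show the two overlapping instead of executing back to back.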
Please let me know if okay to commit.
https://reviews.llvm.org/D66535
Files:
lib/Target/X86/X86ScheduleBtVer2.td
test/tools/llvm-mca/X86/BtVer2/resources-x86_64.s
test/tools/llvm-mca/X86/BtVer2/xadd.s
test/tools/llvm-mca/X86/BtVer2/xchg.s
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D66535.216386.patch
Type: text/x-patch
Size: 29781 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20190821/10654af0/attachment.bin>