[PATCH] D67799: [InstCombine] Fold a shifty implementation of clamp negative to zero.
Huihui Zhang via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Sep 23 00:36:43 PDT 2019
huihuiz added a comment.
llvm-mca results for the more general folding pattern:
--- Scalar Tests ---
X86: skylake (cmovgl latency is 1)
Test input (clampNegToZero.ll); run: clang clampNegToZero.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake
define i32 @clamp0(i32 %v, i32 %x) {
%sub = sub nsw i32 %x, %v
%shr = ashr i32 %sub, 31
%and = and i32 %shr, %v
ret i32 %and
}
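For reference, the scalar pattern above yields %v when %v > %x (signed) and zero otherwise, which is what the cmp plus cmov/csel sequences in the "after" reports implement. A minimal Python sketch of that equivalence, assuming 32-bit two's-complement values and skipping inputs where `sub nsw` would overflow (the IR result is poison there). The function names are hypothetical and not part of the patch.

```python
import itertools

BITS = 32
INT_MIN = -(1 << (BITS - 1))
INT_MAX = (1 << (BITS - 1)) - 1


def clamp_shifty(v, x):
    """Mirror of the IR: and(ashr(sub nsw x, v, 31), v).

    Returns None when x - v overflows i32, because `sub nsw` is poison there.
    """
    sub = x - v
    if not (INT_MIN <= sub <= INT_MAX):
        return None
    shr = sub >> (BITS - 1)  # arithmetic shift: -1 if sub is negative, else 0
    return shr & v           # -1 & v == v, 0 & v == 0


def clamp_select(v, x):
    """The folded form: select(icmp sgt v, x), v, 0."""
    return v if v > x else 0


def check(samples):
    """Compare both forms on every pair of samples with a defined result."""
    for v, x in itertools.product(samples, repeat=2):
        expected = clamp_shifty(v, x)
        if expected is not None and expected != clamp_select(v, x):
            return False
    return True


print(check([INT_MIN, -7, -1, 0, 1, 7, INT_MAX]))
```

This only spot-checks a handful of boundary values; it is not a proof of the fold.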
Before:
Iterations: 100
Instructions: 500
Total Cycles: 159
Total uOps: 700
Dispatch Width: 6
uOps Per Cycle: 4.40
IPC: 3.14
Block RThroughput: 1.2
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        movl %esi, %eax
1      1      0.25                        subl %edi, %eax
1      1      0.50                        sarl $31, %eax
1      1      0.25                        andl %edi, %eax
3      7      1.00                 U      retq
After this transformation:
Iterations: 100
Instructions: 400
Total Cycles: 110
Total uOps: 600
Dispatch Width: 6
uOps Per Cycle: 5.45
IPC: 3.64
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      0      0.17                        xorl %eax, %eax
1      1      0.25                        cmpl %esi, %edi
1      1      0.50                        cmovgl %edi, %eax
3      7      1.00                 U      retq
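The rate lines in these summaries follow from the raw counts: llvm-mca reports IPC as instructions divided by total cycles, and uOps per cycle as total uOps divided by total cycles. A small sketch that reproduces the skylake figures above; the helper name is hypothetical.

```python
def mca_rates(instructions, cycles, uops):
    """Return (IPC, uOps per cycle), rounded to two decimals like the report."""
    return round(instructions / cycles, 2), round(uops / cycles, 2)


# Counts copied from the skylake scalar reports above.
before = mca_rates(instructions=500, cycles=159, uops=700)
after = mca_rates(instructions=400, cycles=110, uops=600)

print("before:", before)
print("after: ", after)
print("cycle ratio (before / after):", round(159 / 110, 2))
```

The same two divisions apply to every report in this comment.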
X86: cooperlake (cmovgl latency is also 1)
Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=cooperlake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=cooperlake
Before:
Iterations: 100
Instructions: 500
Total Cycles: 159
Total uOps: 700
Dispatch Width: 6
uOps Per Cycle: 4.40
IPC: 3.14
Block RThroughput: 1.2
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        movl %esi, %eax
1      1      0.25                        subl %edi, %eax
1      1      0.50                        sarl $31, %eax
1      1      0.25                        andl %edi, %eax
3      7      1.00                 U      retq
After this transformation:
Iterations: 100
Instructions: 400
Total Cycles: 110
Total uOps: 600
Dispatch Width: 6
uOps Per Cycle: 5.45
IPC: 3.64
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      0      0.17                        xorl %eax, %eax
1      1      0.25                        cmpl %esi, %edi
1      1      0.50                        cmovgl %edi, %eax
3      7      1.00                 U      retq
AMD: znver2
Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2
Before:
Iterations: 100
Instructions: 500
Total Cycles: 155
Total uOps: 600
Dispatch Width: 4
uOps Per Cycle: 3.87
IPC: 3.23
Block RThroughput: 1.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        movl %esi, %eax
1      1      0.25                        subl %edi, %eax
1      1      0.25                        sarl $31, %eax
1      1      0.25                        andl %edi, %eax
2      1      0.50                 U      retq
After this transformation:
Iterations: 100
Instructions: 400
Total Cycles: 203
Total uOps: 500
Dispatch Width: 4
uOps Per Cycle: 2.46
IPC: 1.97
Block RThroughput: 1.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        xorl %eax, %eax
1      1      0.25                        cmpl %esi, %edi
1      1      0.25                        cmovgl %edi, %eax
2      1      0.50                 U      retq
AArch64: cortex-a57 (csel latency is 1)
Run: clang clampNegToZero.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57
Before:
Iterations: 100
Instructions: 300
Total Cycles: 303
Total uOps: 300
Dispatch Width: 3
uOps Per Cycle: 0.99
IPC: 0.99
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.50                        sub w8, w1, w0
1      2      1.00                        and w0, w0, w8, asr #31
1      1      1.00                 U      ret
After this transformation:
Iterations: 100
Instructions: 300
Total Cycles: 203
Total uOps: 300
Dispatch Width: 3
uOps Per Cycle: 1.48
IPC: 1.48
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.50                        cmp w0, w1
1      1      0.50                        csel w0, w0, wzr, gt
1      1      1.00                 U      ret
--- Vector Tests ---
Test input (clampNegToZero-vec.ll):
define <4 x i32> @clamp0-vec(<4 x i32> %v, <4 x i32> %x) {
%sub = sub nsw <4 x i32> %x, %v
%shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
%and = and <4 x i32> %shr, %v
ret <4 x i32> %and
}
X86: skylake
Run: clang clampNegToZero-vec.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake
Before:
Iterations: 100
Instructions: 400
Total Cycles: 303
Total uOps: 600
Dispatch Width: 6
uOps Per Cycle: 1.98
IPC: 1.32
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.33                        vpsubd %xmm0, %xmm1, %xmm1
1      1      0.50                        vpsrad $31, %xmm1, %xmm1
1      1      0.33                        vpand %xmm0, %xmm1, %xmm0
3      7      1.00                 U      retq
After this transformation:
Iterations: 100
Instructions: 300
Total Cycles: 203
Total uOps: 500
Dispatch Width: 6
uOps Per Cycle: 2.46
IPC: 1.48
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.50                        vpcmpgtd %xmm1, %xmm0, %xmm1
1      1      0.33                        vpand %xmm0, %xmm1, %xmm0
3      7      1.00                 U      retq
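In the vector case the folded code needs no select: the signed compare already produces an all-ones or all-zeros mask per lane, so a single `and` with %v finishes the job. A rough Python model of that compare-then-and sequence on four i32 lanes follows. The helper names are hypothetical, and lanes are plain Python ints assumed to be in i32 range.

```python
MASK32 = 0xFFFFFFFF


def to_s32(n):
    """Reinterpret the low 32 bits of n as a signed i32."""
    n &= MASK32
    return n - (1 << 32) if n & 0x80000000 else n


def cmpgt_and(v, x):
    """Per lane: mask = all-ones if v > x (signed) else 0, then mask & v."""
    out = []
    for vi, xi in zip(v, x):
        mask = MASK32 if vi > xi else 0
        out.append(to_s32(mask & (vi & MASK32)))
    return out


print(cmpgt_and([5, -3, 0, 7], [3, -5, 0, 9]))
```

Lanes where %v is not greater than %x come out as zero; the others pass %v through unchanged.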
AMD: znver2
Run: clang clampNegToZero-vec.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2
Before:
Iterations: 100
Instructions: 400
Total Cycles: 303
Total uOps: 500
Dispatch Width: 4
uOps Per Cycle: 1.65
IPC: 1.32
Block RThroughput: 1.3
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        vpsubd %xmm0, %xmm1, %xmm1
1      1      0.25                        vpsrad $31, %xmm1, %xmm1
1      1      0.25                        vpand %xmm0, %xmm1, %xmm0
2      1      0.50                 U      retq
After this transformation:
Iterations: 100
Instructions: 300
Total Cycles: 203
Total uOps: 400
Dispatch Width: 4
uOps Per Cycle: 1.97
IPC: 1.48
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      1      0.25                        vpcmpgtd %xmm1, %xmm0, %xmm1
1      1      0.25                        vpand %xmm0, %xmm1, %xmm0
2      1      0.50                 U      retq
AArch64: cortex-a57
Run: clang clampNegToZero-vec.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57
Before:
Iterations: 100
Instructions: 400
Total Cycles: 903
Total uOps: 400
Dispatch Width: 3
uOps Per Cycle: 0.44
IPC: 0.44
Block RThroughput: 1.5
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      3      0.50                        sub v1.4s, v1.4s, v0.4s
1      3      0.50                        sshr v1.4s, v1.4s, #31
1      3      0.50                        and v0.16b, v1.16b, v0.16b
1      1      1.00                 U      ret
After this transformation:
Iterations: 100
Instructions: 300
Total Cycles: 603
Total uOps: 300
Dispatch Width: 3
uOps Per Cycle: 0.50
IPC: 0.50
Block RThroughput: 1.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
1      3      0.50                        cmgt v1.4s, v0.4s, v1.4s
1      3      0.50                        and v0.16b, v0.16b, v1.16b
1      1      1.00                 U      ret
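To compare the runs at a glance, the Total Cycles figures from the reports above can be tabulated as in the sketch below. The numbers are copied from this comment; the dictionary layout and helper name are only illustrative.

```python
# (test, target) -> (total cycles before, total cycles after), 100 iterations each
cycles = {
    ("scalar", "skylake"): (159, 110),
    ("scalar", "cooperlake"): (159, 110),
    ("scalar", "znver2"): (155, 203),
    ("scalar", "cortex-a57"): (303, 203),
    ("vector", "skylake"): (303, 203),
    ("vector", "znver2"): (303, 203),
    ("vector", "cortex-a57"): (903, 603),
}


def regressions(table):
    """Return the entries whose simulated cycle count went up after the fold."""
    return [key for key, (before, after) in table.items() if after > before]


for (kind, target), (before, after) in cycles.items():
    print(f"{kind:6} {target:11} {before:4} -> {after:4}  ({before / after:.2f}x)")

print("regressions:", regressions(cycles))
```

By this measure the scalar znver2 run is the only one where the simulated cycle count increases; every other configuration improves.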
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D67799