[PATCH] D67799: [InstCombine] Fold a shifty implementation of clamp negative to zero.

Mon Sep 23 00:36:43 PDT 2019

huihuiz added a comment.

llvm-mca results for the more general folding pattern:

--- Scalar Tests ---

X86: skylake (cmovgl latency 1)

Test input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake

  define i32 @clamp0(i32 %v, i32 %x) {
    %sub = sub nsw i32 %x, %v
    %shr = ashr i32 %sub, 31
    %and = and i32 %shr, %v
    ret i32 %and
  }
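As a sanity check on the fold itself, here is a small Python sketch (not part of the patch) that models the IR above and compares it with the compare-and-select form that the cmp + cmovg / csel sequences in the "after" reports correspond to. The helper names and the 8-bit width are assumptions chosen so the check can be exhaustive; inputs where the subtraction would overflow are skipped, since `nsw` makes the result poison there.

```python
BITS = 8  # assumed narrow width so the check is exhaustive; the test itself uses i32
LO, HI = -(1 << (BITS - 1)), (1 << (BITS - 1)) - 1


def shifty(v, x):
    """Model of the IR: and (ashr (sub nsw x, v), BITS-1), v."""
    sub = x - v
    if not LO <= sub <= HI:
        return None  # nsw overflow: the result is poison, so it is not compared
    mask = sub >> (BITS - 1)  # Python's >> on ints is arithmetic: yields 0 or -1
    return mask & v


def select_form(v, x):
    """Compare-and-select form: v if v > x (signed), otherwise 0."""
    return v if v > x else 0


def check_all():
    """Return the number of non-overflowing (v, x) pairs where the forms disagree."""
    mismatches = 0
    for v in range(LO, HI + 1):
        for x in range(LO, HI + 1):
            expected = shifty(v, x)
            if expected is not None and expected != select_form(v, x):
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", check_all())
```

The two forms agree because, without signed overflow, `x - v` is negative exactly when `v > x`, so the arithmetic shift yields an all-ones mask in precisely the cases where the select picks `v`.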

Before:

  Iterations:        100
  Instructions:      500
  Total Cycles:      159
  Total uOps:        700

  Dispatch Width:    6
  uOps Per Cycle:    4.40
  IPC:               3.14
  Block RThroughput: 1.2

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        movl  %esi, %eax
   1      1     0.25                        subl  %edi, %eax
   1      1     0.50                        sarl  $31, %eax
   1      1     0.25                        andl  %edi, %eax
   3      7     1.00                  U     retq

After this transformation:

  Iterations:        100
  Instructions:      400
  Total Cycles:      110
  Total uOps:        600

  Dispatch Width:    6
  uOps Per Cycle:    5.45
  IPC:               3.64
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      0     0.17                        xorl  %eax, %eax
   1      1     0.25                        cmpl  %esi, %edi
   1      1     0.50                        cmovgl        %edi, %eax
   3      7     1.00                  U     retq

X86: cooperlake (cmovgl latency also 1)

Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=cooperlake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=cooperlake

Before:

  Iterations:        100
  Instructions:      500
  Total Cycles:      159
  Total uOps:        700

  Dispatch Width:    6
  uOps Per Cycle:    4.40
  IPC:               3.14
  Block RThroughput: 1.2

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        movl  %esi, %eax
   1      1     0.25                        subl  %edi, %eax
   1      1     0.50                        sarl  $31, %eax
   1      1     0.25                        andl  %edi, %eax
   3      7     1.00                  U     retq

After this transformation:

  Iterations:        100
  Instructions:      400
  Total Cycles:      110
  Total uOps:        600

  Dispatch Width:    6
  uOps Per Cycle:    5.45
  IPC:               3.64
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      0     0.17                        xorl  %eax, %eax
   1      1     0.25                        cmpl  %esi, %edi
   1      1     0.50                        cmovgl        %edi, %eax
   3      7     1.00                  U     retq

AMD: znver2

Same input; run: clang clampNegToZero.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2

Before:

  Iterations:        100
  Instructions:      500
  Total Cycles:      155
  Total uOps:        600

  Dispatch Width:    4
  uOps Per Cycle:    3.87
  IPC:               3.23
  Block RThroughput: 1.5

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        movl  %esi, %eax
   1      1     0.25                        subl  %edi, %eax
   1      1     0.25                        sarl  $31, %eax
   1      1     0.25                        andl  %edi, %eax
   2      1     0.50                  U     retq

After this transformation:

  Iterations:        100
  Instructions:      400
  Total Cycles:      203
  Total uOps:        500

  Dispatch Width:    4
  uOps Per Cycle:    2.46
  IPC:               1.97
  Block RThroughput: 1.3

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        xorl  %eax, %eax
   1      1     0.25                        cmpl  %esi, %edi
   1      1     0.25                        cmovgl        %edi, %eax
   2      1     0.50                  U     retq

AArch64: cortex-a57 (csel latency 1)

Run: clang clampNegToZero.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57

Before:

  Iterations:        100
  Instructions:      300
  Total Cycles:      303
  Total uOps:        300

  Dispatch Width:    3
  uOps Per Cycle:    0.99
  IPC:               0.99
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.50                        sub   w8, w1, w0
   1      2     1.00                        and   w0, w0, w8, asr #31
   1      1     1.00                  U     ret

After this transformation:

  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        300

  Dispatch Width:    3
  uOps Per Cycle:    1.48
  IPC:               1.48
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.50                        cmp   w0, w1
   1      1     0.50                        csel  w0, w0, wzr, gt
   1      1     1.00                  U     ret

--- Vector Tests ---

Test input:

  define <4 x i32> @clamp0-vec(<4 x i32> %v, <4 x i32> %x) {
    %sub = sub nsw <4 x i32> %x, %v
    %shr = ashr <4 x i32> %sub, <i32 31, i32 31, i32 31, i32 31>
    %and = and <4 x i32> %shr, %v
    ret <4 x i32> %and
  }

X86: skylake

Run: clang clampNegToZero-vec.ll -O2 -target x86_64 -march=skylake -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=skylake

Before:

  Iterations:        100
  Instructions:      400
  Total Cycles:      303
  Total uOps:        600

  Dispatch Width:    6
  uOps Per Cycle:    1.98
  IPC:               1.32
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.33                        vpsubd        %xmm0, %xmm1, %xmm1
   1      1     0.50                        vpsrad        $31, %xmm1, %xmm1
   1      1     0.33                        vpand %xmm0, %xmm1, %xmm0
   3      7     1.00                  U     retq

After this transformation:

  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        500

  Dispatch Width:    6
  uOps Per Cycle:    2.46
  IPC:               1.48
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.50                        vpcmpgtd      %xmm1, %xmm0, %xmm1
   1      1     0.33                        vpand %xmm0, %xmm1, %xmm0
   3      7     1.00                  U     retq

AMD: znver2

Run: clang clampNegToZero-vec.ll -O2 -target x86_64 -march=znver2 -S -o - | llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=znver2

Before:

  Iterations:        100
  Instructions:      400
  Total Cycles:      303
  Total uOps:        500

  Dispatch Width:    4
  uOps Per Cycle:    1.65
  IPC:               1.32
  Block RThroughput: 1.3

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        vpsubd        %xmm0, %xmm1, %xmm1
   1      1     0.25                        vpsrad        $31, %xmm1, %xmm1
   1      1     0.25                        vpand %xmm0, %xmm1, %xmm0
   2      1     0.50                  U     retq

After this transformation:

  Iterations:        100
  Instructions:      300
  Total Cycles:      203
  Total uOps:        400

  Dispatch Width:    4
  uOps Per Cycle:    1.97
  IPC:               1.48
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      1     0.25                        vpcmpgtd      %xmm1, %xmm0, %xmm1
   1      1     0.25                        vpand %xmm0, %xmm1, %xmm0
   2      1     0.50                  U     retq

AArch64: cortex-a57

Run: clang clampNegToZero-vec.ll -O2 -target aarch64 -mcpu=cortex-a57 -S -o - | llvm-mca -mtriple=aarch64 -mcpu=cortex-a57

Before:

  Iterations:        100
  Instructions:      400
  Total Cycles:      903
  Total uOps:        400

  Dispatch Width:    3
  uOps Per Cycle:    0.44
  IPC:               0.44
  Block RThroughput: 1.5

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      3     0.50                        sub   v1.4s, v1.4s, v0.4s
   1      3     0.50                        sshr  v1.4s, v1.4s, #31
   1      3     0.50                        and   v0.16b, v1.16b, v0.16b
   1      1     1.00                  U     ret

After this transformation:

  Iterations:        100
  Instructions:      300
  Total Cycles:      603
  Total uOps:        300

  Dispatch Width:    3
  uOps Per Cycle:    0.50
  IPC:               0.50
  Block RThroughput: 1.0

  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      3     0.50                        cmgt  v1.4s, v0.4s, v1.4s
   1      3     0.50                        and   v0.16b, v0.16b, v1.16b
   1      1     1.00                  U     ret
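To compare the runs side by side, the Total Cycles figures can be reduced to cycles per iteration. The Python sketch below only restates numbers copied from the reports above; the dictionary layout and function names are illustrative assumptions, not llvm-mca output.

```python
ITERATIONS = 100  # every report above uses 100 iterations

# (test, cpu) -> (Total Cycles before, Total Cycles after), copied from the reports.
TOTAL_CYCLES = {
    ("scalar", "skylake"): (159, 110),
    ("scalar", "cooperlake"): (159, 110),
    ("scalar", "znver2"): (155, 203),
    ("scalar", "cortex-a57"): (303, 203),
    ("vector", "skylake"): (303, 203),
    ("vector", "znver2"): (303, 203),
    ("vector", "cortex-a57"): (903, 603),
}


def cycles_per_iteration(total_cycles, iterations=ITERATIONS):
    """Average simulated cycles spent on one iteration of the block."""
    return total_cycles / iterations


def regressions(table=TOTAL_CYCLES):
    """Return the (test, cpu) keys whose Total Cycles went up after the fold."""
    return [key for key, (before, after) in table.items() if after > before]


if __name__ == "__main__":
    for (test, cpu), (before, after) in TOTAL_CYCLES.items():
        print(f"{test:6} {cpu:11} "
              f"{cycles_per_iteration(before):5.2f} -> {cycles_per_iteration(after):5.2f}")
    print("more cycles after the fold:", regressions())
```

In these simulated numbers, every configuration needs fewer cycles after the transformation except the scalar znver2 run, which goes from 155 to 203 total cycles.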

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D67799