[PATCH] D88785: Support {S,U}REMEqFold before legalization

Simonas Kazlauskas via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Mar 12 08:40:59 PST 2021


nagisa added a comment.

I went through the list of what looked like regressions and attempted to root-cause and validate them. So far there seem to be a few major differences between old/new:

1. In some cases the old code would lower to a shorter instruction sequence via the `BuildUDIV` and `BuildSDIV` strength reductions. While the code is shorter in those cases, `llvm-mca` suggests that the throughput of the `rem-seteq` lowering is better, despite the fact that it produces more instructions overall;
2. For the newly added `illegal-types` tests, a number of regressions occur due to the `and` masking needed to clear the upper bits of registers between the operations that the `rem-seteq` lowering produces. This is a genuine regression and generally costs us 1 or 2 extra `and` masking operations;
3. The new lowering needs more constants to operate, which on some backends requires more instructions to prepare the registers holding said constants (especially on targets such as MIPS, which need multiple instructions to materialize a single constant).
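
For reference, the core of the `rem-seteq` fold for an odd divisor can be modelled as follows (a hypothetical Python sketch, not the actual DAG combine; the function name and parameters are made up for illustration):

```python
def urem_is_zero(x: int, d: int, n: int = 32) -> bool:
    """Model of x % d == 0 via multiply-by-inverse, for odd d and n-bit x."""
    assert d % 2 == 1, "this sketch only handles odd divisors"
    mask = (1 << n) - 1                # n-bit wrap-around
    d_inv = pow(d, -1, 1 << n)         # constant #1: inverse of d mod 2^n
    q = (x * d_inv) & mask             # one multiply instead of a division
    return q <= mask // d              # constant #2: floor((2^n - 1) / d)
```

The two constants here are exactly the "needs more constants" cost mentioned in point 3.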

See the inline comments for what the specific changes look like. Hopefully these address your concern about codegen regressions in AArch64, @lebedev.ri.



================
Comment at: llvm/test/CodeGen/AArch64/srem-seteq-vec-nonsplat.ll:235
+; CHECK-NEXT:    orr v0.16b, v2.16b, v0.16b
+; CHECK-NEXT:    cmhs v0.4s, v1.4s, v0.4s
 ; CHECK-NEXT:    movi v1.4s, #1
----------------
This instruction count regression (and the one above) looks like a combination of the "more constants" cost and the previous lowering through `BuildSDIV`. In particular, there appear to be 6 additional instructions here just for moving constants into registers.

llvm-mca claims a better throughput (lower cycle count) for the new version on all CPUs I tried: apple-a7, cortex-a53, exynos-m3, cortex-x1.


================
Comment at: llvm/test/CodeGen/AArch64/srem-seteq-vec-splat.ll:45
+; CHECK-NEXT:    orr v0.16b, v1.16b, v0.16b
+; CHECK-NEXT:    cmhs v0.4s, v3.4s, v0.4s
 ; CHECK-NEXT:    movi v1.4s, #1
----------------
Looks like a combination of "needs more constants" and "previously used lowering through BuildSDIV".


================
Comment at: llvm/test/CodeGen/AArch64/urem-seteq-vec-splat.ll:39
+; CHECK-NEXT:    orr v0.16b, v0.16b, v1.16b
+; CHECK-NEXT:    cmhs v0.4s, v2.4s, v0.4s
 ; CHECK-NEXT:    movi v1.4s, #1
----------------
This instruction count regression appears to be occurring because codegen used to fall back to the strength reduction implemented by `BuildUDIV` and then compare against 0. The `rem-seteq` lowering needs additional constants.
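
For comparison, a minimal model of the old `BuildUDIV`-style path (a hypothetical sketch using an over-wide magic multiplier for simplicity; LLVM's real code computes a tighter n-bit magic and uses a high-half multiply):

```python
def old_rem_seteq(x: int, d: int, n: int = 32) -> bool:
    """Model of the old path: strength-reduced division, then rem == 0."""
    # M = ceil(2^(2n) / d) makes (x * M) >> 2n equal x // d for all n-bit x.
    m = -(-(1 << 2 * n) // d)          # ceiling division
    q = (x * m) >> (2 * n)             # strength-reduced quotient
    return x - q * d == 0              # recompute remainder, compare to 0
```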

It is, however, an improvement in cycle count as reported by llvm-mca (on apple-a10), primarily due to increased IPC:

```
CMD: llvm-mca -mtriple=aarch64 -mcpu=apple-a10

OLD:
Iterations:        100
Instructions:      1300
Total Cycles:      1712
Total uOps:        1400
Dispatch Width:    6
uOps Per Cycle:    0.82
IPC:               0.76
Block RThroughput: 3.3

NEW:
Iterations:        100
Instructions:      1400
Total Cycles:      1312
Total uOps:        1600
Dispatch Width:    6
uOps Per Cycle:    1.22
IPC:               1.07
Block RThroughput: 3.0
```


================
Comment at: llvm/test/CodeGen/ARM/urem-seteq-illegal-types.ll:388
+; ARM6-NEXT:    orr r4, r4, #1024
+; ARM6-NEXT:    mla r1, r1, r3, r12
 ; ARM6-NEXT:    mov r12, #255
----------------
I have no idea why the backend decided to construct the constants through code here, rather than loading them as it does for the other targets.

This and similar changes below also make it quite difficult to compare which of the old or new code is better.


================
Comment at: llvm/test/CodeGen/ARM/urem-seteq-illegal-types.ll:770
+; NEON8-NEXT:    sbcs r1, r2, #1
+; NEON8-NEXT:    movwlo r0, #1
 ; NEON8-NEXT:    pop {r11, pc}
----------------
Was calling `__umoddi3` before; now multiplies inline.


================
Comment at: llvm/test/CodeGen/Mips/urem-seteq-illegal-types.ll:135
 ; MIPS64EL-NEXT:    jr $ra
-; MIPS64EL-NEXT:    sltu $2, $2, $1
+; MIPS64EL-NEXT:    sltu $2, $3, $1
   %urem = urem i9 %X, -5
----------------
As far as I can tell, this instruction count regression comes from a combination of the new lowering needing more constants (which must be placed into registers) and the fact that bits need to be masked out between the intermediate operations that we lower down to.
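
To illustrate where the extra masking comes from, here is a hypothetical sketch for the `urem i9 %X, -5` case above (where `-5` wraps to 507 in i9; this models the arithmetic, not the MIPS lowering itself):

```python
def i9_urem_507_is_zero(x: int) -> bool:
    """Model of (i9)x % 507 == 0, with 507 being -5 wrapped to 9 bits."""
    mask9 = (1 << 9) - 1
    d_inv = pow(507, -1, 1 << 9)       # inverse of 507 mod 2^9
    q = (x * d_inv) & mask9            # the extra `and`: the multiply runs
                                       # in a wider legal register, so the
                                       # high bits must be cleared
    return q <= mask9 // 507
```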


================
Comment at: llvm/test/CodeGen/Mips/urem-seteq-illegal-types.ll:217
+; MIPS64EL-NEXT:    or $1, $3, $1
+; MIPS64EL-NEXT:    andi $1, $1, 3
 ; MIPS64EL-NEXT:    jr $ra
----------------
This would previously just call the i128 `__umodti3` libcall, and now unrolls the multiplication over the 66-bit integers inline.
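
The unrolling amounts to a limb-by-limb multiply. A hypothetical sketch of computing the low bits from 64-bit limbs (names made up for illustration; the backend does this on machine registers):

```python
def mul_lo_128(a: int, b: int) -> int:
    """Low 128 bits of a * b, built from 64-bit limb products."""
    mask64 = (1 << 64) - 1
    a0, a1 = a & mask64, (a >> 64) & mask64
    b0, b1 = b & mask64, (b >> 64) & mask64
    lo = a0 * b0                       # contributes to bits 0..127
    mid = (a0 * b1 + a1 * b0) << 64    # cross terms; a1*b1 overflows out
    return (lo + mid) & ((1 << 128) - 1)
```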


================
Comment at: llvm/test/CodeGen/PowerPC/urem-seteq-illegal-types.ll:273
+; PPC64LE-NEXT:    crnand 20, 2, 4
+; PPC64LE-NEXT:    isel 3, 0, 3, 20
 ; PPC64LE-NEXT:    blr
----------------
Used to call `__umodti3`; now multiplies inline.


================
Comment at: llvm/test/CodeGen/Thumb2/urem-seteq-illegal-types.ll:11
 ; CHECK-NEXT:    movs r0, #0
-; CHECK-NEXT:    cmn.w r1, #-858993460
+; CHECK-NEXT:    bfc r1, #13, #19
+; CHECK-NEXT:    cmp r1, r2
----------------
This and the next regression are due to bit masking between the intermediate operations.


================
Comment at: llvm/test/CodeGen/X86/urem-seteq-illegal-types.ll:54
+; X64-NEXT:    andl $134217727, %eax # imm = 0x7FFFFFF
+; X64-NEXT:    cmpl $9586981, %eax # imm = 0x924925
 ; X64-NEXT:    setb %al
----------------
A regression that was not present before the rem-seteq fold allowed illegal types through, caused mostly by the backend being forced to legalize the `i27` value at each intermediate step.

I believe we can live with this regression, given its nature and how unlikely it is to occur in practice.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88785/new/

https://reviews.llvm.org/D88785


