[PATCH] D50222: [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case)

Thu Aug 30 21:25:34 PDT 2018

hermord marked 10 inline comments as done.
hermord added inline comments.

================
Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3753-3755
+  // If D0 == 1, we cannot build this fold. Otherwise, D0 > 1, as needed.
+  if (D0.isOneValue())
+    return SDValue();
----------------
lebedev.ri wrote:
> So what we are really checking here, is that `D` is not power of two.
> How about simplifying this into:
> ```
> APInt D = Divisor->getAPIntValue();
> unsigned W = D.getBitWidth();
> 
> // If D is power of two, D0 would be 1, and we cannot build this fold.
> if(D.isPowerOf2())
>   return SDValue();
> 
> // Decompose D into D0 * 2^K
> ...
> ```
> 
> BUT. I do not understand from where does this requirement comes from?
> Can you quote specific part of [[ https://doc.lagout.org/security/Hackers%20Delight.pdf | `10–17 Test for Zero Remainder after Division by a Constant` ]] that warrants this?
The algorithm also fails in general in this case because the inverse of 1 is 1 itself, and we get, for example:

```
Given: N: 1, D: 4, C: 0

Then:
K = 2
D0 = 1
P = 1
Q = (2^32 - 1)/1 = 2^32 - 1

And so: (P >>rot 2) == 2^30 < 2^32 - 1 == Q, so the condition holds,
but N % D != 0 
```
The book mentions this in passing: "This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1 [...]" (p. 227).

================
Comment at: test/CodeGen/X86/jump_sign.ll:239-243
+; CHECK-NEXT:    imull $-13107, %eax, %eax
+; CHECK-NEXT:    rorw $1, %ax
+; CHECK-NEXT:    movzwl	%ax, %eax
+; CHECK-NEXT:    cmpl	$13108, %eax
+; CHECK-NEXT:    jae	.LBB12_5
----------------
lebedev.ri wrote:
> What do you mean by breaks? Is this a miscompilation?
Oh my bad, to clarify: I was referring to `test1` in this file and not the one that's changed here. This one is fine, but the code produced for `test1` after this patch would be less optimal than before. The output of `BuildUREMEqFold` just doesn't seem to fall into patterns that we're already optimizing well in this particular case (`N` is extended from `i1`, then `AND`'ed and then passed into an UREM), and we'd probably have to do a new `KnownBits`-based optimization elsewhere to avoid this. Via `InstCombine` this can actually be fixed without close to no extra overhead because we are already computing the necessary bits anyway.

Repository:
  rL LLVM

https://reviews.llvm.org/D50222