[PATCH] D87976: Support the division-by-constant strength reduction for more integer types

Simonas Kazlauskas via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Sep 27 15:04:55 PDT 2020


nagisa added a comment.

In D87976#2297076 <https://reviews.llvm.org/D87976#2297076>, @efriedma wrote:

> I'm skeptical this is a good idea when the division is wider than the widest legal mulhi.  You end up generating either a ton of inline code, or a libcall; the result might not be faster than the original divide libcall.

LLVM (with this change) is definitely going to generate a //ton// of inline code. For instance, a single function containing `sdiv i1024` generates a whopping 2238 lines of x86_64 assembly. However, it now manages to generate assembly at all: LLVM previously could not, because it knows of no division libcalls for types wider than 128 bits. So that is already somewhat of an improvement. Nor is this problem unique to division: legalization of other super-wide operations, even ones as simple as `add` or various shifts, also expands to large amounts of code.

You are also right that with this change LLVM may generate multiplication libcalls (for instance when dividing `i128` integers). But multiplications are comparatively easy to implement efficiently in software, where a wide multiply becomes a tree of narrower multiplications added together. Generating multiplication libcalls is therefore perhaps even desirable, as it somewhat reduces the amount of code emitted inline. Nothing similar is possible for division.

Ultimately, however, `i64` and `i128` operations are all that matter in practice. Here is a quick comparison, the best I could produce in the short amount of time I had for this comment:

| Target & operation         | div RThroughput | This expansion RThroughput |
| i686 core2: `i64 / 42`     | 18-37**[^2]**   | 13.8**[^1]**               |
| i686 core2: `i128 / 42`    | libcall (?)     | 61.0**[^1]**               |
| x86_64 znver2: `i128 / 42` | 13-44**[^3]**   | 8.0**[^4]**                |

**[^1]**: Calculated by `llvm-mca -mcpu=core2 -mtriple=i686`
**[^2]**: Taken from Agner's instruction tables for “Intel Core 2 (Merom, 65nm)”.
**[^3]**: Taken from Agner's instruction tables for “AMD Zen 2” (used zen instead of skylake, because zen's native instruction throughput is //better//).
**[^4]**: Calculated by `llvm-mca -mcpu=znver2 -mtriple=x86_64`

In both instances where a native division instruction is available, the strength-reduced expansion has better throughput than the best case of the //native// (though most likely micro-coded) division instruction. I strongly doubt a software implementation of division could do better than such a native instruction at all, although I could see the scales tipping on targets that have no native multiplication instruction.

Finally, `isIntDivCheap` exists and should allow targets to prevent this optimisation where it makes sense for them?

---

For these computations I used the following snippet of code (and its equivalent with `s/i128/i64/` applied for i686):

  define dso_local i128 @foo(i128 %x) local_unnamed_addr #0 {
  entry:
    %d = udiv i128 %x, 42
    ret i128 %d
  }



---

In D87976#2297078 <https://reviews.llvm.org/D87976#2297078>, @efriedma wrote:

> Also, I think I remember some discussion that the compiler-rt implementations of division on x86 have performance issues.

We recently got heavily optimised software division implementations in Rust's compiler-builtins. I could compare against those as well, but many of them are very architecture-specific, and I don’t have good means for cycle-accurate measurement outside of x86_64.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D87976/new/

https://reviews.llvm.org/D87976



More information about the llvm-commits mailing list