[PATCH] D50310: Improve the legalisation lowering of UMULO

Sun Aug 5 08:00:35 PDT 2018

nagisa created this revision.
nagisa added a reviewer: echristo.
Herald added a reviewer: javed.absar.
Herald added a subscriber: kristof.beyls.

  There is no way in the universe that doing a full-width division in
  software will be faster than doing the overflowing multiplication in
  software in the first place, especially given that the same full-width
  multiplication needs to be done anyway.

  This patch replaces the previous implementation with a direct lowering
  into an overflowing multiplication algorithm based on half-width
  operations.

  Correctness of the algorithm was verified by exhaustively checking its
  output for overflowing multiplication of 16-bit integers against an
  obviously correct widening multiplication. Barring any oversights
  introduced by porting the algorithm to DAG, confidence in the
  correctness of this algorithm is extremely high.
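
  As an illustration only, here is a minimal C++ sketch of an overflowing
  16-bit multiply built from 8-bit halves, together with the kind of
  exhaustive check described above. It is not the DAG code from the
  patch, and the exact sequence of nodes the patch emits may differ. The
  names umulo8, umulo16_from_halves, umulo16_reference and
  count_mismatches are hypothetical.

```cpp
#include <cstdint>

struct MulResult8  { std::uint8_t  value; bool overflow; };
struct MulResult16 { std::uint16_t value; bool overflow; };

// Half-width overflowing multiply. Stands in for an operation the target
// is assumed to be able to lower directly.
MulResult8 umulo8(std::uint8_t a, std::uint8_t b) {
    unsigned wide = static_cast<unsigned>(a) * static_cast<unsigned>(b);
    return { static_cast<std::uint8_t>(wide), wide > 0xFFu };
}

// Full-width overflowing multiply assembled from half-width pieces.
MulResult16 umulo16_from_halves(std::uint16_t lhs, std::uint16_t rhs) {
    std::uint8_t lhs_lo = static_cast<std::uint8_t>(lhs & 0xFFu);
    std::uint8_t lhs_hi = static_cast<std::uint8_t>(lhs >> 8);
    std::uint8_t rhs_lo = static_cast<std::uint8_t>(rhs & 0xFFu);
    std::uint8_t rhs_hi = static_cast<std::uint8_t>(rhs >> 8);

    // If both high halves are non-zero, the product is at least 2^16.
    bool both_high = (lhs_hi != 0) && (rhs_hi != 0);

    // Cross terms: each must itself fit in the half width.
    MulResult8 cross1 = umulo8(lhs_hi, rhs_lo);
    MulResult8 cross2 = umulo8(rhs_hi, lhs_lo);

    // Low halves multiplied at full width; this can never overflow.
    unsigned low = static_cast<unsigned>(lhs_lo) * static_cast<unsigned>(rhs_lo);

    // Sum of the cross terms shifted into the high half, wrapping mod 2^16.
    unsigned cross =
        ((static_cast<unsigned>(cross1.value) + cross2.value) << 8) & 0xFFFFu;

    // Final addition, whose carry-out is the last overflow condition.
    unsigned sum = cross + low;
    bool add_overflow = sum > 0xFFFFu;

    return { static_cast<std::uint16_t>(sum),
             both_high || cross1.overflow || cross2.overflow || add_overflow };
}

// Obviously correct reference: widen to 32 bits, multiply, inspect the top.
MulResult16 umulo16_reference(std::uint16_t lhs, std::uint16_t rhs) {
    std::uint32_t wide = static_cast<std::uint32_t>(lhs) * rhs;
    return { static_cast<std::uint16_t>(wide), wide > 0xFFFFu };
}

// Counts input pairs on which the two implementations disagree.
// stride == 1 visits all 2^32 pairs; larger strides sample the space.
std::uint64_t count_mismatches(unsigned stride) {
    if (stride == 0) stride = 1;
    std::uint64_t mismatches = 0;
    for (unsigned a = 0; a <= 0xFFFFu; a += stride) {
        for (unsigned b = 0; b <= 0xFFFFu; b += stride) {
            MulResult16 got  = umulo16_from_halves(static_cast<std::uint16_t>(a),
                                                   static_cast<std::uint16_t>(b));
            MulResult16 want = umulo16_reference(static_cast<std::uint16_t>(a),
                                                 static_cast<std::uint16_t>(b));
            if (got.value != want.value || got.overflow != want.overflow)
                ++mismatches;
        }
    }
    return mismatches;
}
```

  Calling count_mismatches(1) walks every pair of 16-bit inputs and
  should return 0 if the half-width construction agrees with the
  widening reference. A larger stride gives a quicker spot check.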

  The following table shows the change in both runtime (t) and code size
  (s). The change is expressed as a multiplier of the original, so
  anything under 1 is “better” and anything above 1 is “worse”.

  +-------+-----------+-----------+-------------+-------------+
  | Arch  | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s |
  +-------+-----------+-----------+-------------+-------------+
  |   X64 |     -     |     -     |    ~0.5     |    ~0.64    |
  |  i686 |   ~0.5    |   ~0.6666 |    ~0.05    |    ~0.9     |
  | armv7 |     -     |   ~0.75   |      -      |    ~1.4     |
  +-------+-----------+-----------+-------------+-------------+

  Performance numbers were collected by running overflowing
  multiplication in a loop under `perf` on two x86_64-based machines (one
  Intel Haswell, the other AMD Ryzen). Size numbers were collected by
  looking at the size of a function containing an overflowing multiply
  in a loop.

  All in all, both performance and size have improved, except on armv7,
  where code size has regressed for the 128-bit multiply. The u128*u128
  overflowing multiply on 32-bit platforms seems to benefit from this
  change a lot, taking only 5% of the time the original algorithm needed
  to calculate the same thing.

  The final benefit of this change is that LLVM is now capable of lowering
  the overflowing unsigned multiply for integers of any bit-width as long
  as the target is capable of lowering regular multiplication for the same
  bit-width. Previously, 128-bit overflowing multiply was the widest
  possible.

---

Notes:

- This change might have broken some tests I have not caught. I have no idea what tests are present and how to run them all, so I’ll leave it up to CI to build and run the tests.
  - `ninja check-all` seems to pass locally, but 1) I don’t have all targets enabled; and 2) some of my previous revisions failed tests at CI even when I had all targets enabled…
- I have no idea how style in LLVM is enforced so I tried my best to match style with the surrounding code by hand;
- I have no idea who the reviewers should be so I just picked Eric who seems to have introduced this code in the first place.

Repository:
  rL LLVM

https://reviews.llvm.org/D50310

Files:
  lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
  test/CodeGen/X86/muloti.ll
  test/CodeGen/X86/select.ll
