[PATCH] D132828: Add new optimization pass of Tree Height Reduction

Mon Aug 29 00:44:20 PDT 2022

kawashima-fj created this revision.
kawashima-fj added reviewers: hfinkel, xbolva00, fhahn.
kawashima-fj added projects: LLVM, LoopOptWG.
Herald added subscribers: ormris, wenlei, steven_wu, hiraditya, mgorny.
Herald added a project: All.
kawashima-fj requested review of this revision.
Herald added a reviewer: jdoerfert.
Herald added subscribers: llvm-commits, sstefan1.

The tree height reduction optimization increases instruction-level parallelism (ILP) by reordering the operations in a loop so that the operation tree is as shallow as possible.

For example, see the following code.

  for (int i = 0; i < N; ++i)
    x[i] = a0[i] + a1[i] + a2[i] + a3[i] + a4[i] + a5[i] + a6[i] + a7[i];

This is equivalent to the following code. Each addition depends on the result of the preceding addition.

  for (int i = 0; i < N; ++i)
    x[i] = ((((((a0[i] + a1[i]) + a2[i]) + a3[i]) + a4[i]) + a5[i]) + a6[i]) + a7[i];

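Tree height reduction transforms it into the following code. The additions in the innermost parentheses are independent of each other and can be executed in parallel, so the longest dependency chain shrinks from seven additions to three.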

  for (int i = 0; i < N; ++i)
    x[i] = ((a0[i] + a1[i]) + (a2[i] + a3[i])) + ((a4[i] + a5[i]) + (a6[i] + a7[i]));
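
For integer addition the two groupings produce the same value, and only the depth of the dependency chain changes. The following is a minimal sketch of one loop iteration, not code from the patch: the function names are hypothetical, and unsigned operands are assumed so that wraparound is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: left-to-right chain, as the original loop body is parsed. */
uint32_t sum_chain(const uint32_t a[8]) {
    return ((((((a[0] + a[1]) + a[2]) + a[3]) + a[4]) + a[5]) + a[6]) + a[7];
}

/* Hypothetical helper: the balanced form after tree height reduction. */
uint32_t sum_balanced(const uint32_t a[8]) {
    return ((a[0] + a[1]) + (a[2] + a[3])) + ((a[4] + a[5]) + (a[6] + a[7]));
}

/* Number of dependent additions on the longest path of a chain over n operands. */
unsigned chain_height(unsigned n) {
    assert(n >= 1);
    return n - 1;
}

/* Same for a balanced binary tree: ceil(log2(n)), computed as the bit length of n - 1. */
unsigned balanced_height(unsigned n) {
    assert(n >= 1);
    unsigned h = 0;
    for (unsigned m = n - 1; m > 0; m >>= 1)
        ++h;
    return h;
}
```

With eight operands, `chain_height(8)` is 7 and `balanced_height(8)` is 3, while `sum_chain` and `sum_balanced` return the same value for any input, including inputs that wrap around.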

The implemented algorithm is based on the following paper:

- Katherine Coons, Warren Hunt, Bertrand A. Maher, Doug Burger, Kathryn S. McKinley. Optimal Huffman Tree-Height Reduction for Instruction-Level Parallelism.
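
As a rough illustration of the Huffman-style idea, the sketch below repeatedly combines the two shallowest subtrees into a node whose height is the larger of the two plus one. It assumes unit latency for every operation and works only on heights, not on real instructions. The function name is hypothetical, and the weighting used in the paper and in this patch may differ from this simplification.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: returns the height of the tree obtained by greedily
   combining the two lowest subtrees until one remains.
   heights[i] is the height of the i-th operand (0 for a plain leaf).
   The array is overwritten. */
unsigned reduced_height(unsigned heights[], size_t n) {
    assert(n >= 1);
    while (n > 1) {
        /* Find the indices of the smallest (lo) and second smallest (lo2) heights. */
        size_t lo = 0, lo2 = 1;
        if (heights[lo2] < heights[lo]) {
            lo = 1;
            lo2 = 0;
        }
        for (size_t i = 2; i < n; ++i) {
            if (heights[i] < heights[lo]) {
                lo2 = lo;
                lo = i;
            } else if (heights[i] < heights[lo2]) {
                lo2 = i;
            }
        }
        /* heights[lo] <= heights[lo2], so max(...) + 1 is heights[lo2] + 1. */
        unsigned combined = heights[lo2] + 1;
        heights[lo] = combined;        /* the new node replaces one operand */
        heights[lo2] = heights[n - 1]; /* the last entry fills the other slot */
        --n;
    }
    return heights[0];
}
```

For eight plain leaves this yields height 3, matching the balanced form above. For three plain leaves and one operand that is already three operations deep, it yields 4, whereas a left-to-right chain starting from the deep operand would have height 6.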

Applicable conditions
---------------------

This patch adds the tree height reduction pass to the default optimization pipeline, but the pass is disabled by default. To enable it, pass the `-enable-int-thr` option for integer operations (`add`, `mul`, `and`, `or`, and `xor`) and the `-enable-fp-thr` option for floating-point operations (`fadd` and `fmul`). For floating-point operations, the `fadd`/`fmul` instructions must also carry the `reassoc` and `nsz` flags.

You can enable it via Clang with:

  clang -O1 -fassociative-math -fno-signed-zeros -mllvm -enable-int-thr -mllvm -enable-fp-thr

Or, simply:

  clang -Ofast -mllvm -enable-int-thr -mllvm -enable-fp-thr

The pass is applied only to operations in innermost loops.
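
The floating-point flags are required because floating-point addition is not associative, so regrouping can change the result. The sketch below shows this with a hypothetical helper. It assumes the file is compiled without fast-math options, and `volatile` is used only to force each result to be rounded to `double`.

```c
#include <stdbool.h>

/* Hypothetical helper: returns true when both groupings of a + b + c
   give exactly the same double. */
bool fadd_groupings_agree(double a, double b, double c) {
    volatile double left = (a + b) + c;  /* grouping of the original chain */
    volatile double right = a + (b + c); /* a regrouped form */
    return left == right;
}
```

`fadd_groupings_agree(1.0, 2.0, 3.0)` is true, but `fadd_groupings_agree(1e30, -1e30, 1.0)` is false: the first grouping gives 1.0, while in the second the 1.0 is absorbed when added to -1e30 and the result is 0.0.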

Performance
-----------

I ran the C/C++ benchmarks in SPECspeed 2017 on a Fujitsu A64FX processor, which has two pipelines each for integer operations and for SIMD/FP operations <https://github.com/fujitsu/A64FX/>. `600.perlbench_s` and `619.lbm_s` improved by 3%. The other benchmarks (602, 605, 620, 623, 625, 631, 641, 644, 657) were within 1% in either direction. To emphasize the performance improvement, the number of OpenMP threads was limited to one in these runs.

Relation to D67383
------------------

This patch is an updated version of D67383 <https://reviews.llvm.org/D67383>. Its author, @masakazu.ueno, was my colleague, and I took over his patch.

The following comments in D67383 <https://reviews.llvm.org/D67383> are addressed in this patch.

- Support bitwise instructions <https://reviews.llvm.org/D67383#1799213>
- Use llvm::stable_sort <https://reviews.llvm.org/D67383#1665319>
- Correct tests <https://reviews.llvm.org/D67383#1665887>
- Add TTI comment <https://reviews.llvm.org/D67383#1665580>
- Correct misleading comment <https://reviews.llvm.org/D67383#1812509>

This patch also has the following updates.

- Adapt to the latest `main` branch (new pass manager, opaque pointer, Apache license, ...)
- Remove the requirement of full fast-math flags
- Fix bugs
- Simplify the code
- etc.

Future work
-----------

Cost estimation is not implemented yet. As @hfinkel said <https://reviews.llvm.org/D67383#1665580>, this optimization has no positive effect if the target processor cannot exploit ILP. I want to add a cost estimation and enable this optimization automatically when it is profitable.
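
To illustrate why the benefit depends on the target, here is a toy model, not a proposal for the actual cost estimation. It assumes unit latency, ignores loads and stores, and lets a machine issue at most `width` independent operations per cycle. Both function names and the `width` parameter are hypothetical.

```c
#include <assert.h>

/* Cycles for the left-to-right chain over n operands: every operation
   waits for the previous one, so extra issue width does not help. */
unsigned chain_cycles(unsigned n, unsigned width) {
    assert(n >= 1 && width >= 1);
    (void)width;
    return n - 1;
}

/* Cycles for a freely reassociated reduction over n operands: each cycle
   combines as many pairs of available values as the issue width allows. */
unsigned tree_cycles(unsigned n, unsigned width) {
    assert(n >= 1 && width >= 1);
    unsigned values = n;
    unsigned cycles = 0;
    while (values > 1) {
        unsigned pairs = values / 2;
        unsigned ops = pairs < width ? pairs : width;
        values -= ops; /* each operation turns two values into one */
        ++cycles;
    }
    return cycles;
}
```

For eight operands, the model gives 7 cycles for the chain at any width. The reassociated form takes 7 cycles at width 1 (no gain), 4 cycles at width 2, and 3 cycles at width 4.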

I also want to add a Clang option to enable/disable it.

These will be addressed in another patch.

Discussion
----------

I would like comments on the following points.

1. This optimization is applied only to instructions in innermost loops because D67383 <https://reviews.llvm.org/D67383> was implemented that way. Should we expand it to instructions in all basic blocks?
2. Where is the best position for this pass in the default optimization pipeline? I put it after the loop unrolling pass so that, if a loop has a reduction, the reduction can also be optimized by this pass.
3. As explained in the future work above, I want to add a decision on whether to apply the optimization. What information should this decision be based on? The issue width in a machine model?

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D132828

Files:
  llvm/include/llvm/Transforms/Scalar/TreeHeightReduction.h
  llvm/lib/Passes/PassBuilder.cpp
  llvm/lib/Passes/PassBuilderPipelines.cpp
  llvm/lib/Passes/PassRegistry.def
  llvm/lib/Transforms/Scalar/CMakeLists.txt
  llvm/lib/Transforms/Scalar/TreeHeightReduction.cpp
  llvm/test/Other/new-pm-defaults.ll
  llvm/test/Other/new-pm-thinlto-defaults.ll
  llvm/test/Other/new-pm-thinlto-postlink-pgo-defaults.ll
  llvm/test/Other/new-pm-thinlto-postlink-samplepgo-defaults.ll
  llvm/test/Transforms/TreeHeightReduction/floating-point-add-only.ll
  llvm/test/Transforms/TreeHeightReduction/floating-point-add-with-constant.ll
  llvm/test/Transforms/TreeHeightReduction/floating-point-mult-only.ll
  llvm/test/Transforms/TreeHeightReduction/floating-point-sub-only.ll
  llvm/test/Transforms/TreeHeightReduction/fp16-add-with-constant.ll
  llvm/test/Transforms/TreeHeightReduction/fp16-add.ll
  llvm/test/Transforms/TreeHeightReduction/fp16-mult.ll
  llvm/test/Transforms/TreeHeightReduction/fp16-sub.ll
  llvm/test/Transforms/TreeHeightReduction/integer-add-only.ll
  llvm/test/Transforms/TreeHeightReduction/integer-add-with-constant.ll
  llvm/test/Transforms/TreeHeightReduction/integer-mult-only.ll
  llvm/test/Transforms/TreeHeightReduction/integer-sub-only.ll
  llvm/test/Transforms/TreeHeightReduction/leaf-num-check.ll
  llvm/test/Transforms/TreeHeightReduction/long-double-add-with-constant.ll
  llvm/test/Transforms/TreeHeightReduction/long-double-add.ll
  llvm/test/Transforms/TreeHeightReduction/long-double-mult.ll
  llvm/test/Transforms/TreeHeightReduction/long-double-sub.ll
