[PATCH] D91053: [PowerPC] Lump the constants to save one addis for each constant access

Mon Nov 9 01:00:26 PST 2020

steven.zhang created this revision.
steven.zhang added reviewers: nemanjai, MaskRay, stefanp, jsji, masoud.ataei, PowerPC.
Herald added subscribers: shchenz, kbarton, hiraditya, mgorny.
Herald added a project: LLVM.
steven.zhang requested review of this revision.

Currently, each constant is placed in the TOC, and every access to it needs addis/addi + load. For example:

  double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

And this is what we have now:

          addis 2, 12, .TOC.-.Lfunc_gep0@ha
          addi 2, 2, .TOC.-.Lfunc_gep0@l
  .Lfunc_lep0:
          .localentry     X, .Lfunc_lep0-.Lfunc_gep0
  # %bb.0:                                # %entry
          addis 3, 2, .LCPI0_0@toc@ha
          lfd 0, .LCPI0_0@toc@l(3)             #<-- addi is folded into lfd
          addis 3, 2, .LCPI0_1@toc@ha
          xsmuldp 0, 1, 0
          lfd 1, .LCPI0_1@toc@l(3)
          addis 3, 2, .LCPI0_2@toc@ha
          xsadddp 0, 0, 1
          lfd 1, .LCPI0_2@toc@l(3)
          addis 3, 2, .LCPI0_3@toc@ha
          xsmuldp 0, 0, 1
          lfd 1, .LCPI0_3@toc@l(3)
          xsadddp 1, 0, 1
          blr

This can be optimized by grouping all the constants together in a read-only data section, so that their relative positions are fixed, and then creating a single symbol in the TOC that points to that data section. The benefit is a smaller GOT and better performance, because one addis is saved for each constant access. It works like this:

          .section        .data.rel.ro,"aw",@progbits
          .p2align        3                               # -- Begin function X
  .LCPI0_0:
          .quad   0x402cc28f5c28f5c3              # double 14.380000000000001
          .quad   0x4002b851eb851eb8              # double 2.3399999999999999
          .quad   0x40120c49ba5e353f              # double 4.5119999999999996
          .quad   0x3ff3ae147ae147ae              # double 1.23
  .Lfunc_gep0:
          addis 2, 12, .TOC.-.Lfunc_gep0@ha
          addi 2, 2, .TOC.-.Lfunc_gep0@l
  .Lfunc_lep0:
          .localentry     X, .Lfunc_lep0-.Lfunc_gep0
  # %bb.0:                                # %entry
          addis 3, 2, .LC0@toc@ha
          ld 3, .LC0@toc@l(3)
          lfd 0, 24(3)
          xsmuldp 0, 1, 0
          lfd 1, 16(3)
          xsadddp 0, 0, 1
          lfd 1, 8(3)
          xsmuldp 0, 0, 1
          lfdx 1, 0, 3
          xsadddp 1, 0, 1
          blr

  .LC0:
          .tc .LCPI0_0[TC],.LCPI0_0
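
To make the saving concrete, here is a rough cost model for the two listings above. It is only a sketch and not code from the patch: the function names are hypothetical, and it assumes each constant is read once and that the pool base stays in a register for the whole function.

```cpp
// Hypothetical cost model (not from the patch): count the address-forming
// and load instructions needed to read N distinct constants once each.

// Current scheme: every constant has its own TOC-relative address,
// so each access costs addis + lfd.
constexpr unsigned perConstantAccessCost(unsigned NumConstants) {
  return 2 * NumConstants;
}

// Lumped scheme: one addis + ld fetches the pool base from the TOC,
// then each constant is a single lfd/lfdx at a fixed offset from the base.
constexpr unsigned lumpedAccessCost(unsigned NumConstants) {
  return NumConstants == 0 ? 0 : 2 + NumConstants;
}

// The four-constant example: 8 instructions before, 6 after.
static_assert(perConstantAccessCost(4) == 8, "4 x (addis + lfd)");
static_assert(lumpedAccessCost(4) == 6, "addis + ld + 4 loads");

// With a single constant the lumped scheme is one instruction longer.
static_assert(perConstantAccessCost(1) == 2, "addis + lfd");
static_assert(lumpedAccessCost(1) == 3, "addis + ld + lfd");
```

Under these assumptions the lumped form is ahead from three constants upward, is equal at two, and costs one extra instruction for a single constant.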

This optimization has been discussed before. See PowerPC/README.txt for more information.

  Lump the constant pool for each function into ONE pic object, and reference
  pieces of it as offsets from the start.  For functions like this (contrived
  to have lots of constants obviously):

  double X(double Y) { return (Y*1.23 + 4.512)*2.34 + 14.38; }

  We generate:

  _X:
          lis r2, ha16(.CPI_X_0)
          lfd f0, lo16(.CPI_X_0)(r2)
          lis r2, ha16(.CPI_X_1)
          lfd f2, lo16(.CPI_X_1)(r2)
          fmadd f0, f1, f0, f2
          lis r2, ha16(.CPI_X_2)
          lfd f1, lo16(.CPI_X_2)(r2)
          lis r2, ha16(.CPI_X_3)
          lfd f2, lo16(.CPI_X_3)(r2)
          fmadd f1, f0, f1, f2
          blr

  It would be better to materialize .CPI_X into a register, then use immediates
  off of the register to avoid the lis's.  This is even more important in PIC
  mode.

  Note that this (and the static variable version) is discussed here for GCC:
  http://gcc.gnu.org/ml/gcc-patches/2006-02/msg00133.html

  Here's another example (the sgn function):
  double testf(double a) {
         return a == 0.0 ? 0.0 : (a > 0.0 ? 1.0 : -1.0);
  }

  it produces a BB like this:
  LBB1_1: ; cond_true
          lis r2, ha16(LCPI1_0)
          lfs f0, lo16(LCPI1_0)(r2)
          lis r2, ha16(LCPI1_1)
          lis r3, ha16(LCPI1_2)
          lfs f2, lo16(LCPI1_2)(r3)
          lfs f3, lo16(LCPI1_1)(r2)
          fsub f0, f0, f1
          fsel f1, f0, f2, f3
          blr

Some limitations:

- If there is only one constant, this patch adds one extra load. That load could be optimized away by the linker if it merges the TOC. It is not easy to handle this inside the compiler, because ISel runs per basic block and we don't know whether there are other constants until the other BBs have been selected. Any thoughts?
- Only constants of the same type are lumped together. Technically speaking, all the constants could be lumped together as long as the alignment is handled carefully.
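
As an illustration of what handling the alignment carefully could mean for a mixed-type pool, here is a minimal sketch. It is not what the patch implements: PoolEntry, alignUp and layoutPool are hypothetical names, and the pool itself is assumed to be aligned to the largest entry alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one pool entry; not a type from the patch.
struct PoolEntry {
  uint64_t Size;   // size in bytes
  uint64_t Align;  // required alignment in bytes, must be a power of two
};

// Round Offset up to the next multiple of Align (a power of two).
constexpr uint64_t alignUp(uint64_t Offset, uint64_t Align) {
  return (Offset + Align - 1) & ~(Align - 1);
}

// Assign each entry a byte offset from the pool base, inserting padding
// so that every entry keeps its required alignment.
std::vector<uint64_t> layoutPool(const std::vector<PoolEntry> &Entries) {
  std::vector<uint64_t> Offsets;
  uint64_t Offset = 0;
  for (const PoolEntry &E : Entries) {
    assert(E.Align != 0 && (E.Align & (E.Align - 1)) == 0 &&
           "alignment must be a power of two");
    Offset = alignUp(Offset, E.Align);  // pad up to this entry's alignment
    Offsets.push_back(Offset);
    Offset += E.Size;
  }
  return Offsets;
}
```

For four doubles (size 8, alignment 8) this yields offsets 0, 8, 16 and 24, the same displacements used by the loads in the lumped listing. For a float, a double, a float and a 16-byte vector in that order it yields 0, 8, 16 and 32, with padding inserted before the double and the vector.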

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D91053

Files:
  llvm/lib/Target/PowerPC/CMakeLists.txt
  llvm/lib/Target/PowerPC/PPCAsmPrinter.cpp
  llvm/lib/Target/PowerPC/PPCConstantPoolValue.cpp
  llvm/lib/Target/PowerPC/PPCConstantPoolValue.h
  llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
  llvm/lib/Target/PowerPC/PPCISelLowering.cpp
  llvm/lib/Target/PowerPC/PPCISelLowering.h
  llvm/test/CodeGen/PowerPC/2012-09-16-TOC-entry-check.ll
  llvm/test/CodeGen/PowerPC/branch_coalesce.ll
  llvm/test/CodeGen/PowerPC/build-vector-allones.ll
  llvm/test/CodeGen/PowerPC/build-vector-tests.ll
  llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll
  llvm/test/CodeGen/PowerPC/combine-fneg.ll
  llvm/test/CodeGen/PowerPC/constant-pool.ll
  llvm/test/CodeGen/PowerPC/extract-and-store.ll
  llvm/test/CodeGen/PowerPC/f128-aggregates.ll
  llvm/test/CodeGen/PowerPC/f128-passByValue.ll
  llvm/test/CodeGen/PowerPC/float-logic-ops.ll
  llvm/test/CodeGen/PowerPC/fma-combine.ll
  llvm/test/CodeGen/PowerPC/fma-mutate.ll
  llvm/test/CodeGen/PowerPC/fmf-propagation.ll
  llvm/test/CodeGen/PowerPC/fp-strict-conv-f128.ll
  llvm/test/CodeGen/PowerPC/handle-f16-storage-type.ll
  llvm/test/CodeGen/PowerPC/load-shuffle-and-shuffle-store.ll
  llvm/test/CodeGen/PowerPC/mcm-12.ll
  llvm/test/CodeGen/PowerPC/mcm-4.ll
  llvm/test/CodeGen/PowerPC/mcm-obj-2.ll
  llvm/test/CodeGen/PowerPC/mcm-obj.ll
  llvm/test/CodeGen/PowerPC/nofpexcept.ll
  llvm/test/CodeGen/PowerPC/p10-splatImm-CPload-pcrel.ll
  llvm/test/CodeGen/PowerPC/p9-vinsert-vextract.ll
  llvm/test/CodeGen/PowerPC/ppcf128-constrained-fp-intrinsics.ll
  llvm/test/CodeGen/PowerPC/ppcf128-endian.ll
  llvm/test/CodeGen/PowerPC/pr25080.ll
  llvm/test/CodeGen/PowerPC/pr43976.ll
  llvm/test/CodeGen/PowerPC/pr45628.ll
  llvm/test/CodeGen/PowerPC/pr45709.ll
  llvm/test/CodeGen/PowerPC/pr47660.ll
  llvm/test/CodeGen/PowerPC/pr47891.ll
  llvm/test/CodeGen/PowerPC/pre-inc-disable.ll
  llvm/test/CodeGen/PowerPC/recipest.ll
  llvm/test/CodeGen/PowerPC/repeated-fp-divisors.ll
  llvm/test/CodeGen/PowerPC/sat-add.ll
  llvm/test/CodeGen/PowerPC/scalar_cmp.ll
  llvm/test/CodeGen/PowerPC/scalar_vector_test_4.ll
  llvm/test/CodeGen/PowerPC/select_const.ll
  llvm/test/CodeGen/PowerPC/signbit-shift.ll
  llvm/test/CodeGen/PowerPC/toc-float.ll
  llvm/test/CodeGen/PowerPC/vavg.ll
  llvm/test/CodeGen/PowerPC/vec-itofp.ll
  llvm/test/CodeGen/PowerPC/vec-trunc.ll
  llvm/test/CodeGen/PowerPC/vec-trunc2.ll
  llvm/test/CodeGen/PowerPC/vec_add_sub_doubleword.ll
  llvm/test/CodeGen/PowerPC/vec_add_sub_quadword.ll
  llvm/test/CodeGen/PowerPC/vec_conv_i16_to_fp32_elts.ll
  llvm/test/CodeGen/PowerPC/vec_conv_i16_to_fp64_elts.ll
  llvm/test/CodeGen/PowerPC/vec_conv_i8_to_fp32_elts.ll
  llvm/test/CodeGen/PowerPC/vec_conv_i8_to_fp64_elts.ll
  llvm/test/CodeGen/PowerPC/vector-constrained-fp-intrinsics.ll
  llvm/test/CodeGen/PowerPC/vector-extend-sign.ll
  llvm/test/CodeGen/PowerPC/vector-rotates.ll
  llvm/test/CodeGen/PowerPC/vperm-lowering.ll
  llvm/test/CodeGen/PowerPC/vselect-constants.ll
  llvm/test/CodeGen/PowerPC/vsx.ll