[PATCH] D115653: [DAG]Introduce llvm::processShuffleMasks and use it for shuffles in DAG Type Legalizer.

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 5 13:12:49 PDT 2022


ABataev added a comment.

In D115653#3401997 <https://reviews.llvm.org/D115653#3401997>, @RKSimon wrote:

> Thanks @ABataev this looks like its almost there - please can you do a run of the test suite to see if there are any noteworthy changes?

Made some changes in the analysis. I still need to improve the split process: instead of bisecting, implement immediate splitting to the register sizes. This should fix the regressions.
Some results follow. I did not gather perf changes because the system is too busy; these are code size changes instead.
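
To illustrate the difference between bisecting and splitting straight to register sizes, here is a minimal Python sketch. It is not the code from the patch and does not use the `llvm::processShuffleMasks` interface; the function names, the `UNDEF = -1` convention, and the example mask are all assumptions made for illustration. It only shows that a one-pass split yields the same register-sized pieces as repeated halving, and that each piece can then be inspected directly for the source registers it reads.

```python
from typing import List

UNDEF = -1  # assumed convention for an undefined mask element


def split_by_bisection(mask: List[int], reg_elts: int) -> List[List[int]]:
    """Halve the mask repeatedly until every piece fits in one register."""
    if len(mask) <= reg_elts:
        return [mask]
    half = len(mask) // 2
    return (split_by_bisection(mask[:half], reg_elts)
            + split_by_bisection(mask[half:], reg_elts))


def split_to_registers(mask: List[int], reg_elts: int) -> List[List[int]]:
    """Cut the mask into register-sized pieces in a single pass."""
    return [mask[i:i + reg_elts] for i in range(0, len(mask), reg_elts)]


def source_registers(piece: List[int], reg_elts: int) -> List[int]:
    """Indices of the source registers that one destination piece reads."""
    return sorted({idx // reg_elts for idx in piece if idx != UNDEF})


# Hypothetical 16-element mask over two 16-element inputs (indices 0..31),
# legalized to 4-element registers.
mask = [0, 1, 2, 3, 12, 13, 14, 15, UNDEF, 5, 6, 7, 8, 9, 20, 21]
pieces = split_to_registers(mask, 4)
sources = [source_registers(piece, 4) for piece in pieces]
```

In this example the first three pieces each read a single source register and only the last one needs two, which is the kind of information a direct split can expose without going through intermediate half-width shuffles.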

**-march=native (Skylake), -O3+LTO**

  Metric: size..text
  
  Program                                                                                                           results  results0 diff
                                                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test    45610    45866  0.6%
                                                         test-suite :: SingleSource/Benchmarks/SmallPT/smallpt.test     5907     5939  0.5%
                                            test-suite :: SingleSource/UnitTests/Vector/SSE/Vector-sse.stepfft.test     8166     8198  0.4%
                                      test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   201680   202256  0.3%
                                                   test-suite :: MultiSource/Benchmarks/McCat/12-IOtest/iotest.test     9155     9171  0.2%
                                                   test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   441951   442523  0.1%
                                                     test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589569   590305  0.1%
                                                        test-suite :: MultiSource/Applications/ClamAV/clamscan.test   585848   586296  0.1%
                                                        test-suite :: SingleSource/UnitTests/matrix-types-spec.test   220022   220118  0.0%
                                                      test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  1122554  1123018  0.0%
                                        test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    41667    41683  0.0%
                                               test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    41674    41690  0.0%
                                                       test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   405280   405360  0.0%
                                                       test-suite :: MultiSource/Applications/JM/lencod/lencod.test   901096   901256  0.0%
                                                          test-suite :: MultiSource/Applications/oggenc/oggenc.test   192635   192667  0.0%
                                             test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2116993  2117265  0.0%
                                                            test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   317150   317182  0.0%
                                                        test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   170427   170443  0.0%
                                              test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   236760   236776  0.0%
                                              test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1141289  1141353  0.0%
                                             test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1165233  1165297  0.0%
                                     test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   301095   301111  0.0%
                                          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1422763  1422811  0.0%
                                           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1422763  1422811  0.0%
                                     test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2129557  2129605  0.0%
                                      test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2129557  2129605  0.0%
                                           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12493380 12493572  0.0%
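
The diff column in these tables is consistent with the relative change of the `results0` column against the `results` column, rounded to one decimal place. A minimal sketch under that assumption (the helper names are hypothetical and this is not the comparison script itself):

```python
def relative_diff(lhs: float, rhs: float) -> float:
    """Relative change of rhs against lhs, e.g. 0.01 for a 1% increase."""
    return rhs / lhs - 1.0


def format_diff(lhs: float, rhs: float) -> str:
    """Format the relative change as a percentage with one decimal place."""
    return f"{relative_diff(lhs, rhs):.1%}"


# Values taken from the rows in this comment (results, results0).
comd = format_diff(45610, 45866)       # CoMD
x264 = format_diff(611440, 611408)     # 525.x264_r
stepfft = format_diff(3361, 3345)      # Vector-sse.stepfft, generic build
```

Very small negative changes round to `-0.0%`, which is why many rows show that value even though the absolute sizes differ by a few dozen bytes.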

Regressions:

   test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   611440   611408 -0.0%
  test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   611440   611408 -0.0%

There are some extra loads of the patterns; I do not expect significant perf regressions here.

                         test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   100502   100486 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt.test   107636   107604 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/ControlFlow-dbl/ControlFlow-dbl.test   107156   107124 -0.0%
            test-suite :: MultiSource/Benchmarks/TSVC/GlobalDataFlow-dbl/GlobalDataFlow-dbl.test   103066   103034 -0.0%
        test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl.test   102066   102034 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl.test   101516   101484 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-dbl/ControlLoops-dbl.test   100904   100872 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-dbl/Equivalencing-dbl.test   100613   100581 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-dbl/CrossingThresholds-dbl.test   100609   100577 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-dbl/LoopRestructuring-dbl.test   100480   100448 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Expansion-flt/Expansion-flt.test   100352   100320 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-dbl/InductionVariable-dbl.test   100344   100312 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt.test   100212   100180 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl.test    98220    98188 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-dbl/Symbolics-dbl.test    98136    98104 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-dbl/IndirectAddressing-dbl.test    98136    98104 -0.0%
                          test-suite :: MultiSource/Benchmarks/TSVC/Packing-dbl/Packing-dbl.test    97784    97752 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl.test    97448    97416 -0.0%
        test-suite :: MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt.test    96918    96886 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-dbl/Recurrences-dbl.test    96700    96668 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-dbl/StatementReordering-dbl.test    96428    96396 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/CrossingThresholds-flt/CrossingThresholds-flt.test    96229    96197 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl.test    95896    95864 -0.0%
      test-suite :: MultiSource/Benchmarks/TSVC/InductionVariable-flt/InductionVariable-flt.test    94844    94812 -0.0%
                test-suite :: MultiSource/Benchmarks/TSVC/ControlLoops-flt/ControlLoops-flt.test    94396    94364 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/Equivalencing-flt/Equivalencing-flt.test    94169    94137 -0.0%
              test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test    91872    91840 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Symbolics-flt/Symbolics-flt.test    91692    91660 -0.0%
    test-suite :: MultiSource/Benchmarks/TSVC/IndirectAddressing-flt/IndirectAddressing-flt.test    91564    91532 -0.0%
                          test-suite :: MultiSource/Benchmarks/TSVC/Packing-flt/Packing-flt.test    91324    91292 -0.0%
                  test-suite :: MultiSource/Benchmarks/TSVC/Recurrences-flt/Recurrences-flt.test    90208    90176 -0.0%
  test-suite :: MultiSource/Benchmarks/TSVC/StatementReordering-flt/StatementReordering-flt.test    90048    90016 -0.0%
                      test-suite :: MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt.test    89390    89358 -0.0%
                                         test-suite :: MultiSource/Benchmarks/nbench/nbench.test    75643    75611 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Reductions-dbl/Reductions-dbl.test    99871    99823 -0.0%
                    test-suite :: MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.test    93429    93381 -0.1%
              test-suite :: MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt.test    91676    91612 -0.1%

TSVC - the inner loop body has fewer instructions; the preheader is larger because the loads of some patterns were moved there.
nbench - no significant changes.
jpeg/jpeg-6a - no significant changes.

**Generic, -O3+LTO**

  Metric: size..text
                                            test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test    88982    89094  0.1%
                                           test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1332828  1333276  0.0%
                                          test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1332828  1333276  0.0%
                                                     test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   502161   502177  0.0%
                                           test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12015748 12015828  0.0%
                                                         test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   154226   154226  0.0%

The perf changes are improvements of about 1-2% (433.milc shows ~10%, but I would not rely on it).

Regressions:

                        test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   348030   347966 -0.0%
  test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   102840   102632 -0.2%
        test-suite :: SingleSource/UnitTests/Vector/SSE/Vector-sse.stepfft.test     3361     3345 -0.5%

Bullet - 3 extra instructions in the loop; the final lowering needs to be improved, but this should not affect performance too much (2 unpckl are replaced by 4 shufps and 1 movq).
consumer-jpeg - no significant changes actually, just a changed order of shuffles after combining extractvector instructions.
Vector-sse.stepfft - 2 unpck are replaced by 4 shufps; no perf changes expected.

The updated patch will be uploaded later today.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115653/new/

https://reviews.llvm.org/D115653


