[llvm-dev] [RFC] Switching to MemorySSA-backed Dead Store Elimination (aka cross-bb DSE)

Tue Aug 18 06:35:56 PDT 2020

Hi,

Over the past six months, a MemorySSA-backed DSE implementation has been added to LLVM and it now covers almost all cases the existing DSE implementation does, plus adding a major new capability: eliminating stores across basic blocks. Thanks everyone involved with reviews, testing & patches!

I think now would be a good time to start working towards switching to use MemorySSA-backed DSE as default.

TLDR:
MemorySSA DSE can substantially increase the number of stores eliminated in many cases (e.g. across MultiSource/SPEC2000/SPEC2006 with -O3 -flto, the number of eliminated stores increases roughly by 45%, with some benchmarks seeing much larger improvements). But the improvements come at additional compile-time cost (CTmark geomean -O3 +0.86%, ReleaseThinLTO +1.59%), due to a substantially increased search space. I’d like to propose to switch to MemorySSA-backed DSE, as in my opinion the benefits in terms of store reductions outweigh the compile-time impact).

Details:

First, lets take a look at the benefits:

1. More dead stores eliminated.

The current DSE implementation is mostly limited to eliminating stores in a single BB. MemorySSA-backed DSE can eliminate stores across different basic blocks. This results in large improvements in some benchmarks. Below some numbers for MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto:

                                                                                               Legacy    MSSA   
Number of dead stores removed across all benchmarks:       17919     26120        ~  +45.76%
Total number of stores remaining after DSE:                       1424860    1409300    ~  -1.1%

Some notable changes (excluding benchmarks with small absolute values of NumFastStores) below. Note that roughly half of the benchmarks have binary changes.

Same hash: 114 (filtered out)
Remaining: 123
Metric: dse.NumFastStores

Program                                                               legacy     mssa.           diff
 test-suite...-typeset/consumer-typeset.test          186.00     1815.00      875.8%
 test-suite...lications/sqlite3/sqlite3.test                   29.00      167.00      475.9%
 test-suite...T2006/445.gobmk/445.gobmk.test       19.00        88.00      363.2%
 test-suite.../Applications/SPASS/SPASS.test         49.00      155.00      216.3%
 test-suite...lications/ClamAV/clamscan.test            72.00      227.00      215.3%
 test-suite.../Benchmarks/nbench/nbench.test        30.00        92.00      206.7%
 test-suite...T2000/256.bzip2/256.bzip2.test           31.00        85.00      174.2%
 test-suite...chmarks/MallocBench/gs/gs.test          57.00      133.00      133.3%
 test-suite...000/255.vortex/255.vortex.test             95.00      200.00      110.5%
 test-suite...CFP2006/433.milc/433.milc.test           64.00      134.00      109.4%
 test-suite...ications/JM/ldecod/ldecod.test             258.00     514.00       99.2%
 test-suite.../Trimaran/enc-pc1/enc-pc1.test             87.00     168.00       93.1%
 test-suite...ce/Benchmarks/PAQ8p/paq8p.test       35.00        66.00       88.6%
 test-suite...INT2000/164.gzip/164.gzip.test            23.00        43.00       87.0%
 test-suite...CFP2000/188.ammp/188.ammp.test    85.00      157.00       84.7%
 test-suite...pplications/oggenc/oggenc.test             63.00      110.00       74.6%
 test-suite...T2006/456.hmmer/456.hmmer.test       24.00        41.00       70.8%
 test-suite.../Benchmarks/Bullet/bullet.test             799.00    1351.00       69.1%
 test-suite.../CINT2000/176.gcc/176.gcc.test         412.00      668.00       62.1%
 test-suite...nsumer-lame/consumer-lame.test       111.00      175.00       57.7%
 test-suite...marks/7zip/7zip-benchmark.test        1069.00    1683.00       57.4%
 test-suite...006/447.dealII/447.dealII.test            1715.00     2689.00       56.8%

There are a few existing missed optimization issues MemorySSA-backed DSE addresses, e.g.
https://bugs.llvm.org/show_bug.cgi?id=46847
https://bugs.llvm.org/show_bug.cgi?id=40527

And some that should be relatively straight-forward to address with the new implementation:
https://bugs.llvm.org/show_bug.cgi?id=16520
https://bugs.llvm.org/show_bug.cgi?id=10292

2. Major new users for MemorySSA means more testing.

MemorySSA-backed DSE makes extensive use of MemorySSA which increases the test-coverage MemorySSA gets. It already surfaced one issue, that has since been fixed. More users also likely means more eyes on tuning MemorySSA.

3. Eliminating some long-standing MemDepAnalysis related issue.

MemoryDependenceAnalysis used by the current DSE implementation makes heavy use of caching and is relatively fragile with respect to cache invalidation. I think there are at least a few long-standing known issues in that area, e.g. https://bugs.llvm.org/show_bug.cgi?id=28014

4. Laying ground-work for further optimizations, like cross-bb memcpy opts	

The current memcpy optimizers has the same BB-local limitation as the existing DSE. MemorySSA-backed DSE should add most helpers required to port MemCpyOpt to work across basic blocks. Ideally  we would do both optimizations as part of the same memorySSA walk, saving some compile-time.

Now, lets take a look at the costs:

1. Increase in compile-time

The current DSE implementation relies heavily on caching of info about memory dependences, relying on the BB-only nature to keep compile-time in check. The cross-bb nature of the MemorySSA-backed DSE implementation drastically increase the search space, thus leading to an increase in compile-time. Here are some numbers for CTMark. Note that the numbers are from a branch with additional compile-time tuning.

Geomean changes (see link for full details)

                                      Executed Instrs.   Size (text)
O3                                         +0.86%         -0.29%
ReleaseThinLTO                   +1.59%         -0.38%
ReleaseLTO-g:                      +1.18%         -0.35%
ReleaseThinLTO (link-only)  +1.68%          -0.38%
ReleaseLTO-g (link only):     +1.36%          -0.35%

http://llvm-compile-time-tracker.com/compare.php?from=4cc20aa74339812c3e4daa4138ed40cb98521bfc&to=504153dc6efd29ad37ffb6ffb2e80d4101b56e5b&stat=instructions

Note that both the current and MemorySSA-backed DSE implementations rely on various thresholds to avoid excessive compile-times. There are a few knobs to turn to further reduce compile-time of the MemorySSA-backed DSE implementation, at the cost of missing to eliminate some stores.The current settings are chose so the compile-time difference is limited without limiting optimizations too much.

Another thing to note is that the current DSE implementation has seen compile-time tuning over quite a few years now, while the amount of tuning done for the MemorySSA-backed one is relatively small, so I am optimistic that we will see improvements in compile-time in the future.

I took a look at some of the worst regressions ( tramp3d-v4, 7zip) and MemorySSA-backed DSE takes roughly 3x more time than the current DSE, taking DSE from the upper half of passes ordered by time to the top 5, still noticeably quicker than heavy hitters like GVN, ISEL and roughly the same as the most expensive instcombine run.

There are also cases where MemorySSA-backed DSE greatly out-performs legacy DSE: https://bugs.llvm.org/show_bug.cgi?id=37588

Given the above, I think there are a few options

a) Switch to MemorySSA-backed DSE in all configurations
b) Switch to memorySSA-backed DSE, but use more aggressive limits. For some configurations (e.g. limit optimizations for -O2/-Os)
c) Switch to MemorySSA-backed DSE in some configurations (e.g. -O3/-Oz only)
d) Do not switch.

I’d like to propose opting for a), because I think the additional optimizations warrant the increase in compile-time and I am optimistic we see further tuning once it gets used heavily and that it will enable additional optimizations.

I think b) would also be an improvement over the current status, but we’d have to balance to limits so we still get substantially enough improvements over the current DSE.

I don’t think c) is great, as we would have to maintain 2 versions of DSE, with both of them seeing half the testing.

d) is also not great, as we are currently are failing to optimize a substantial number of dead stores and are missing future optimizations made possible by the implementation.

If there are no objections to switching, I’ll work on getting the remaining compile-time related patches in ASAP and flip the switch. That should give us enough time to address issues in time for the next release and fine tune compile-time.

I’m looking forward to your thoughts!

Cheers,
Florian