[llvm] [SROA] Use SmallPtrSet for PromotableAllocas (PR #105809)

Fri Aug 23 02:56:56 PDT 2024

https://github.com/b-chmiel created https://github.com/llvm/llvm-project/pull/105809

When compiling large SystemVerilog designs transpiled by https://github.com/verilator/verilator, `clang` compilation hangs during the SROA phase.

The [PromotableAllocas](https://github.com/llvm/llvm-project/blob/57dc09341e5eef758b1abce78822c51069157869/llvm/lib/Transforms/Scalar/SROA.cpp#L201) field is represented as a `std::vector`. In our case it holds close to 500 000 000 elements, which makes the [random search-and-delete](https://github.com/llvm/llvm-project/blob/57dc09341e5eef758b1abce78822c51069157869/llvm/lib/Transforms/Scalar/SROA.cpp#L5615) on a `std::vector` inefficient.

Assuming that PromotableAllocas contains only unique raw pointers to allocas, SmallPtrSet may be used for storing them.
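
To illustrate the difference, here is a minimal standalone sketch (not the SROA code itself). `Alloca` is a placeholder for `AllocaInst`, and `std::unordered_set` stands in for both `SmallPtrSet` and the deleted-alloca set; these substitutions are assumptions made so the example builds without LLVM headers.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Placeholder for llvm::AllocaInst; only pointer identity matters here.
struct Alloca {};

// Vector variant: every call scans all of `promotable`, so the cost is
// proportional to promotable.size() even when `deleted` holds one pointer.
inline void eraseFromVector(std::vector<Alloca *> &promotable,
                            const std::unordered_set<Alloca *> &deleted) {
  promotable.erase(
      std::remove_if(promotable.begin(), promotable.end(),
                     [&](Alloca *a) { return deleted.count(a) != 0; }),
      promotable.end());
}

// Set variant: one hashed erase per deleted pointer, so the cost tracks
// deleted.size() rather than promotable.size().
inline void eraseFromSet(std::unordered_set<Alloca *> &promotable,
                         const std::unordered_set<Alloca *> &deleted) {
  for (Alloca *a : deleted)
    promotable.erase(a);
}
```

Both functions leave the same pointers behind. The set variant only works because each pointer is stored at most once, and unlike a vector it does not preserve insertion order.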

Note: I'm creating a `std::vector` from the SmallPtrSet in `SROA::promoteAllocas` to match the signature of `PromoteMem2Reg`. However, the `PromoteMem2Reg` constructor makes yet another copy of this structure (using its begin/end iterators). Do you think there is a better way to optimize this? For example, using `std::move` and adjusting the `PromoteMem2Reg` interface?
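
A rough sketch of what the `std::move` idea could look like, assuming the interface were changed to take the vector by value. `Promoter` and `makePromoter` are hypothetical names, `Alloca` is a placeholder for `AllocaInst`, and `std::unordered_set` stands in for `SmallPtrSet`; none of this is the existing `PromoteMem2Reg` API.

```cpp
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Placeholder for llvm::AllocaInst.
struct Alloca {};

// Hypothetical promoter that takes ownership of its input vector.
class Promoter {
public:
  explicit Promoter(std::vector<Alloca *> allocas)
      : allocas_(std::move(allocas)) {}

  std::size_t size() const { return allocas_.size(); }
  Alloca *const *data() const { return allocas_.data(); }

private:
  std::vector<Alloca *> allocas_;
};

inline Promoter makePromoter(const std::unordered_set<Alloca *> &promotable) {
  // The only element-wise copy: set -> vector.
  std::vector<Alloca *> flat(promotable.begin(), promotable.end());
  // Moving hands the buffer over instead of copying it a second time.
  return Promoter(std::move(flat));
}
```

With this shape the set-to-vector conversion remains, but the second copy inside the constructor becomes a buffer handover.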

@chandlerc 

## Benchmarks

### Internal benchmark

The base version timed out after 9 hours; the improved version finished in 37 minutes.

### Minimal example benchmark

This test mimics our internal benchmark: it creates a large number of allocas that SROA has to consider.
Over 10 test runs, the improved version was 6% faster than the base one.


gen.cpp - generates a test file
```cpp
#include <fstream>
int main() {
  constexpr int fields = 100000;
  std::ofstream of{"out.cpp"};

  of << "#include <random>\n";

  of << " struct VlWide final {\n";
  of << "\tstd::uint32_t m_storage[5];\n";
  of << "};\n";

  of << "int main() {\n";
  of << "\tunsigned int rnd = rand();\n";
  for (auto i = 0; i < fields; ++i)
    of << "\tVlWide tmp_" << i << "{rnd};\n";
  of << "\treturn 0;\n";
  of << "}\n";
  return 0;
}
```

Makefile - compiles both the generator and the generated file
```
.PHONY: clean

out.o: out.cpp
	$(CXX) -c -O1 -emit-llvm -mllvm -stats -o $@ $<

out.cpp: gen.o
	./gen.o

gen.o: gen.cpp
	$(CXX) -O3 -o $@ $<

clean:
	- rm *.o out.cpp
```

Run with `make`.


### llvm-test-suite

CTMark `compile_time` results over 100 runs for the base (commit https://github.com/llvm/llvm-project/commit/b05c55472bf7cadcd0e4cb1a669b3474695b0524) and improved `clang` versions:

```
Program                                       compile_time
                                               base         improved diff
tramp3d-v4/tramp3d-v4                           6.13         6.28    2.4%
sqlite3/sqlite3                                 1.21         1.24    2.1%
lencod/lencod                                   4.43         4.45    0.5%
SPASS/SPASS                                     5.84         5.87    0.4%
Bullet/bullet                                  27.64        27.58   -0.2%
consumer-typeset/consumer-typeset               4.37         4.34   -0.8%
7zip/7zip-benchmark                            71.99        71.43   -0.8%
ClamAV/clamscan                                 5.42         5.38   -0.8%
kimwitu++/kc                                   11.64        11.52   -1.0%
mafft/pairlocalalign                            2.31         2.27   -1.6%
Geomean difference                                                   0.0%

compile_time
l/r         base    improved       diff
count  10.000000   10.000000  10.000000
mean   14.099270   14.034550   0.000112
std    21.708482   21.533859   0.013314
min     1.211500    1.236500  -0.016101
25%     4.384250    4.364425  -0.007946
50%     5.632550    5.621750  -0.004951
75%    10.262475   10.211450   0.004309
max    71.992200   71.425300   0.024213
```
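
For reference, a geomean difference like the one reported above can be reproduced from the per-program times with a few lines. This is only a sketch: it assumes the figure is the geometric mean of the improved/base ratios, and `geomeanRatio` is a name made up for this example, not part of the test-suite tooling.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of improved[i] / base[i], computed in log space.
// Assumes both vectors are non-empty, equally long, and strictly positive.
inline double geomeanRatio(const std::vector<double> &base,
                           const std::vector<double> &improved) {
  double logSum = 0.0;
  for (std::size_t i = 0; i < base.size(); ++i)
    logSum += std::log(improved[i] / base[i]);
  return std::exp(logSum / static_cast<double>(base.size()));
}
```

A result of 1.0 means no change overall; values above 1.0 mean the improved build compiled more slowly on average.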

From 83e73d70c16839869936c50842c8aeac680ec836 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bart=C5=82omiej=20Chmiel?= <bchmiel at antmicro.com>
Date: Thu, 22 Aug 2024 14:56:21 +0200
Subject: [PATCH] [SROA] Use SmallPtrSet for PromotableAllocas
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Optimize the SROA pass for a large number of allocas by
speeding up the PromotableAllocas erase operation. The optimization
uses SmallPtrSet, which is efficient here because
PromotableAllocas only ever holds unique pointers.

Signed-off-by: Bartłomiej Chmiel <bchmiel at antmicro.com>
---
 llvm/lib/Transforms/Scalar/SROA.cpp | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/llvm/lib/Transforms/Scalar/SROA.cpp b/llvm/lib/Transforms/Scalar/SROA.cpp
index 26b62cb79cdedf..e13dfed5adb458 100644
--- a/llvm/lib/Transforms/Scalar/SROA.cpp
+++ b/llvm/lib/Transforms/Scalar/SROA.cpp
@@ -198,7 +198,7 @@ class SROA {
   SmallSetVector<AllocaInst *, 16> PostPromotionWorklist;
 
   /// A collection of alloca instructions we can directly promote.
-  std::vector<AllocaInst *> PromotableAllocas;
+  SmallPtrSet<AllocaInst *, 16> PromotableAllocas;
 
   /// A worklist of PHIs to speculate prior to promoting allocas.
   ///
@@ -4769,9 +4769,8 @@ bool SROA::presplitLoadsAndStores(AllocaInst &AI, AllocaSlices &AS) {
 
   // Finally, don't try to promote any allocas that new require re-splitting.
   // They have already been added to the worklist above.
-  llvm::erase_if(PromotableAllocas, [&](AllocaInst *AI) {
-    return ResplitPromotableAllocas.count(AI);
-  });
+  for (auto *RPA : ResplitPromotableAllocas)
+    PromotableAllocas.erase(RPA);
 
   return true;
 }
@@ -4933,7 +4932,7 @@ AllocaInst *SROA::rewritePartition(AllocaInst &AI, AllocaSlices &AS,
     }
     if (PHIUsers.empty() && SelectUsers.empty()) {
       // Promote the alloca.
-      PromotableAllocas.push_back(NewAI);
+      PromotableAllocas.insert(NewAI);
     } else {
       // If we have either PHIs or Selects to speculate, add them to those
       // worklists and re-queue the new alloca so that we promote in on the
@@ -5568,7 +5567,9 @@ bool SROA::promoteAllocas(Function &F) {
     LLVM_DEBUG(dbgs() << "Not promoting allocas with mem2reg!\n");
   } else {
     LLVM_DEBUG(dbgs() << "Promoting allocas with mem2reg...\n");
-    PromoteMemToReg(PromotableAllocas, DTU->getDomTree(), AC);
+    PromoteMemToReg(
+        std::vector(PromotableAllocas.begin(), PromotableAllocas.end()),
+        DTU->getDomTree(), AC);
   }
 
   PromotableAllocas.clear();
@@ -5585,7 +5586,7 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
     if (AllocaInst *AI = dyn_cast<AllocaInst>(I)) {
       if (DL.getTypeAllocSize(AI->getAllocatedType()).isScalable() &&
           isAllocaPromotable(AI))
-        PromotableAllocas.push_back(AI);
+        PromotableAllocas.insert(AI);
       else
         Worklist.insert(AI);
     }
@@ -5609,10 +5610,10 @@ std::pair<bool /*Changed*/, bool /*CFGChanged*/> SROA::runSROA(Function &F) {
       // Remove the deleted allocas from various lists so that we don't try to
       // continue processing them.
       if (!DeletedAllocas.empty()) {
-        auto IsInSet = [&](AllocaInst *AI) { return DeletedAllocas.count(AI); };
-        Worklist.remove_if(IsInSet);
-        PostPromotionWorklist.remove_if(IsInSet);
-        llvm::erase_if(PromotableAllocas, IsInSet);
+        Worklist.set_subtract(DeletedAllocas);
+        PostPromotionWorklist.set_subtract(DeletedAllocas);
+        for (auto *DA : DeletedAllocas)
+          PromotableAllocas.erase(DA);
         DeletedAllocas.clear();
       }
     }