[llvm] [CodeLayout] cache-directed sort: limit max chain size (PR #69039)

Mon Oct 16 17:36:57 PDT 2023

MaskRay wrote:

Thank you for the responses! 

> Can we apply the density-based condition, as in https://github.com/llvm/llvm-project/pull/68617? Besides making the code faster, it often helps for quality too. I'd keep the max-chain-size to be 512, if possible (assuming in conjunction with the density-based check the runtime is low).

I am curious whether `cdsort-max-chain-size=128` or 512 has a noticeable performance regression for your applications.
I am nervous about 512, as it is too slow to justify making it the default...

I am happy to bootstrap Clang and check Clang compilation performance if you give me a script for accessible workloads (building llvm-project or the Linux kernel, for example).

```
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort --threads=8
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=hfsort  14.86s user 3.84s system 501% cpu 3.732 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.71s user 3.94s system 360% cpu 5.735 total
% time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=512
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  20.15s user 3.95s system 257% cpu 9.347 total
```

Adding something like `MaxMergeDensityRatio` from #68617 has very little effect. If I set the pruning threshold to 1.5:
```
--- i/llvm/lib/Transforms/Utils/CodeLayout.cpp
+++ w/llvm/lib/Transforms/Utils/CodeLayout.cpp
@@ -1167,6 +1167,9 @@ private:
         if (Edge->srcChain()->numBlocks() + Edge->dstChain()->numBlocks() >
             CDMaxChainSize)
           continue;
+        auto [mn, mx] = std::minmax(Edge->srcChain()->numBlocks(), Edge->dstChain()->numBlocks());
+        if (mx/mn > 1.5)
+          continue;

         // Compute the gain of merging the two chains.
         MergeGainT Gain = getBestMergeGain(Edge);
```

I get this (nearly no effect):
```
% repeat 2 time /tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort --threads=8 -mllvm -cdsort-max-chain-size=128 # if (mx/mn > 1.5) continue
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.83s user 3.91s system 355% cpu 5.843 total
/tmp/out/custom-gcc/bin/ld.lld @response.txt --call-graph-profile-sort=cdsort  16.77s user 4.01s system 356% cpu 5.826 total
```
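
One caveat about the experimental check above: assuming `numBlocks()` returns an integer type, `mx/mn` is integer division, so `mx/mn > 1.5` only fires once the larger chain is at least twice the size of the smaller one. The sketch below contrasts the two forms. Both helper names are hypothetical and are not part of `CodeLayout.cpp`; this compares chain sizes only and is not the density-based condition from #68617.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper (not in CodeLayout.cpp): true when the larger chain has
// more than MaxRatio times as many blocks as the smaller one. Multiplying
// instead of dividing avoids integer truncation.
inline bool exceedsSizeRatio(std::size_t A, std::size_t B, double MaxRatio) {
  auto [Mn, Mx] = std::minmax(A, B);
  return static_cast<double>(Mx) > MaxRatio * static_cast<double>(Mn);
}

// The form used in the experiment, for comparison. Mx / Mn truncates, so with
// a 1.5 threshold this is equivalent to Mx >= 2 * Mn. Assumes both chains are
// non-empty (Mn > 0).
inline bool exceedsSizeRatioTruncating(std::size_t A, std::size_t B) {
  auto [Mn, Mx] = std::minmax(A, B);
  return Mx / Mn > 1.5;
}
```

For chains of 10 and 17 blocks, the first helper with `MaxRatio = 1.5` prunes the merge, while the truncating form does not. Whether the stricter check changes the timings above has not been measured here.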

> Thanks for experimenting with my PR. I have another optimization too that I will try myself. It makes sense to set a limit. I was just curious what would be the perf. gain that we drop if we set it to 128. (Any ideas @spupyrev ?)

Without applying this `cdsort-max-chain-size=128` patch, using `DenseMap` instead of `std::vector` for `ChainEdges` (https://github.com/rlavaee/llvm-project/tree/improve-cdsort) makes the link 10+ seconds faster, out of a total link time of 9 min...
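
To illustrate why the container choice can matter, here is a minimal sketch of the two lookup strategies. It is not the code from the linked branch: `Chain`, `Edge`, `VectorEdges`, and `MapEdges` are hypothetical stand-ins, and `std::unordered_map` is used in place of `llvm::DenseMap` so the example compiles without LLVM headers.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the chain and chain-edge types.
struct Chain { int Id; };
struct Edge { int Weight; };

// Vector-backed adjacency: finding the edge to a given neighbour is a linear
// scan, which gets expensive for chains with many neighbours.
struct VectorEdges {
  std::vector<std::pair<const Chain *, Edge *>> Edges;
  void add(const Chain *Other, Edge *E) { Edges.emplace_back(Other, E); }
  Edge *find(const Chain *Other) const {
    for (const auto &[C, E] : Edges)
      if (C == Other)
        return E;
    return nullptr;
  }
};

// Hash-map-backed adjacency: lookup is expected constant time, at the cost of
// extra memory and an unspecified iteration order.
struct MapEdges {
  std::unordered_map<const Chain *, Edge *> Edges;
  void add(const Chain *Other, Edge *E) { Edges[Other] = E; }
  Edge *find(const Chain *Other) const {
    auto It = Edges.find(Other);
    return It == Edges.end() ? nullptr : It->second;
  }
};
```

Note that iteration order over a pointer-keyed hash map is not deterministic across runs, so any code that walks the edges and affects the final layout would need a stable order from elsewhere.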


https://github.com/llvm/llvm-project/pull/69039