[PATCH] D113424: ext-tsp basic block layout
Rahman Lavaee via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Nov 18 19:06:14 PST 2021
rahmanl added a comment.
In D113424#3135015 <https://reviews.llvm.org/D113424#3135015>, @spupyrev wrote:
>> Re memory utilization:
>
> I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.
>
> To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.
Please see my inline comments. Since we keep all the Chains in the vector (they're not destructed), space usage is currently super-linear. We have a choice of deallocating the space (using shrink_to_fit, resize, or swap) and that's a time-space tradeoff. I am not sure which one is better, since this is already super-linear time, it doesn't hurt to make those calls and keep the space linear. Also, you can use `llc` on a large function to compare memory usage with and without `-enable-ext-tsp-block-placement`. It might very well be the case that the memory consumption remains the same. In that case, maybe you don't want to free up the storage either.
>> Re perf tests on IR/CSIR:
>
> Here are my measurements for clang-10 binary with different PGO modes. Of course we should keep in mind that the numbers won't fully generalize to other binaries/benchmarks and only provide one data point.
>
> Comparing clang-10 using 200 iterations on input1.cpp
>
> [test] -mllvm -enable-ext-tsp-block-placement=1
> [control] -mllvm -enable-ext-tsp-block-placement=0
> -DLLVM_BUILD_INSTRUMENTED=FE
> task-clock : 5942 (+- 0.19%) vs 6027 (+- 0.19%) [diff = -1.4170%]
> binary_size : 3316628472 vs 3325095000 [diff = -0.2546%]
> -DLLVM_BUILD_INSTRUMENTED=IR
> task-clock : 5793 (+- 0.20%) vs 5876 (+- 0.20%) [diff = -1.4050%]
> binary_size : 3305880848 vs 3325234728 [diff = -0.5820%]
> -DLLVM_BUILD_INSTRUMENTED=CSIR
> task-clock : 5705 (+- 0.26%) vs 5757 (+- 0.26%) [diff = -0.8985%]
> binary_size : 3312445120 vs 3329071760 [diff = -0.4994%]
> CSSPGO
> task-clock : 6110 (+- 0.37%) vs 6226 (+- 0.37%) [diff = -1.8617%]
> binary_size : 3995536616 vs 3980411480 [diff = +0.3799%]
> AutoFDO
> task-clock : 5711 (+- 0.19%) vs 5797 (+- 0.19%) [diff = -1.4882%]
> binary_size : 3707959712 vs 3694537688 [diff = +0.3632%]
Thanks for sharing the detailed results.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:262-263
+ void clear() {
+ Blocks.clear();
+ Edges.clear();
+ }
----------------
Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:304
+ Jumps.insert(Jumps.end(), Other->Jumps.begin(), Other->Jumps.end());
+ Other->Jumps.clear();
+ }
----------------
Same comment. Storage won't be released.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:659
+ /// element being the corresponding merging type.
+ MergeGainTy mergeGain(Chain *ChainPred, Chain *ChainSucc,
+ ChainEdge *Edge) const {
----------------
Please rename this to GetBestMergeGain for consistency with the function comment and to further separate it from computeMergeGain.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:127
+ bool operator<(const MergeGainTy &Other) const {
+ return (Other.Score > EPS && Other.Score > Score + EPS);
+ }
----------------
This can be removed if we initialize with zero.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:137
+private:
+ double Score{-1.0};
+ size_t MergeOffset{0};
----------------
Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:445
+private:
+ /// Initialize algorithm's data structures.
+ void initialize(const NodeSizeMap &NodeSizes, const NodeCountMap &NodeCounts,
----------------
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:838
+ [&](const Chain *C1, const Chain *C2) {
+ // Original entry point to the front
+ if (C1->isEntry() != C2->isEntry()) {
----------------
Change comment to clearly mention this is for putting the entry block at the beginning.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:840-843
+ if (C1->isEntry())
+ return true;
+ if (C2->isEntry())
+ return false;
----------------
We can simply use `return C1->isEntry()`?
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:847-852
+ const double D2 = ChainDensity[C2];
+ if (D1 != D2)
+ return D1 > D2;
+
+ // Making the order deterministic
+ return C1->id() < C2->id();
----------------
How about shortening this to `return (D1 != D2) ? (D1 > D2) : (C1->id() < C2->id())`?
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:634
+
+ // Attach (a part of) ChainPred before the first block of ChainSucc
+ for (auto &Jump : ChainSucc->blocks().front()->InJumps) {
----------------
spupyrev wrote:
> rahmanl wrote:
> > I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc and this is independent from the size of the ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit /overhead in a better way.
> I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. Not sure if the separation would noticeable simplify reviewing.
>
> I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they have been tuned a few years ago on a couple of workloads running on a particular hardware. I would really appreciate the community help.
>
> For now, I'd however keep the params and optimizations as is, so that the existing customers do not see sudden regressions.
Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero.
One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like `PredChain->blocks[Offset]->ForcedSucc != nullptr`) and calls `computeMergeGain` for each merge type. I think we can make this function more readable and concise as well.
================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:683
+ // Try to split the chain in different ways
+ Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,
+ MergeTypeTy::X1_Y_X2);
----------------
rahmanl wrote:
> Why not doing X2_Y_X1 here?
I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?
================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
+
----------------
spupyrev wrote:
> spupyrev wrote:
> > davidxl wrote:
> > > rahmanl wrote:
> > > > The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
> > > The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
> > Added some (possibly naive) test. Let me know if you have something more complex in mind.
> I added some toy test, let me know if you have something more complex in mind.
Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).
foo
| \
| \10
| \
| v
|17 foo.bb1
| /
| / 10
v v
foo.bb2
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D113424/new/
https://reviews.llvm.org/D113424
More information about the llvm-commits
mailing list