[PATCH] D113424: ext-tsp basic block layout

Fri Nov 19 11:22:18 PST 2021

spupyrev added inline comments.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:262-263
+  void clear() {
+    Blocks.clear();
+    Edges.clear();
+  }
----------------
rahmanl wrote:
> Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).
I don't think that makes any measurable difference, as we're working with relatively small instances. But explicitly freeing up memory won't hurt, i guess.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:137
+private:
+  double Score{-1.0};
+  size_t MergeOffset{0};
----------------
rahmanl wrote:
> Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.
I feel like negative score is a bit better reflects the intention here: it indicates that the merge is disadvantageous for the quality or simply "invalid". A score of "0" means that the merge is neutral for perf, so it is up to the algorithm to decide whether such merge needs to be done.
At some point we're experimenting with merging of 0-score chains (e.g., two chains connected by a cold jump). It might be beneficial for certain cases, e.g., for code size. I guess we can play with this more in the future

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:634
+
+    // Attach (a part of) ChainPred before the first block of ChainSucc
+    for (auto &Jump : ChainSucc->blocks().front()->InJumps) {
----------------
rahmanl wrote:
> spupyrev wrote:
> > rahmanl wrote:
> > > I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc and this is independent from the size of the ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit /overhead in a better way.
> > I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. Not sure if the separation would noticeable simplify reviewing.
> > 
> > I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they have been tuned a few years ago on a couple of workloads running on a particular hardware. I would really appreciate the community help.
> > 
> > For now, I'd however keep the params and optimizations as is, so that the existing customers do not see sudden regressions.
> Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero.
> 
> One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like `PredChain->blocks[Offset]->ForcedSucc != nullptr`) and calls `computeMergeGain` for each merge type. I think we can make this function more readable and concise as well.
Added an option to disable this extra optimization.

Not sure I understand the proposal here.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:683
+        // Try to split the chain in different ways
+        Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,
+                                MergeTypeTy::X1_Y_X2);
----------------
rahmanl wrote:
> rahmanl wrote:
> > Why not doing X2_Y_X1 here?
> I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?
Yes, it is not giving much benefits. I checked a binary with ~100K instances: Adding this type of splits makes a difference for 4 CFGs but it is responsible for ~25% of splits. Probably keeping the split doesn't worth it.

================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
+
----------------
rahmanl wrote:
> spupyrev wrote:
> > spupyrev wrote:
> > > davidxl wrote:
> > > > rahmanl wrote:
> > > > > The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
> > > > The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
> > > Added some (possibly naive) test. Let me know if you have something more complex in mind.
> > I added some toy test, let me know if you have something more complex in mind.
> Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).
> 
>          foo
>            |   \
>            |     \10
>            |       \
>            |         v
>            |17       foo.bb1
>            |          / 
>            |        /  10
>            v      v 
>          foo.bb2
Added this test (func4)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113424/new/

https://reviews.llvm.org/D113424