[PATCH] D113424: ext-tsp basic block layout

Rahman Lavaee via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Nov 22 14:36:14 PST 2021


rahmanl added inline comments.


================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:137
+private:
+  double Score{-1.0};
+  size_t MergeOffset{0};
----------------
spupyrev wrote:
> rahmanl wrote:
> > Does this initialization serve a special purpose? We can make it zero if we always reject a zero score for merging.
> I feel like a negative score better reflects the intention here: it indicates that the merge is disadvantageous for quality, or simply "invalid". A score of "0" means that the merge is neutral for perf, so it is up to the algorithm to decide whether such a merge should be done.
> At some point we were experimenting with merging 0-score chains (e.g., two chains connected by a cold jump). It might be beneficial in certain cases, e.g., for code size. I guess we can play with this more in the future.
I think we would never merge 0-score chains. If there is any additional benefit (e.g., code size), it can be incorporated into the score. But I am OK with this.
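
For reference, a minimal sketch of the gain convention being discussed; the class shape and the `updateIfLessThan` helper below are assumptions based on this snippet and the thread, not the exact code in the patch:
  lang=c++
  #include <cstddef>

  // Sketch only: a negative initial score marks the gain as "invalid",
  // while a zero score is a perf-neutral merge that the algorithm may
  // still choose to reject.
  class MergeGainTy {
  public:
    bool isValid() const { return Score >= 0.0; }
    double score() const { return Score; }
    size_t mergeOffset() const { return MergeOffset; }
    // Keep the better (higher-score) of the two candidate gains.
    void updateIfLessThan(const MergeGainTy &Other) {
      if (Score < Other.Score) {
        Score = Other.Score;
        MergeOffset = Other.MergeOffset;
      }
    }
  private:
    double Score{-1.0};
    size_t MergeOffset{0};
  };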


================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:634
+
+    // Attach (a part of) ChainPred before the first block of ChainSucc
+    for (auto &Jump : ChainSucc->blocks().front()->InJumps) {
----------------
spupyrev wrote:
> rahmanl wrote:
> > spupyrev wrote:
> > > rahmanl wrote:
> > > > I see that the following two loops try to form a new fallthrough between ChainPred and ChainSucc, and this is independent of the size of ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit/overhead in a better way.
> > > I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. I'm not sure the separation would noticeably simplify reviewing.
> > > 
> > > I do agree with another point here: the algorithm may benefit from some fine-tuning. I don't think the currently used parameters and options (e.g., how exactly we split/merge the chains) are optimal; they were tuned a few years ago on a couple of workloads running on particular hardware. I would really appreciate the community's help.
> > > 
> > > For now, however, I'd keep the params and optimizations as is, so that existing customers do not see sudden regressions.
> > Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to zero.
> > 
> > One suggestion to make this function more compact: please consider adding a helper lambda that, given an offset and a list of merge types, does the necessary checks for the offset (like `PredChain->blocks[Offset]->ForcedSucc != nullptr`) and calls `computeMergeGain` for each merge type. I think this would make the function more readable and concise as well.
> Added an option to disable this extra optimization.
> 
> Not sure I understand the proposal here.
You can use something like the following helper lambda:
  lang=c++
  auto ComputeMergeGainWithOffsetAndTypes =
      [&](size_t Offset, const std::vector<MergeTypeTy> &MergeTypes) {
        if (Offset == 0 || Offset == ChainPred->blocks().size())
          return;
        auto BB = ChainPred->blocks()[Offset - 1];
        // Skip the splitting if it breaks a fallthrough (forced) successor
        if (BB->ForcedSucc != nullptr)
          return;
        for (auto &MergeType : MergeTypes)
          Gain.updateIfLessThan(
              computeMergeGain(ChainPred, ChainSucc, Jumps, Offset, MergeType));
      };
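
For illustration, the caller could then reduce to something like the following; the offset loop and the single-element merge-type list are just an example, not the exact set of merge types used in the patch:
  lang=c++
  // Hypothetical call site for the helper above; more merge types can be
  // appended to the list as needed.
  for (size_t Offset = 1; Offset < ChainPred->blocks().size(); Offset++)
    ComputeMergeGainWithOffsetAndTypes(Offset, {MergeTypeTy::X1_Y_X2});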


================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:683
+        // Try to split the chain in different ways
+        Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,
+                                MergeTypeTy::X1_Y_X2);
----------------
spupyrev wrote:
> rahmanl wrote:
> > rahmanl wrote:
> > > Why not do X2_Y_X1 here?
> > I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?
> Yes, it does not give much benefit. I checked a binary with ~100K instances: adding this type of split makes a difference for only 4 CFGs, yet it accounts for ~25% of the splits. Keeping this split is probably not worth it.
Great. Please add a comment to explain why this splitting is not exercised.
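
For readers following the thread, an informal sketch of the split notation (the enumerators below mirror the names used in this discussion; this is not the patch's exact definition):
  lang=c++
  // ChainPred (X) is split at Offset into X1 and X2; Y denotes ChainSucc.
  // Each merge type names the resulting block order.
  enum class MergeTypeTy {
    X_Y,      // no split: all of X followed by Y
    X1_Y_X2,  // Y is inserted between the two halves of X
    X2_Y_X1,  // additionally swaps the two halves of X; per the comment
              // above, this variant rarely helps and is not exercised
  };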


================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:4
+
+define void @func1a()  {
+; Test that the algorithm positions the most likely successor first
----------------
This is missing the ASCII CFG.


================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:305
+; CHECK2-LABEL: func4:
+; CHECK2: callq b
+; CHECK2: callq c
----------------
Please use the same names as in the CFG so this can be easily mapped. And please do this across all tests.


================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp_large.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=128 < %s | FileCheck %s
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 -ext-tsp-chain-split-threshold=1 < %s | FileCheck %s -check-prefix=CHECK2
----------------
I am not an expert on this, but I feel like this is not needed. Does it make a difference?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113424/new/

https://reviews.llvm.org/D113424


