[PATCH] D83265: [MBP] Use profile count to compute tail dup cost if it is available

Mon Jul 13 18:27:50 PDT 2020

Carrot marked 3 inline comments as done.
Carrot added a comment.

In D83265#2144298 <https://reviews.llvm.org/D83265#2144298>, @davidxl wrote:

> A few things:
>
> 1. can this be extended to the TailDup pass too?

This patch only changes the cost function, it doesn't change tail dup algorithms, especially the partial duplication is not implemented in TailDup pass.

> 2. what is the impact on text size?

The total text size of spec2006int is reduced by 7804 bytes.

> 3. what is the performance with AFDO and XFDO?

The performance benefit depends on the precision of profile count. So I expect the performance gain with AFDO is similar to FDO. But XFDO has less accurate profile count, so it should has less performance gain.

> 
> 
> 4. Can the threshold be tuned higher (with more performnace)?

The performance impact of tail duplication is mainly a function of two interfering factors, increased fall through (reduced taken branch) and increased code size (less effective usage of instruction cache).
For small binaries, the instruction cache is not a problem, so they prefer more tail duplications, thus smaller TailDupProfileCountThreshold.
For large binaries, it's not a good idea to trade a small number of increased fall through for worse instruction cache behavior. So we need to increase the bar for tail duplication, thus larger TailDupProfileCountThreshold.
The current default value of TailDupProfileCountThreshold is tuned to keep spec2006int not impacted, while still has 0.1% perf win for Goolge search benchmark. I can get another 0.1%  improvement for the search benchmark if I increase TailDupProfileCountThreshold to 500.

================
Comment at: llvm/lib/CodeGen/MachineBlockPlacement.cpp:420
+  /// The return value is used to model tail duplication cost.
+  BlockFrequency getBlockCountOrFrequency(const MachineBasicBlock *BB) {
+    if (UseProfileCount) {
----------------
davidxl wrote:
> This should probbaly return Optional<..>
See following reply.

================
Comment at: llvm/lib/CodeGen/MachineBlockPlacement.cpp:426
+      else
+        return 0;
+    }
----------------
davidxl wrote:
> Return None here. Also when None is returned, I think the caller needs to handle it conservatively -- perhaps resort to Freq based method
Look at the implementation of getBlockProfileCount and getBlockFreq.

BlockFrequency BlockFrequencyInfo::getBlockFreq(const BasicBlock *BB) const {
  return BFI ? BFI->getBlockFreq(BB) : 0;
}

Optional<uint64_t>
BlockFrequencyInfo::getBlockProfileCount(const BasicBlock *BB,
                                         bool AllowSynthetic) const {
  if (!BFI)
    return None;

  return BFI->getBlockProfileCount(*getFunction(), BB, AllowSynthetic);
}

In the same condition (!BFI), neither function returns meaningful result, but one returuns None, another returns 0.

Another situation None may be returned is the function has no profile count, so the result is actually a meaningful 0.

================
Comment at: llvm/lib/CodeGen/MachineBlockPlacement.cpp:3167
+  SmallVector<MachineBasicBlock *, 8> Succs;
+  for (MachineBasicBlock *Succ : BB->successors()) {
+    if (BlockFilter && !BlockFilter->count(Succ))
----------------
davidxl wrote:
> Is this change related?
It is not profile count cost model related.
It is a different tail dup improvement.
Do you want me to send another patch for it?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D83265/new/

https://reviews.llvm.org/D83265