[all-commits] [llvm/llvm-project] ad2a59: [CSSPGO] Introducing dangling pseudo probes.

Wed Mar 3 22:45:14 PST 2021

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: ad2a59f5840482d7dd802e83b82262c97704a4eb
      https://github.com/llvm/llvm-project/commit/ad2a59f5840482d7dd802e83b82262c97704a4eb
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-03-03 (Wed, 03 Mar 2021)

  Changed paths:
    M llvm/include/llvm/CodeGen/MachineInstr.h
    M llvm/include/llvm/IR/PseudoProbe.h
    M llvm/include/llvm/MC/MCPseudoProbe.h
    M llvm/include/llvm/ProfileData/SampleProf.h
    M llvm/lib/CodeGen/PseudoProbeInserter.cpp
    M llvm/lib/ProfileData/SampleProf.cpp
    A llvm/test/Transforms/SampleProfile/pseudo-probe-dangling.mir
    M llvm/test/tools/llvm-profdata/Inputs/pseudo-probe-profile.proftext
    M llvm/test/tools/llvm-profdata/merge-probe-profile.test

  Log Message:
  -----------
  [CSSPGO] Introducing dangling pseudo probes.

Dangling probes are the probes associated with an empty block. This usually happens when all real instructions are optimized away from the block. There is a problem with dangling probes during the offline counts processing. The way the sample profiler works is that samples collected on the first physical instruction following a probe will be counted towards the probe. This is logically equivalent to treating the instruction next to a probe as if it were from the same block as the probe. In the dangling probe case, the real instruction following a dangling probe actually starts a new block, and samples collected on the new block may cause issues when counted towards the empty block.

To mitigate this issue, we first try to move a dangling probe around inside its owning block. If there are still native instructions preceding the probe in the same block, we can use them as a placeholder to collect samples for the probe. A pass is added to walk each block backwards, looking for probes not followed by any real instruction and moving them before the first real instruction. This is done right before object emission.
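
The in-block move described above can be sketched as follows. This is a minimal model under stated assumptions, not the code from the patch: `Instr` and `fixTrailingProbes` are hypothetical names, a block is modeled as a plain vector, and "the first real instruction" is read as the first one met during the backward walk (that is, the last real instruction in the block).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a machine instruction: either a pseudo probe or a
// real (native) instruction.
struct Instr {
  bool IsProbe;
  int Id;
};

// Moves probes that trail the last real instruction to just before it.
// Returns the number of probes left dangling, which is nonzero only when the
// block contains no real instruction at all.
std::size_t fixTrailingProbes(std::vector<Instr> &Block) {
  // Walk backwards until the first real instruction is met.
  std::size_t LastReal = Block.size();
  for (std::size_t I = Block.size(); I-- > 0;) {
    if (!Block[I].IsProbe) {
      LastReal = I;
      break;
    }
  }
  if (LastReal == Block.size())
    return Block.size(); // Only probes (or nothing): they all stay dangling.

  // Rotate the real instruction past the trailing probes. The relative order
  // of the probes is preserved.
  std::rotate(Block.begin() + LastReal, Block.begin() + LastReal + 1,
              Block.end());
  assert(!Block.back().IsProbe && "block must now end with a real instruction");
  return 0;
}
```

In this model every moved probe is followed by a real instruction of its own block, so samples on that instruction can be counted towards the probe. A nonzero return value corresponds to the case where the probes have to be tagged as dangling instead.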

If we are unlucky and find no such in-block preceding instruction for a probe, the solution we are taking is to tag the probe as dangling so that the samples reported for it will not be trusted by the compiler. We leave it up to the counts inference algorithm to assign such probes a reasonable count. The number `UINT64_MAX` is used as the sample count to mark a probe as dangling.
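
A minimal sketch of how a consumer might treat the `UINT64_MAX` marker is shown below. The names `DanglingCount`, `isDangling`, `trustedCount` and `mergeCounts` are hypothetical, and the merge policy (a dangling count stays dangling, real counts saturate just below the marker) is an assumption made for illustration rather than a statement of what the patch does.

```cpp
#include <cassert>
#include <cstdint>

// Marker value named in the commit message; the constant's name is
// hypothetical.
constexpr uint64_t DanglingCount = UINT64_MAX;

inline bool isDangling(uint64_t Count) { return Count == DanglingCount; }

// Writes a usable count to Out and returns true only for trusted samples.
// For a dangling probe Out is left untouched, so the caller can fall back to
// counts inference.
inline bool trustedCount(uint64_t Raw, uint64_t &Out) {
  if (isDangling(Raw))
    return false;
  Out = Raw;
  return true;
}

// Assumed merge policy: dangling is sticky, and a sum of real counts
// saturates at UINT64_MAX - 1 so that it can never collide with the marker.
inline uint64_t mergeCounts(uint64_t A, uint64_t B) {
  if (isDangling(A) || isDangling(B))
    return DanglingCount;
  const uint64_t Max = DanglingCount - 1;
  const uint64_t Sum = (A > Max - B) ? Max : A + B;
  assert(!isDangling(Sum));
  return Sum;
}
```

Keeping the marker out of the range of real counts is the design point here: a reader of the profile can tell "no trustworthy samples" apart from "zero samples".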

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D95962

  Commit: 89855158228644b7be273055efd728b82ea82803
      https://github.com/llvm/llvm-project/commit/89855158228644b7be273055efd728b82ea82803
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-03-03 (Wed, 03 Mar 2021)

  Changed paths:
    M llvm/include/llvm/CodeGen/MachineBasicBlock.h
    M llvm/include/llvm/IR/PseudoProbe.h
    M llvm/lib/CodeGen/BranchFolding.cpp
    M llvm/lib/CodeGen/MachineBasicBlock.cpp
    M llvm/lib/CodeGen/TailDuplicator.cpp
    M llvm/lib/IR/PseudoProbe.cpp
    M llvm/lib/Transforms/IPO/SampleProfile.cpp
    M llvm/lib/Transforms/IPO/SampleProfileProbe.cpp
    M llvm/lib/Transforms/Scalar/JumpThreading.cpp
    M llvm/lib/Transforms/Utils/Local.cpp
    M llvm/lib/Transforms/Utils/SimplifyCFG.cpp
    A llvm/test/Transforms/SampleProfile/pseudo-probe-dangle.ll

  Log Message:
  -----------
  [CSSPGO] Unblocking optimizations by dangling pseudo probes.

This change fixes a couple of places where the pseudo probe intrinsic blocks optimizations because it is not naturally removable. To unblock those optimizations, the blocking pseudo probes are moved out of the original blocks and tagged dangling, rather than being literally removed. The reason is that once the original block is removed, we can no longer sample it. Instead of assigning it a zero weight, moving all its pseudo probes into another block and marking them dangling gives the counts inference a chance to assign them a more reasonable weight. We have not seen counts quality degradation in our experiments.

The optimizations being unblocked are:

	1. Removing conditional probes for if-converted branches. Conditional probes are tagged dangling when their homing branch arms are folded so that they will not be over-counted.
	2. Unblocking jump threading from removing empty blocks. Pseudo probes prevent jump threading from removing logically empty blocks that contain only one unconditional jump instruction.
	3. Unblocking SimplifyCFG and MIR tail duplication to thread empty blocks and blocks with redundant branch checks.

Since dangling probes are logically deleted, they should not consume any samples in LTO postLink. This can be achieved by setting their distribution factors to zero when dangled.
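
The move-and-tag step and the zero distribution factor can be sketched together as below. This is an illustrative model only: `Probe`, `Block`, `moveAndDangle` and `samplesFor` are hypothetical names, probes are appended to the destination block for simplicity, and the factor is modeled as a plain fraction in [0, 1].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical probe model: an id, a dangling tag, and the share of the
// profile count that this copy of the probe is allowed to claim.
struct Probe {
  uint32_t Id;
  bool Dangling;
  float Factor;
};

using Block = std::vector<Probe>;

// Moves every probe out of a block that is about to be removed. Each probe is
// tagged dangling and its distribution factor is set to zero.
void moveAndDangle(Block &From, Block &To) {
  for (Probe P : From) {
    P.Dangling = true;
    P.Factor = 0.0f;
    To.push_back(P);
  }
  From.clear();
}

// Number of samples a probe copy may consume from a profile count.
uint64_t samplesFor(const Probe &P, uint64_t ProfileCount) {
  assert(P.Factor >= 0.0f && P.Factor <= 1.0f);
  return static_cast<uint64_t>(static_cast<double>(ProfileCount) * P.Factor);
}
```

With a zero factor, a dangled probe stays in the IR for the counts inference to reason about but consumes no samples of its own.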

Reviewed By: wmi

Differential Revision: https://reviews.llvm.org/D97481

  Commit: c75da238b419516534f372f87c9fd707650ebf3f
      https://github.com/llvm/llvm-project/commit/c75da238b419516534f372f87c9fd707650ebf3f
  Author: Hongtao Yu <hoy at fb.com>
  Date:   2021-03-03 (Wed, 03 Mar 2021)

  Changed paths:
    M llvm/include/llvm/IR/PseudoProbe.h
    M llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
    M llvm/lib/CodeGen/PseudoProbeInserter.cpp
    M llvm/lib/IR/PseudoProbe.cpp
    M llvm/lib/Transforms/Scalar/JumpThreading.cpp
    M llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
    M llvm/test/Transforms/SampleProfile/pseudo-probe-dangle.ll
    A llvm/test/Transforms/SampleProfile/pseudo-probe-dedup.ll

  Log Message:
  -----------
  [CSSPGO] Deduplicating dangling pseudo probes.

Identical dangling probes are redundant since they all have the same semantics: they rely on the counts inference tool to get a reasonable count for the same original block. Therefore, there's no need to keep multiple copies of them. I've seen jump threading create tons of redundant dangling probes that slowed down the compiler dramatically. Other optimization passes can also produce redundant probes, though without an observed impact so far.

This change removes block-wise redundant dangling probes specifically introduced by jump threading. To support removing redundant dangling probes caused by all other passes, a final function-wise deduplication is also added.
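
The two deduplication scopes can be sketched as follows. This is a simplified model, not the patch itself: `Probe`, `Block`, `dedupInBlock` and `dedupInFunction` are hypothetical names, and probes are treated as identical when their ids match, which ignores any other context a real implementation may need to compare.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical probe model: only the id and the dangling tag matter here.
struct Probe {
  uint32_t Id;
  bool Dangling;
};

using Block = std::vector<Probe>;

// Drops dangling probes whose id is already in Seen and records new ones.
// Non-dangling probes are always kept. Returns the number of probes removed.
static std::size_t dropSeenDangling(Block &B, std::set<uint32_t> &Seen) {
  Block Kept;
  for (const Probe &P : B) {
    if (P.Dangling && !Seen.insert(P.Id).second)
      continue; // Redundant copy of a dangling probe already kept.
    Kept.push_back(P);
  }
  assert(Kept.size() <= B.size());
  const std::size_t Removed = B.size() - Kept.size();
  B.swap(Kept);
  return Removed;
}

// Block-wise deduplication: duplicates are only detected inside one block.
std::size_t dedupInBlock(Block &B) {
  std::set<uint32_t> Seen;
  return dropSeenDangling(B, Seen);
}

// Function-wise deduplication: one Seen set is shared by all blocks, so the
// first copy in block order survives.
std::size_t dedupInFunction(std::vector<Block> &F) {
  std::set<uint32_t> Seen;
  std::size_t Removed = 0;
  for (Block &B : F)
    Removed += dropSeenDangling(B, Seen);
  return Removed;
}
```

The block-wise pass is cheap enough to run right where duplicates are created, while the function-wise pass acts as a final sweep for copies that ended up in different blocks.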

An 18% size win of the .pseudo_probe section was seen for SPEC2017. No performance difference was observed.

Differential Revision: https://reviews.llvm.org/D97482

Compare: https://github.com/llvm/llvm-project/compare/647af31e7483...c75da238b419