[llvm] [Codegen] (NFC) Faster algorithm for MachineBlockPlacement (PR #91843)

Tue Jun 4 19:51:11 PDT 2024

huangjd wrote:

Some performance data, measured on a large internal proto (with 700 fields):

Before:
```
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 77.7118 seconds (77.7120 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  18.7052 ( 27.6%)   0.0006 (  0.0%)  18.7059 ( 24.1%)  18.7067 ( 24.1%)  Branch Probability Basic Block Placement
  12.0425 ( 17.8%)   0.1802 (  1.8%)  12.2227 ( 15.7%)  12.2205 ( 15.7%)  Loop Strength Reduction
  10.9024 ( 16.1%)   0.2040 (  2.1%)  11.1064 ( 14.3%)  11.1069 ( 14.3%)  Greedy Register Allocator #2
   2.3942 (  3.5%)   7.3894 ( 74.5%)   9.7835 ( 12.6%)   9.7841 ( 12.6%)  X86 Assembly Printer
   5.8601 (  8.6%)   1.9305 ( 19.5%)   7.7906 ( 10.0%)   7.7911 ( 10.0%)  X86 DAG->DAG Instruction Selection
   2.8717 (  4.2%)   0.0323 (  0.3%)   2.9040 (  3.7%)   2.9040 (  3.7%)  Live DEBUG_VALUE analysis
   2.7199 (  4.0%)   0.0010 (  0.0%)   2.7209 (  3.5%)   2.7210 (  3.5%)  Machine Instruction Scheduler
   2.4429 (  3.6%)   0.0004 (  0.0%)   2.4433 (  3.1%)   2.4434 (  3.1%)  Register Coalescer
   0.8445 (  1.2%)   0.0001 (  0.0%)   0.8446 (  1.1%)   0.8447 (  1.1%)  Machine Cycle Info Analysis
   0.7191 (  1.1%)   0.0003 (  0.0%)   0.7194 (  0.9%)   0.7195 (  0.9%)  Control Flow Optimizer
...
```

After:
```
===-------------------------------------------------------------------------===
                          Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 67.9013 seconds (67.9011 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  12.7974 ( 22.3%)   0.1834 (  1.7%)  12.9808 ( 19.1%)  12.9781 ( 19.1%)  Loop Strength Reduction
  11.1578 ( 19.4%)   0.2416 (  2.3%)  11.3993 ( 16.8%)  11.3999 ( 16.8%)  Greedy Register Allocator #2
   2.5642 (  4.5%)   7.6675 ( 73.0%)  10.2317 ( 15.1%)  10.2333 ( 15.1%)  X86 Assembly Printer
   6.0531 ( 10.5%)   2.2174 ( 21.1%)   8.2706 ( 12.2%)   8.2712 ( 12.2%)  X86 DAG->DAG Instruction Selection
   5.0775 (  8.8%)   0.0004 (  0.0%)   5.0779 (  7.5%)   5.0781 (  7.5%)  Branch Probability Basic Block Placement
   3.3444 (  5.8%)   0.0234 (  0.2%)   3.3678 (  5.0%)   3.3679 (  5.0%)  Live DEBUG_VALUE analysis
   2.7316 (  4.8%)   0.0013 (  0.0%)   2.7329 (  4.0%)   2.7331 (  4.0%)  Machine Instruction Scheduler
   2.2701 (  4.0%)   0.0006 (  0.0%)   2.2707 (  3.3%)   2.2709 (  3.3%)  Register Coalescer
   0.8733 (  1.5%)   0.0004 (  0.0%)   0.8738 (  1.3%)   0.8738 (  1.3%)  Control Flow Optimizer
   0.8678 (  1.5%)   0.0002 (  0.0%)   0.8680 (  1.3%)   0.8681 (  1.3%)  Machine Cycle Info Analysis
...
```
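
For a quick read of the two reports, the arithmetic can be sketched as below. This is only a convenience calculation over the User+System numbers copied from the reports above; the helper names (`speedup`, `percent_saved`) are made up for illustration, and a single run per configuration says nothing about noise.

```python
# User+System seconds copied from the "Before" and "After" timing reports.
before = {"total": 77.7118, "block_placement": 18.7059}
after = {"total": 67.9013, "block_placement": 5.0779}


def speedup(old: float, new: float) -> float:
    """How many times faster the new measurement is than the old one."""
    return old / new


def percent_saved(old: float, new: float) -> float:
    """Time saved, as a percentage of the old measurement."""
    return 100.0 * (old - new) / old


placement_speedup = speedup(before["block_placement"], after["block_placement"])
total_saved = percent_saved(before["total"], after["total"])

print(f"Basic Block Placement speedup: {placement_speedup:.2f}x")
print(f"Total execution time saved:    {total_saved:.1f}%")
```

On these numbers the placement pass comes out roughly 3.7x faster and total execution time drops by roughly 12.6%.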

I can't find an open-source proto with a non-trivial number of fields, or any source file with a similar structure (many loops in sequence), which is the use case for this optimization.
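
In the absence of such a file, one option might be to generate a synthetic input. The script below is a minimal sketch under that assumption: it emits a C function containing many loops in sequence. The function name `many_loops`, the loop bodies, and the count of 700 are all hypothetical choices, and it has not been checked that this shape actually exercises the same path in MachineBlockPlacement as the internal proto does.

```python
import sys


def make_sequential_loops_source(num_loops: int) -> str:
    """Return C source for one function with num_loops loops in sequence."""
    lines = ["void many_loops(int *data, int n) {"]
    for i in range(num_loops):
        # Each loop is independent and simply follows the previous one.
        lines.append(f"  for (int j = 0; j < n; ++j) data[j] += {i + 1};")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 700 mirrors the field count mentioned above; it is an arbitrary choice.
    sys.stdout.write(make_sequential_loops_source(700))
```

Redirecting the output to a `.c` file and compiling it with something like `clang -O2 -ftime-report -c` should produce a pass timing report comparable in form to the ones above, though the optimizer may well transform these simple loops before block placement runs.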

https://github.com/llvm/llvm-project/pull/91843