[LLVMdev] Landing my new development on the trunk ...

Thu Nov 11 15:26:09 PST 2010

Evan Cheng <evan.cheng <at> apple.com> writes:
 > Eli is right. We do need to see some benchmark numbers and understand 
how the
pass will fit in the target
 > independent optimizer. While we encourage contribution, we typically 
don't
commit new passes unless it
 > introduce new functionalities that have active clients. It would also 
help if
you provide us with compile
 > time numbers.
 >
 > Evan

Evan,

Let's restart this conversation ...

I propose that I test OSR versus LSR compile time performance via

opt shell.ll -mem2reg -licm -osr         -instcombine -stats 
-time-passes -disable-output

and

opt shell.ll -mem2reg -licm -loop-reduce -instcombine -stats 
-time-passes -disable-output

Caveats which may skew any compile time testing:

1. LST is target dependent and was developed to run closer to code 
generation time.  Thus, we are testing LSR out of its normal context.  
Also as mentioned previously, -instcombine is not normally called after 
LSR (though LSR does create opportunities for -instcombine)

2. OSR finds loops in the SSA graph and will find more reduction 
opportunities than LSR.  Thus, OSR may spend more time doing strength 
reduction, since OSR does more work.

3. From Dan Gohman's reply, "LSR's primary goal is to reduce integer 
register pressure, with strength reduction being just one of its 
available tools." Thus, LSR may be doing more than just strength reduction.

4. It is nontrivial to compare OSR and LSR optimizations in a given 
program, especially the larger tests in the test suite where the compile 
time differences may be the largest.

These caveats makes it a bit like comparing apples and oranges.

I created a simple test for the shell sort with a triple nested loop.  I 
hand verified that both OSR and LSR performed the same strength 
reductions. Here are the results:

% opt shell.ll -mem2reg -licm -osr -instcombine -stats -time-passes 
-disable-output
===-------------------------------------------------------------------------=== 

                           ... Statistics Collected ...
===-------------------------------------------------------------------------=== 

  5 instcombine - Number of dead inst eliminated
  1 instcombine - Number of instructions sunk
12 instcombine - Number of insts combined
  5 mem2reg     - Number of PHI nodes inserted
  8 mem2reg     - Number of alloca's promoted
  5 mem2reg     - Number of alloca's promoted with a single store
  9 osr         - Number of operators reduced

===-------------------------------------------------------------------------=== 

                       ... Pass execution timing report ...
===-------------------------------------------------------------------------=== 

   Total Execution Time: 0.0058 seconds (0.0058 wall clock)

    ---User Time---   --System Time--   --User+System--   ---Wall 
Time---  --- Name ---
    0.0018 ( 34.5%)   0.0001 ( 26.5%)   0.0020 ( 33.8%)   0.0020 ( 
34.1%)  Combine redundant instructions
    0.0007 ( 13.2%)   0.0001 ( 25.5%)   0.0008 ( 14.2%)   0.0008 ( 
14.2%)  Promote Memory to Register
    0.0007 ( 13.8%)   0.0000 (  6.6%)   0.0008 ( 13.3%)   0.0008 ( 
13.2%)  Operator Strength Reduction
    0.0005 (  9.2%)   0.0001 ( 16.1%)   0.0006 (  9.7%)   0.0006 (  
9.6%)  Dominator Tree Construction
    0.0005 (  8.7%)   0.0000 (  5.3%)   0.0005 (  8.4%)   0.0005 (  
8.4%)  Module Verifier
    0.0003 (  6.3%)   0.0000 (  4.0%)   0.0004 (  6.2%)   0.0004 (  
6.1%)  Loop Invariant Code Motion
    0.0003 (  5.4%)   0.0000 (  3.6%)   0.0003 (  5.3%)   0.0003 (  
5.3%)  Natural Loop Information
    0.0002 (  4.2%)   0.0000 (  5.3%)   0.0003 (  4.3%)   0.0003 (  
4.3%)  Dominance Frontier Construction
    0.0002 (  4.3%)   0.0000 (  4.9%)   0.0003 (  4.4%)   0.0002 (  
4.3%)  Canonicalize natural loops
    0.0000 (  0.2%)   0.0000 (  1.7%)   0.0000 (  0.3%)   0.0000 (  
0.3%)  No Alias Analysis (always returns 'may' alias)
    0.0000 (  0.2%)   0.0000 (  0.4%)   0.0000 (  0.2%)   0.0000 (  
0.2%)  Preliminary module verification
    0.0053 (100.0%)   0.0005 (100.0%)   0.0058 (100.0%)   0.0058 
(100.0%)  Total

% opt shell.ll -mem2reg -licm -loop-reduce -instcombine -stats 
-time-passes -disable-output
===-------------------------------------------------------------------------=== 

                           ... Statistics Collected ...
===-------------------------------------------------------------------------=== 

  7 instcombine      - Number of dead inst eliminated
  2 instcombine      - Number of instructions sunk
19 instcombine      - Number of insts combined
  5 mem2reg          - Number of PHI nodes inserted
  8 mem2reg          - Number of alloca's promoted
  5 mem2reg          - Number of alloca's promoted with a single store
  2 scalar-evolution - Number of loops with predictable loop counts
  2 scalar-evolution - Number of loops without predictable loop counts

===-------------------------------------------------------------------------=== 

                       ... Pass execution timing report ...
===-------------------------------------------------------------------------=== 

   Total Execution Time: 0.0114 seconds (0.0114 wall clock)

    ---User Time---   --System Time--   --User+System--   ---Wall 
Time---  --- Name ---
    0.0050 ( 46.0%)   0.0001 ( 26.6%)   0.0051 ( 45.1%)   0.0051 ( 
45.1%)  Loop Strength Reduction
    0.0020 ( 18.3%)   0.0001 ( 17.2%)   0.0021 ( 18.3%)   0.0021 ( 
18.3%)  Combine redundant instructions
    0.0013 ( 12.0%)   0.0001 ( 12.0%)   0.0014 ( 12.0%)   0.0014 ( 
12.0%)  Induction Variable Users
    0.0005 (  4.5%)   0.0001 ( 14.8%)   0.0006 (  5.0%)   0.0006 (  
5.0%)  Promote Memory to Register
    0.0005 (  4.8%)   0.0000 (  5.6%)   0.0005 (  4.8%)   0.0005 (  
4.8%)  Natural Loop Information
    0.0003 (  3.0%)   0.0000 (  2.6%)   0.0003 (  3.0%)   0.0003 (  
3.0%)  Loop Invariant Code Motion
    0.0003 (  2.4%)   0.0000 (  3.6%)   0.0003 (  2.5%)   0.0003 (  
2.5%)  Module Verifier
    0.0003 (  2.4%)   0.0000 (  1.6%)   0.0003 (  2.4%)   0.0003 (  
2.4%)  Induction Variable Users
    0.0002 (  2.1%)   0.0000 (  2.6%)   0.0002 (  2.1%)   0.0002 (  
2.2%)  Canonicalize natural loops
    0.0002 (  1.7%)   0.0000 (  7.2%)   0.0002 (  1.9%)   0.0002 (  
1.9%)  Dominator Tree Construction
    0.0002 (  1.6%)   0.0000 (  1.4%)   0.0002 (  1.6%)   0.0002 (  
1.6%)  Canonicalize natural loops
    0.0001 (  0.8%)   0.0000 (  1.8%)   0.0001 (  0.9%)   0.0001 (  
0.9%)  Dominance Frontier Construction
    0.0000 (  0.3%)   0.0000 (  1.2%)   0.0000 (  0.3%)   0.0000 (  
0.3%)  Scalar Evolution Analysis
    0.0000 (  0.1%)   0.0000 (  1.6%)   0.0000 (  0.2%)   0.0000 (  
0.2%)  No Alias Analysis (always returns 'may' alias)
    0.0000 (  0.0%)   0.0000 (  0.2%)   0.0000 (  0.1%)   0.0000 (  
0.0%)  Preliminary module verification
    0.0109 (100.0%)   0.0005 (100.0%)   0.0114 (100.0%)   0.0114 
(100.0%)  Total

If you think that the above is a fair methodology for comparing OSR and 
LSR compile times, I can run additional tests.  Which tests are an open 
question.

Brian