[PATCH] blockfreq: Rewrite block frequency analysis

Andrew Trick atrick at apple.com
Wed Mar 19 00:29:23 PDT 2014


On Mar 7, 2014, at 3:44 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:

> This patch series rewrites the shared implementation of BlockFrequencyInfo
> and MachineBlockFrequencyInfo entirely.
> 
>  - Patches 0001 and 0002 are simple cleanups.
> 
>  - Patch 0003 is a class rename (and file move).  Let me know if I should
>    just merge it with 0006.
> 
>  - Patches 0004 and 0005 introduce some supporting classes.
> 
>  - Patch 0006 rewrites BlockFrequencyInfoImpl entirely.
> 
>  - Patch 0007 is *not* included (it’s headed separately to llvmdev).

These patches all look pretty great to me. I have some general comments that aren't specific to this patch set; I'll make those in the RFC thread. I'll also ignore the soft-float issue here.

** PositiveFloat

I won't be able to check your logic, but that's ok. Your unit tests are comprehensive.

Personal nit: I prefer not to have to think about operator precedence, especially when it contradicts the left-to-right reading of the expression.

+                      N & UINT64_C(1) << (Shift - 1));
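
For what it's worth, '<<' binds tighter than '&', so the line above already applies the shift first; explicit parentheses would just make that obvious at a glance.  A quick standalone illustration (not the patch's code; the values are made up):

    #include <cassert>
    #include <cstdint>

    int main() {
      uint64_t N = 0x6, Shift = 3;
      // '<<' has higher precedence than '&', so these two spellings parse
      // the same way...
      assert((N & UINT64_C(1) << (Shift - 1)) ==
             (N & (UINT64_C(1) << (Shift - 1))));
      // ...and both differ from the naive left-to-right reading.
      assert((N & UINT64_C(1) << (Shift - 1)) !=
             ((N & UINT64_C(1)) << (Shift - 1)));
      return 0;
    }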

** BlockMass

I think it's strange having multiple block frequency helpers in the Support library. The implementation weirdness of this pass is spilling into our general utilities.

Superficially, it seems that BlockMass and BlockFrequency should be the same underlying fixed-point arithmetic type. There must be a good reason they aren't (more precision with BlockFrequency), and that needs to be explained. If we can't combine them into a general utility, maybe we should move them into IR. Maybe even put them in the same file.
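
To be concrete about what I mean by a shared fixed-point core, something like the following (just a sketch; the names and saturation behaviour are made up, this isn't what the patch implements):

    #include <cstdint>

    // A saturating 64-bit fixed-point value.  A "mass" flavour would pin
    // the binary point so that UINT64_MAX means 1.0; a "frequency" flavour
    // would reuse the same arithmetic with a different scale.
    class ToyFixedPoint {
      uint64_t Digits = 0;

    public:
      static ToyFixedPoint max() {
        ToyFixedPoint X;
        X.Digits = UINT64_MAX;
        return X;
      }
      ToyFixedPoint &operator+=(ToyFixedPoint O) {
        // Saturate instead of wrapping on overflow.
        Digits = (Digits + O.Digits < Digits) ? UINT64_MAX : Digits + O.Digits;
        return *this;
      }
      ToyFixedPoint &operator-=(ToyFixedPoint O) {
        Digits = (Digits < O.Digits) ? 0 : Digits - O.Digits;
        return *this;
      }
      uint64_t raw() const { return Digits; }
    };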

-Andy

> 
> The old implementation had a fundamental flaw:  precision losses from
> nested loops (and very wide branches) compounded past loop exits (and
> convergence points).
> 
> The @nested_loops testcase at the end of
> test/Analysis/BlockFrequencyAnalysis/basic.ll is motivating.  This
> function has three nested loops, with branch weights in the loop headers
> of 1:4000 (exit:continue).  The old analysis gives nonsense:
> 
>    Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
>    ---- Block Freqs ----
>     entry = 1.0
>     for.cond1.preheader = 1.00103
>     for.cond4.preheader = 5.5222
>     for.body6 = 18095.19995
>     for.inc8 = 4.52264
>     for.inc11 = 0.00109
>     for.end13 = 0.0
> 
> The new analysis gives correct results:
> 
>    Printing analysis 'Block Frequency Analysis' for function 'nested_loops':
>    block-frequency-info: nested_loops
>     - entry: float = 1.0, int = 8
>     - for.cond1.preheader: float = 4001.0, int = 32007
>     - for.cond4.preheader: float = 16008001.0, int = 128064007
>     - for.body6: float = 64048012001.0, int = 512384096007
>     - for.inc8: float = 16008001.0, int = 128064007
>     - for.inc11: float = 4001.0, int = 32007
>     - for.end13: float = 1.0, int = 8
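> 
> (As a sanity check against the branch weights: with exit:continue of
> 1:4000, each loop body executes an expected 4001 times per entry, so the
> nested blocks should come out at 4001, 4001^2 = 16008001, and
> 4001^3 = 64048012001, exactly as printed above.)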
> 
> The new algorithm leverages BlockMass and PositiveFloat to maintain
> precision, separates "probability mass distribution" from "loop
> scaling", and uses dithering to eliminate probability mass loss.
> 
> Additionally, the new algorithm has a complexity advantage over the old.
> The previous algorithm was quadratic in the worst case.  The new
> algorithm is still worst-case quadratic in the presence of irreducible
> control flow, but it's linear otherwise.
> 
> The key difference between the old algorithm and the new is that control
> flow within a loop is evaluated separately from control flow outside,
> limiting propagation of precision problems and allowing loop scale to be
> calculated independently of mass distribution.  Loops are visited
> bottom-up, their loop scales are calculated, and they are replaced by
> pseudo-nodes.  Mass is then distributed through the function, which is
> now a DAG.  Finally, loops are revisited top-down to multiply through
> the loop scales and the masses distributed to pseudo-nodes.
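> 
> To make the loop-scale step concrete, here's a toy version in plain doubles
> (just an illustration; the patch uses BlockMass and PositiveFloat, and these
> names are made up):
> 
>     #include <cassert>
> 
>     int main() {
>       // Push one unit of mass through the loop body (a DAG once inner
>       // loops are collapsed) and see how much returns along the backedge.
>       double BackedgeMass = 0.75; // e.g. a header weighted 3:1 continue:exit
> 
>       // The loop scale is the expected trip count: everything that doesn't
>       // escape comes around again, so Scale = 1 / (1 - BackedgeMass).
>       double LoopScale = 1.0 / (1.0 - BackedgeMass);
>       assert(LoopScale == 4.0);
> 
>       // After collapsing the loop to a pseudo-node, a member's frequency is
>       // roughly its in-loop mass times the pseudo-node's frequency times
>       // the loop scale.
>       double PseudoNodeFreq = 1.0, MemberMass = 0.5;
>       double MemberFreq = PseudoNodeFreq * LoopScale * MemberMass;
>       assert(MemberFreq == 2.0);
>       return 0;
>     }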
> 
> There are some remaining flaws.
> 
>  - Irreducible control flow isn't ignored, but it also isn't modelled
>    correctly.  Nevertheless, the new algorithm should behave better
>    than the old one (at least it evaluates irreducible edges), though
>    mileage may vary.
> 
>  - Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
>    the 64-bit integer precision used downstream.  If downstream users
>    of BlockFrequencyInfo are updated to use PositiveFloat (instead of
>    BlockFrequency), this limitation can be removed.  It's not clear if
>    that's practical.
> 
>  - BlockFrequencyInfo does *not* leverage LoopInfo/MachineLoopInfo and
>    Loop/MachineLoop.  These are currently unsuitable because they use
>    quadratic storage and don't have the API needed to make this
>    algorithm efficient.  I looked into updating them, but downstream
>    users rely on the current API.
> 
>  - The "bias" calculation proposed recently on llvmdev is *not*
>    incorporated here.  A separate patch (0007) takes a stab at that;
>    I’ll post it to llvmdev soon.
> 
> I ran through the LNT benchmarks; there was some movement in both
> directions.
> 
> <0001-blockfreq-Use-const-in-MachineBlockFrequencyInfo.patch>
> <0002-blockfreq-Implement-Pass-releaseMemory.patch>
> <0003-blockfreq-Rename-class-to-BlockFrequencyInfoImpl.patch>
> <0004-blockfreq-Add-PositiveFloat-class.patch>
> <0005-blockfreq-Add-BlockMass.patch>
> <0006-blockfreq-Rewrite-BlockFrequencyInfoImpl.patch>
