[PATCH] Remove 4,096 loop scale limitation.

Tue Mar 31 06:58:03 PDT 2015

> On 2015 Mar 31, at 05:07, Diego Novillo <dnovillo at google.com> wrote:
> 
> Hi dexonsmith,
> 
> This is part 1 of fixes to address the problems described in
> https://llvm.org/bugs/show_bug.cgi?id=22719.
> 
> The restriction to limit loop scales to 4,096 does not really prevent
> overflows anymore, as the underlying algorithm has changed and does
> not seem to suffer from this problem.
> 
> Additionally, artificially restricting loop scales to such a low number
> skews frequency information, making loops of equal hotness appear to
> have very different hotness properties.
> 
> The only loops that are artificially restricted to a scale of 4096 are
> infinite loops (those loops with an exit mass of 0). This prevents
> infinite loops from skewing the frequencies of other regions in the CFG.
> 
> At the end of propagation, frequencies are scaled to values that take no
> more than 32 bits to represent. This is to prevent issues with
> transformations that cannot deal with large numbers.
> 
> Tested on x86_64.
> 

This LGTM with some minor comments below.

> http://reviews.llvm.org/D8718
> 
> Files:
>  include/llvm/Analysis/BlockFrequencyInfoImpl.h
>  lib/Analysis/BlockFrequencyInfoImpl.cpp
>  test/Analysis/BlockFrequencyInfo/bad_input.ll
>  test/Analysis/BlockFrequencyInfo/loops_with_profile_info.ll

> Index: include/llvm/Analysis/BlockFrequencyInfoImpl.h
> ===================================================================
> --- include/llvm/Analysis/BlockFrequencyInfoImpl.h
> +++ include/llvm/Analysis/BlockFrequencyInfoImpl.h
> @@ -718,9 +718,6 @@
>  ///
>  /// It has some known flaws.
>  ///
> -///   - Loop scale is limited to 4096 per loop (2^12) to avoid exhausting
> -///     BlockFrequency's 64-bit integer precision.
> -///

Nice :).

>  ///   - The model of irreducible control flow is a rough approximation.
>  ///
>  ///     Modelling irreducible control flow exactly involves setting up and
> Index: lib/Analysis/BlockFrequencyInfoImpl.cpp
> ===================================================================
> --- lib/Analysis/BlockFrequencyInfoImpl.cpp
> +++ lib/Analysis/BlockFrequencyInfoImpl.cpp
> @@ -331,32 +331,35 @@
>    return true;
>  }
>  
> -/// \brief Get the maximum allowed loop scale.
> -///
> -/// Gives the maximum number of estimated iterations allowed for a loop.  Very
> -/// large numbers cause problems downstream (even within 64-bits).
> -static Scaled64 getMaxLoopScale() { return Scaled64(1, 12); }
> -
>  /// \brief Compute the loop scale for a loop.
>  void BlockFrequencyInfoImplBase::computeLoopScale(LoopData &Loop) {
>    // Compute loop scale.
>    DEBUG(dbgs() << "compute-loop-scale: " << getLoopName(Loop) << "\n");
>  
> +  // Infinite loops need special handling. If we give the back edge an infinite
> +  // mass, they may saturate all the other scales in the function down to 1,
> +  // making all the other region temperatures look exactly the same. Choose an
> +  // arbitrary scale to avoid these issues.
> +  //
> +  // TODO: An alternate way would be to select a symbolic scale which is later
> +  // replaced to be the maximum of all computed scales plus 1. This would
> +  // appropriately describe the loop as having a large scale, without skewing
> +  // the final frequency computation.
> +  const Scaled64 InifiniteLoopScale(1, 12);
> +
>    // LoopScale == 1 / ExitMass
>    // ExitMass == HeadMass - BackedgeMass
>    BlockMass ExitMass = BlockMass::getFull() - Loop.BackedgeMass;
>  
> -  // Block scale stores the inverse of the scale.
> -  Loop.Scale = ExitMass.toScaled().inverse();
> +  // Block scale stores the inverse of the scale. If this is an

Line seems short.  Reflow?

> +  // infinite loop, its exit mass will be zero. In this case, use an
> +  // arbitrary scale for the loop scale.
> +  Loop.Scale =
> +      ExitMass.isEmpty() ? InifiniteLoopScale : ExitMass.toScaled().inverse();
>  
>    DEBUG(dbgs() << " - exit-mass = " << ExitMass << " (" << BlockMass::getFull()
>                 << " - " << Loop.BackedgeMass << ")\n"
>                 << " - scale = " << Loop.Scale << "\n");
> -
> -  if (Loop.Scale > getMaxLoopScale()) {
> -    Loop.Scale = getMaxLoopScale();
> -    DEBUG(dbgs() << " - reduced-to-max-scale: " << getMaxLoopScale() << "\n");
> -  }
>  }
>  
>  /// \brief Package up a loop.
> @@ -427,12 +430,19 @@
>    // differentiated.  However, the register allocator currently deals poorly
>    // with large numbers.  Instead, push Min up a little from 1 to give some
>    // room to differentiate small, unequal numbers.

Scaling based on `Min` is probably fine long-term (when it doesn't cause
saturation at the top).  Maybe this comment should be reworked?

BTW, I haven't looked recently, but I suspect the problem downstream is
that some optimizations add frequencies together and don't deal well
with `BlockFrequency` saturating at UINT64_MAX (or possibly don't use
`BlockFrequency` for math and don't expect overflows).  Doesn't seem
like it would be too hard to expose some API similar to
`Distribution::normalize()` and fix them.

> -  //
> -  // TODO: fix issues downstream so that ScalingFactor can be
> -  // Scaled64(1,64)/Max.
> +  const unsigned MaxBits = 32;

Given that we have 64-bits available (and in the absence of data), I
would have chosen a large range (such as 60) and reduced it as necessary
(and/or fixed optimizations that couldn't handle it).  Is there a reason
you chose 32?

> +  const unsigned SpreadBits = (Max / Min).lg();
>    Scaled64 ScalingFactor = Min.inverse();

`Min.inverse()` uses long division; it kind of feels wrong to compute it
just to ignore the result (even though this wouldn't show up in a
profile).  Up to you though.  Maybe frugality makes the code too ugly.

> -  if ((Max / Min).lg() < 60)
> +  if (SpreadBits <= MaxBits - 3) {
> +    // If the values are small enough, make the scaling factor at least 8 to
> +    // allow distinguishing small values.
>      ScalingFactor <<= 3;
> +  } else if (SpreadBits > MaxBits) {
> +    // If the values need more than MaxBits to be represented, saturate small
> +    // frequency values down to 1 by using a scaling factor that benefits large
> +    // frequency values.
> +    ScalingFactor = Scaled64(1, MaxBits) / Max;
> +  }
>  
>    // Translate the floats to integers.
>    DEBUG(dbgs() << "float-to-int: min = " << Min << ", max = " << Max