[PATCH] blockfreq: Rewrite block frequency analysis

Duncan P. N. Exon Smith dexonsmith at apple.com
Tue Mar 18 23:53:29 PDT 2014


On 2014 Mar 18, at 17:33, Chandler Carruth <chandlerc at google.com> wrote:

> On Tue, Mar 18, 2014 at 4:52 PM, Andrew Trick <atrick at apple.com> wrote:
> I'd like to see the motivation for avoiding native floating point reiterated. Are we just concerned about determinism across non-754-compliant platforms, or is IEEE 754 insufficient for our needs? If it's just non-compliant platforms, then how many people care if an x87 build generates different code?
> 
> Neither GCC nor LLVM are IEEE 754 compliant host compilers, so I have no idea how you make this work regardless of the hardware platform. GCC has recently added a flag aimed at supporting a more strict mode, but it is quite recent and I have no idea how well the bugs are shaken out. LLVM has no such flag and is a long way from that conformance.

I was concerned about determinism and portability, but I hadn’t
thought about it much.  I just assumed floats were off-limits.

If hardware floats are an option, a long double is almost a drop-in
replacement for PositiveFloat, surely faster, and a negligible
maintenance burden.  I don’t know enough about portability to make
a call on that, though.

>  I know that so far our attempts to work around using a floating-point representation in ad-hoc ways have led to madness. But, it is worth asking one more time: can we avoid running into the dynamic range by artificially limiting the loop scale? i.e. once we know all the scales, is it possible to adjust them to avoid overflow?
> 
> I actually am interested in this approach and it's one of the things that I've been pondering since reading Duncan's original email.

The scale *is* limited here.  Really, what I want for this use case
is 64 bits on either side of 1.0, to give space for functions with
either deeply nested loops or very wide branches.

(More on this below.)

> Assuming we need soft-float, I'm sure you'll be able to demonstrate the performance advantage over APFloat. Using PositiveFloat might even generate a smaller dynamic footprint than reusing APFloat.

Yeah, I’m pretty confident it’ll be faster (since it really is a lot
simpler).  I haven’t had a chance to run any numbers yet.  The
maintenance burden is probably the real concern.

> In a side conversation with Duncan I hinted that I am kind of thinking that as well. Essentially, it seems increasingly like there is a useful arbitrary precision, accurate, and fast floating point class that could both be used here with precise semantics and in APFloat's consumers with carefully twisted semantics to match "hardware behavior". I'm interested in the potential for layering these things that way, but resistant to starting off with flat duplication.

There are 3 main bits of non-trivial code in PositiveFloat:
multiply64, divide64, and toString.  I could probably find a way to
share these with APFloat.  I just had a look at APFloat, and the
algorithms basically match.  The main complication is that APFloat
handles floats that have more than 64 bits of precision, but there
still might be a way to share core logic.
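For what it's worth, the core of the multiply is just a 64x64 =>
128-bit high-word multiply plus an exponent adjustment.  A rough
sketch of that kind of logic (invented names, and using the __int128
extension for brevity; the actual PositiveFloat code may differ):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Sketch only: multiply two normalized 64-bit mantissas (top bit set,
// i.e. values in [2^63, 2^64)) and return the new 64-bit mantissa plus
// the amount to add to the exponent.
std::pair<uint64_t, int> multiplyMantissas(uint64_t LHS, uint64_t RHS) {
  assert((LHS >> 63) && (RHS >> 63) && "expected normalized mantissas");
  unsigned __int128 Product = (unsigned __int128)LHS * RHS;
  uint64_t Hi = (uint64_t)(Product >> 64);
  // The product is in [2^126, 2^128).  If the high word already has its
  // top bit set, we dropped 64 bits; otherwise shift up one and drop 63.
  if (Hi >> 63)
    return {Hi, 64};
  return {(Hi << 1) | (uint64_t)((Product >> 63) & 1), 63};
}
```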

===

Another option is to avoid floats altogether, as Andy mentioned.

I spent some time this evening looking again at whether any sort of
float is necessary for the algorithm.  If we’re willing to use
approximate loop scales (i.e., powers of two), we can (essentially)
avoid the use of floats.

E.g.,

    entry -> header
    header -> exit (1/5)
    header -> a (1/5)
    header -> b (3/5)
    a -> header
    b -> header

In that graph, the loop scale should be 5.  If we use 4 (a power of
two) instead, it’s possible to do the math without implementing a
floating point divide.  Then we have block frequencies of
(correct => calculated):

  - entry:  1.0 => 1.0
  - header: 5.0 => 4.0
  - a:      1.0 => 0.8
  - b:      3.0 => 2.4
  - exit:   1.0 => 1.0
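To illustrate the shift-based math (a toy sketch of mine, not code
from the patch):

```cpp
#include <cassert>
#include <cstdint>

// Round the loop scale down to a power of two by taking floor(lg).
// For the graph above the exact scale is 5, which rounds to 2^2 = 4.
unsigned floorLgScale(uint64_t Scale) {
  unsigned Lg = 0;
  while (Scale >>= 1)
    ++Lg;
  return Lg;
}

// Applying the approximate scale to a block frequency is then a plain
// shift -- no floating point divide required.
uint64_t applyLoopScale(uint64_t Freq, unsigned LgScale) {
  return Freq << LgScale;
}
```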

An earlier iteration used this sort of fuzzy math.  I hit a problem
and thought the soft-float solution was an easy out.  It’s solvable
though.

===

In more detail:  the loop scale can be stored as its lg (base-2
logarithm), a value by which to shift the mass.  I originally kept
the loop scales
separate from the masses, and multiplied the masses by each other
and the loop scales by each other when unwrapping loops (actually,
added the stored values for loop scales, since they’re stored as
lg).  One of my goals was to expose the mass and loop scale
downstream.  However, the masses zeroed out when bootstrapping
clang with LTO, and the register allocator had trouble with the
resulting block frequencies.  Apparently, a lot gets inlined in
bin/opt, leading to lots of loops and lots of branching.

I can fix it by combining the scales and masses dynamically; as the
masses gain zeros in the upper bits, I can shift them left and take
away from the loop scales.
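Roughly like this (invented names; not the patch's actual
representation):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of keeping a combined mass/scale pair normalized: when the mass
// develops zeros in its upper bits, shift the precision back in and pay
// for it out of the stored lg of the loop scale.
struct ScaledMass {
  uint64_t Mass;   // fixed-point mass; ideally the top bit stays set
  int16_t LgScale; // conceptually, frequency = Mass * 2^LgScale

  void normalize() {
    if (!Mass)
      return;
    while (!(Mass >> 63)) { // top bit clear => precision going to waste
      Mass <<= 1;
      --LgScale;            // take the factor of two out of the scale
    }
  }
};
```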

My observation was that this is halfway to a soft-float
implementation, so I just made myself a soft-float that was easy to
test in isolation.

A better idea might have been to send an RFC at that point ;).
