[LLVMbugs] [Bug 22716] New: Need a mechanism to represent global profile counts in CFG and MachineCFG

Thu Feb 26 14:36:48 PST 2015

http://llvm.org/bugs/show_bug.cgi?id=22716

            Bug ID: 22716
           Summary: Need a mechanism to represent global profile counts in
                    CFG and MachineCFG
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Global Analyses
          Assignee: dnovillo at google.com
          Reporter: dnovillo at google.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

At the CFG level, LLVM does not have a way to represent BasicBlock and CFG Edge
profile execution counts. The profile count is important for making
inter-procedural  decisions (e.g, comparing callsite hotness of two callsites
in different callers, or compute the global savings when a callsite is
inlined). 

The existing block/edge frequency information is not really fit to represent
relatively hotness of BBs/edges intra-procedurally. Frequency information is
currently scaled, capped and it’s not even possible to use it for comparing
across different regions in the same CFG (because it is computed as a relative
value to the origin block).

Some options we could consider:

1. Introduce a BlockProfileCountInfo data structure similar to
BranchFrequencyInfo.  
Pros: no need to change existing profile data representation infrastructure
Cons: 
 a) one more profile related data structure - leading to more confusion
 b) increases memory usage in compiler 
 c) need to keep freq and count data in sync 

2. Use BlockFrequencyInfo to represent absolute profile count when profile data
is available.
Pros: no new data structure is needed -- maximal reuse of existing classes.
Cons: clients of the frequency data may assume the frequency data is normalized
in a range, and misbehave when that assumption is broken. Basically means more
clean up work when PGO is enabled. When PGO is off, nothing needs to be
changed.

3. Augment BlockFrequencyInfo to represent absolute profile count -- in other
words, the Freqs[] data still represent normalized intra-procedural frequency
data, BlockFrequencyInfo is augmented with the following 1) Function entry
global count 2) function entry freq. 1) and 2) is used to scale the freq to
real profile count in access APIs.
Pros: same as 2)
Cons: additional scaling is needed when accessing profile count data.

The above discusses the BB profile count representation. The remaining question
is when real profile/count data is available, do we still need to use the
frequency propagation algorithm to compute the frequency/count using the real
branch probability  data or directly translate from the BB count (from edge
count) using simple scaling?

Here is the description of the propagation algorithm:

1. walks the loop tree bottom up. For each loop
    a. walks the BB in the loop in rpro order and distribute the 'weight' or
'mass' of BB using branch probability
    b) compute the loop scale/trip count (using exist mass and header mass) 
  (inner loops are treated as one node)
2. walks the remaining BBs in function in rpro order to distribute mass as in 1
3. walks the loop tree top-down to scale up BB freq. The BB freq/mass computed
in 1 assumes loop header mass == 1. The scaling is done using loop
scale/iteration and loop's own mass (aka freq of preheader).

The algorithm has the following problems:
1) it does not handle irreducible loops well -- verified with simple test case
using real branch probability 
2) it limits the loop scale to <= 4096 for legacy reason -- this is a great
limiter to the precision of the result
3) The fixed point representation of the mass leads to precision loss. For the
small test case I tried, there resulting 'count' can be seen with >2%
difference than real profile count with well formed loops
4) the capping and off by one problem in tools/clang/lib/CodeGen/CodeGenPGO.cpp
-- this can distort results greatly.

When real profile data is available, it would probably be best to skip the
freq. propgation algorithm completely and use simple scaling for the above
reasons. It would also benefit compile time - the above propagation has O(V+E)
complexity for regular loops and up to O(V*E) for irreducible loops.

Representation of BlockFrequencyInfoImpl could also be improved

1) FrequenyData is a struct of Scaled64 and unint64_t members.  The fixed point
results in Scaled64 member is never exposed in the API after the frequency is
computed, but still kept in memory
2) it uses two level of indirections for a lookup:

  a) look at a denseMap to get the internal BlockNode from BasicBlock*
  b) use the index from BlockNode to access of vector of FrequencyData

It can be simplified to use map from BasicBlock* to FrequencyData. This will
also simplify profile update.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20150226/033d86f9/attachment.html>