[LLVMdev] Some thinking for callgraph builder

Wed May 24 09:04:09 PDT 2006

Hi,

In response to Chris's suggestion to construct a more precise callgraph 
builder.
I express my personal thinking here on possible refinement. Any advice is 
welcome.

1. Basic data structure changes. Current callgraph builder maintains two null 
nodes, expressing the meaning of  "anything calls !internal linkage 
functions" and "functions not defined in this unit call anything" 
respectively. However, although we know some functions call "anything" and 
others called by "anything", they just don't have explicit callgraph edges 
between them.
So I just want to:
    Merge these two null nodes to one representing "external unknown 
functions". It has edges to everything *might* be visible outside this unit, 
and has edges to itself from everything might call external functions having 
not appeared in this unit.  External functions already declared in this unit 
have their own callnodes and should have explicit edges to/from others. It 
gets its improvement if we have some "pre-knowledge" of external function 
already declared in this unit. For example, we are sure that standard libc 
function "strcmp" never calls *anything*(no matter what linkage it might be) 
in our own unit. 

2. Improvements regarding indirect calls and functions whose address being 
taken. Existing callgraph builder considers them rather conservatively.
I think the possible refinements are:
   a). Further classify the call edges to: MUST, ALIAS, MAY edges.
      MUST: for edges found though direct callsites or if we can prove that a 
function pointer called by some indirect call must alias some function 
(Actually I think this should be promoted to be direct callsite by 
some optimization). We have strong evidence that if A --> B, there must be 
some(if not all) run traces of the program that do make A call B. 
      ALIAS: for edges comes from indirect calls that may alias a set of 
functions. Furthur information could be maintained for which alias set a 
ALIAS edge belongs to. It means we have strong evidence that there must be 
some run traces of the program that make a caller call at least one of the 
functions belonging to an alias set, but we are not sure about which subset 
of it must be get called due to the convertive nature of any alias analysis.
      May: If there is a MAY edge from A to B, it only means we do not have 
strong evidence that A cannot call B. For example, if B's address is passed 
as a parameter to external A.
      Besides these three calling behaviors, if node A does not have an edge 
to B, We *MUST* have strong evidence that A cannot call B. Otherwise at least 
a MAY edge must exist from A to B.     

  b). Get more strong evidence about indirect calls and for functions whose 
address is being taken more than direct calls.
     Indirect call information could be constructed either from partial result 
of DSA as suggested by Andrew. And alternatively, it can use the alias 
information constructed by any AA. I think both interface may be 
provided to let clients chose between the generality and the tight 
integration with wonderful DSA.  
     As for the case of function address being taken. I think further analysis 
could help to find the evidence if the address is only propagated to be 
used within this unit and thus should not be visable to external node.  

  c) Light weight alias analysis for only function pointers may be good for 
clients whose interest is only function pointers alias or the construction of 
callgraph but not general point-to analysis. Because just as Milanova, et al 
pointed out in "Precise and Efficient Call Graph Construction for C programs 
with function pointers", "even inexpensive pointer analyses may provide 
precision comparable to the precision of expensive pointer anlayses, and 
therefore the use of more expensive analyses may be unnecessary." It could 
improve the performance and space used.

-- 
Regards,
Nai