[llvm-dev] CFLAA

Thu Aug 25 13:45:26 PDT 2016

Note that this kind of "graph CSE" is the basis of  HVN/HU/HRU in
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwjEupa6t93OAhUE1mMKHbsCDU4QFggeMAE&url=https%3A%2F%2Fwww.cs.ucsb.edu%2F~benh%2Fresearch%2Fpapers%2Fhardekopf07exploiting.pdf&usg=AFQjCNGNzQ6vgsfxWRMW3a5aA4fGGLADOQ&sig2=3mC64txCKq5daxQqMK7UIA&bvm=bv.130731782,d.cGc

You could do the same analysis on this graph.

On Thu, Aug 25, 2016 at 10:17 AM, David Callahan <dcallahan at fb.com> wrote:

> Here is a summary of the experiment behind this patch
> https://www.facebook.com/notes/david-callahan/llvm-cfl-alias-analysis/
> 10150648030284971
>
> From: Daniel Berlin <dberlin at dberlin.org>
> Date: Thursday, August 25, 2016 at 9:55 AM
>
> To: David Callahan <dcallahan at fb.com>
> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing
> list <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] CFLAA
>
> (and sys::cas_flag that STATISTIC uses is a uint32 ...)
>
> On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org>
> wrote:
>
>> Okay, dumb question:
>> Are you really getting negative numbers in the second column?
>>
>>     526,766      -136  mem2reg       # PHI nodes inserted
>>
>> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8c
>> pp_source.html
>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_docs_doxygen_html_PromoteMemoryToRegister-5F8cpp-5Fsource.html&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=dU2DwQMtmgh6cAhA6xnKC8SHBBR9WVaODVF8fDARIjk&s=2ZH2FuAaE0z-bdSkGBtIDBeN1Qw-28jivYLXVy158sU&e=>
>> (Search for NumPHIInsert).
>>
>> I don't see how it could be negative unless this wrapped around?
>>
>>
>> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com> wrote:
>>
>>> I did gathered aggregate statistics reported by “-stats” over the ~400
>>> test files.
>>>
>>> The following table summarizes the impact.  The first column is the
>>>
>>> sum where the new analysis is enabled, the second column is the
>>>
>>> delta from baseline where no CFL alias analysis is performed. I am not
>>>
>>> experienced enough to know which of these are “good” or “bad” indicators
>>> .
>>>
>>> —david
>>>
>>>
>>>
>>>      72,250       685  SLP           # vector instructions generated
>>>
>>>   1,256,401       566  adce          # instructions removed
>>>
>>>  67,020,774  13,835,126  basicaa       # times a GEP is decomposed
>>>
>>>      11,154        26  basicaa       # times the limit to decompose GEPs
>>> is reached
>>>
>>>     153,613       324  bdce          # instructions removed (unused)
>>>
>>>     198,495         2  bdce          # instructions trivialized (dead
>>> bits)
>>>
>>>     298,621         0  cfl-od-aa     Maximum compressed graph
>>>
>>>  58,462,719         0  cfl-od-aa     Number Search Steps
>>>
>>>      48,401         0  cfl-od-aa     # NoAlias results absed on address
>>> roots
>>>
>>>      61,936         0  cfl-od-aa     # NoAlias results on compressed
>>> search path
>>>
>>>   3,768,131         0  cfl-od-aa     # NoAlias results on fast path
>>>
>>>  47,016,909         0  cfl-od-aa     # calls to query()
>>>
>>>  43,172,261         0  cfl-od-aa     # instructions analyzed
>>>
>>>  10,515,257         0  cfl-od-aa     # times there was no graph node for
>>> a value
>>>
>>>   9,895,755         0  cfl-od-aa     Total size of compressed graphs
>>> (edges)
>>>
>>>       2,797         2  correlated-value-propagation  # comparisons
>>> propagated
>>>
>>>      66,515      -126  correlated-value-propagation  # phis propagated
>>>
>>>         912         3  correlated-value-propagation  # sdiv converted to
>>> udiv
>>>
>>>      13,527       501  dse           # other instrs removed
>>>
>>>      40,973       416  dse           # stores deleted
>>>
>>>         126        -2  early-cse     # compare instructions CVP'd
>>>
>>>   1,824,703      -138  early-cse     # instructions CSE'd
>>>
>>>   1,875,417        87  early-cse     # instructions simplified or DCE'd
>>>
>>>      62,505         1  functionattrs  # arguments marked nocapture
>>>
>>>      29,979         1  functionattrs  # arguments marked readonly
>>>
>>>      42,648        37  globaldce     # functions removed
>>>
>>>      40,498        10  globaldce     # global variables removed
>>>
>>>       4,368        35  gvn           # blocks merged
>>>
>>>      21,961        26  gvn           # equalities propagated
>>>
>>>      29,434        45  gvn           # instructions PRE'd
>>>
>>>     631,597     3,307  gvn           # instructions deleted
>>>
>>>     217,618       494  gvn           # instructions simplified
>>>
>>>      51,089       634  gvn           # loads PRE'd
>>>
>>>     135,568     1,526  gvn           # loads deleted
>>>
>>>       2,197         4  indvars       # IV comparisons eliminated
>>>
>>>         826         8  indvars       # congruent IVs eliminated
>>>
>>>       2,538         4  indvars       # exit values replaced
>>>
>>>       1,856         1  indvars       # loop exit tests replaced
>>>
>>>   5,740,738         8  inline        # caller-callers analyzed
>>>
>>>   1,629,169         3  inline        # functions deleted because all
>>> callers found
>>>
>>>   3,563,497         2  inline        # functions inlined
>>>
>>>  10,879,125        86  inline-cost   # call sites analyzed
>>>
>>>      34,766         5  instcombine   # constant folds
>>>
>>>   3,979,078     2,004  instcombine   # dead inst eliminated
>>>
>>>       6,323         2  instcombine   # dead stores eliminated
>>>
>>>       1,522         4  instcombine   # factorizations
>>>
>>>     254,146        66  instcombine   # instructions sunk
>>>
>>>  10,427,131     1,749  instcombine   # insts combined
>>>
>>>      57,943      -205  instcombine   # reassociations
>>>
>>>       1,072         1  instsimplify  # expansions
>>>
>>>     135,129         1  instsimplify  # reassociations
>>>
>>>     121,777       246  instsimplify  # redundant instructions removed
>>>
>>>      27,612       -12  jump-threading  # branch blocks duplicated to
>>> eliminate phi
>>>
>>>      76,000       197  jump-threading  # jumps threaded
>>>
>>>       4,991         8  jump-threading  # terminators folded
>>>
>>>     869,838     1,370  lcssa         # live out of a loop variables
>>>
>>>     345,329       433  licm          # instructions hoisted out of loop
>>>
>>>         702       -27  licm          # instructions sunk out of loop
>>>
>>>      19,520       192  licm          # load insts hoisted or sunk
>>>
>>>         202        37  licm          # memory locations promoted to
>>> registers
>>>
>>>     467,244       246  local         # unreachable basic blocks removed
>>>
>>>       1,586        34  loop-delete   # loops deleted
>>>
>>>          84        27  loop-idiom    # memcpy's formed from loop
>>> load+stores
>>>
>>>         752         7  loop-idiom    # memset's formed from loop stores
>>>
>>>      63,364        -8  loop-rotate   # loops rotated
>>>
>>>       4,602         1  loop-simplify  # nested loops split out
>>>
>>>   1,244,741       472  loop-simplify  # pre-header or exit blocks
>>> inserted
>>>
>>>       2,847         2  loop-unroll   # loops completely unrolled
>>>
>>>       9,668       -29  loop-unroll   # loops unrolled (completely or
>>> otherwise)
>>>
>>>       5,799       -35  loop-unroll   # loops unrolled with run-time trip
>>> counts
>>>
>>>       3,863        25  loop-unswitch  # branches unswitched
>>>
>>>   1,054,060     1,482  loop-unswitch  Total number of instructions
>>> analyzed
>>>
>>>     109,279        -3  loop-vectorize  # loops analyzed for vectorization
>>>
>>>     526,766      -136  mem2reg       # PHI nodes inserted
>>>
>>>   4,150,078        -3  mem2reg       # alloca's promoted with a single
>>> store
>>>
>>>       4,567         6  memcpyopt     # memcpy instructions deleted
>>>
>>>          96         1  memcpyopt     # memcpys converted to memset
>>>
>>>       1,074       173  memcpyopt     # memmoves converted to memcpy
>>>
>>>      39,584         6  memcpyopt     # memsets inferred
>>>
>>>     179,629     2,475  memdep        # block queries that were
>>> completely cached
>>>
>>>       1,020        -3  memdep        # cached, but dirty, non-local ptr
>>> responses
>>>
>>>   9,108,504   146,792  memdep        # fully cached non-local ptr
>>> responses
>>>
>>>  11,678,674    92,225  memdep        # uncached non-local ptr responses
>>>
>>>     399,802     1,832  memory-builtins  # arguments with unsolved size
>>> and offset
>>>
>>>      10,844   -24,169  memory-builtins  # load instructions with
>>> unsolved size and offset
>>>
>>>     188,181        54  reassociate   # insts reassociated
>>>
>>>      87,009       -82  scalar-evolution  # loops with predictable loop
>>> counts
>>>
>>>     402,724        71  scalar-evolution  # loops without predictable
>>> loop counts
>>>
>>>     133,310        72  sccp          # basic blocks unreachable
>>>
>>>     275,949       263  sccp          # instructions removed
>>>
>>>   2,056,414       723  simplifycfg   # blocks simplified
>>>
>>>       5,292       -36  simplifycfg   # common instructions sunk down to
>>> the end block
>>>
>>>      15,110         1  simplifycfg   # speculative executed instructions
>>>
>>>      43,068        -2  sroa          Maximum number of uses of a
>>> partition
>>>
>>>  11,754,901      -180  sroa          # alloca partition uses rewritten
>>>
>>>   4,623,115       -11  sroa          # alloca partitions formed
>>>
>>>   5,927,727       -11  sroa          # allocas analyzed for replacement
>>>
>>>   4,576,406        -5  sroa          # allocas promoted to SSA values
>>>
>>>  13,770,636      -227  sroa          # instructions deleted
>>>
>>>       3,797        -1  strip-dead-prototypes  # dead prototypes removed
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Daniel Berlin <dberlin at dberlin.org>
>>> Date: Thursday, August 25, 2016 at 9:06 AM
>>> To: David Callahan <dcallahan at fb.com>
>>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing
>>> list <llvm-dev at lists.llvm.org>
>>> Subject: Re: [llvm-dev] CFLAA
>>>
>>> Hey David,
>>> I'll take a look at the patch :)
>>> Sounds like fun work.
>>>
>>> As George says, improving AA significantly will almost always cause
>>> significant performance regressions at first, in almost any compiler.
>>>
>>> Compilers knobs, passes,  usually get tuned for x amount of freedom, and
>>> if you give them 10x, they start moving things too far, vectorizing too
>>> much, spilling, etc.
>>>
>>> This was definitely the case for GCC, where adding a precise
>>> interprocedural field-sensitive analysis initially regressed performance by
>>> a few percent on average.
>>>
>>> I know it was also the case for XLC at IBM, etc.
>>>
>>> Like anything else, just gotta figure out what passes are going nuts,
>>> and rework them to have better heuristics/etc.
>>> The end result is performance improvements, but the path takes a bit of
>>> time.
>>>
>>> If you need a way to see whether your analysis has actually done an okay
>>> job in the meantime, usually a good way to see if you are doing well or not
>>> is to see how many loads/stores get eliminated or moved by various passes
>>> before and after.
>>>
>>> If the number is significantly higher, great.
>>> If the number is significantly lower, something has likely gone wrong :)
>>>
>>>
>>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> (Adding “LLVM Dev”)
>>>>
>>>> My variant is up as https://reviews.llvm.org/D23876
>>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
>>>> —david
>>>>
>>>>
>>>> From: George Burgess IV <george.burgess.iv at gmail.com>
>>>> Date: Wednesday, August 24, 2016 at 3:17 PM
>>>> To: David Callahan <dcallahan at fb.com>
>>>> Subject: Re: CFLAA
>>>>
>>>> Hi!
>>>>
>>>> > I see there is on going work with alias analysis and it appears the
>>>> prior CFLAA has been abandoned.
>>>>
>>>> There was quite a bit of refactoring done, yeah. The original CFLAA is
>>>> now called CFLSteens, and graph construction was moved to its own bit. We
>>>> also have CFLAnders, which is based more heavily on the paper by Zheng and
>>>> Rugina (e.g. no stratifiedsets magic).
>>>>
>>>> > I have a variant of it where I reworked how compression was done to
>>>> be less conservative, reworked the interprocedural to do simulated but
>>>> bounded inlining, and added code to do on-demand testing of CFL paths on
>>>> both compressed and full graphs.
>>>>
>>>> Awesome!
>>>>
>>>> > Happy to share the patch with you if you are interested as well as
>>>> some data collected
>>>>
>>>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a
>>>> few people specifically, who also might find this interesting)?
>>>>
>>>> > However I was not able to see any performance improvements in the
>>>> code. In fact on a various benchmarks there were noticeable regressions in
>>>> measured performance of the generated code. Have you noticed any similar
>>>> problems?
>>>>
>>>> I know that a number of people people in the community expressed
>>>> concerns about how other passes will perform with better AA results (e.g.
>>>> If LICM becomes more aggressive, register pressure may increase, which may
>>>> cause us to spill when we haven't before, etc). So, such a problem isn't
>>>> unthinkable. :)
>>>>
>>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com>
>>>> wrote:
>>>>
>>>>> Hi Greg,
>>>>>
>>>>>
>>>>>
>>>>> I see there is on going work with alias analysis and it appears the
>>>>> prior CFLAA has been abandoned.
>>>>>
>>>>>
>>>>>
>>>>> I have a variant of it where I reworked how compression was done to be
>>>>> less conservative, reworked the interprocedural to do simulated but bounded
>>>>> inlining, and added code to do on-demand testing of CFL paths on both
>>>>> compressed and full graphs.
>>>>>
>>>>>
>>>>>
>>>>> I reached a point where the ahead-of-time compression was linear but
>>>>> still very accurate compared to on-demand path search and there were
>>>>> noticeable improvements in the alias analysis results and impacted
>>>>> transformations.  Happy to share the patch with you if you are interested
>>>>> as well as some data collected.
>>>>>
>>>>>
>>>>>
>>>>> However I was not able to see any performance improvements in the
>>>>> code. In fact on a various benchmarks there were noticeable regressions in
>>>>> measured performance of the generated code. Have you noticed any similar
>>>>> problems?
>>>>>
>>>>>
>>>>>
>>>>> --david
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>
>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/598d29a6/attachment.html>