[llvm-dev] CFLAA
Daniel Berlin via llvm-dev
llvm-dev at lists.llvm.org
Thu Aug 25 09:55:31 PDT 2016
(and sys::cas_flag that STATISTIC uses is a uint32 ...)
On Thu, Aug 25, 2016 at 9:54 AM, Daniel Berlin <dberlin at dberlin.org> wrote:
> Okay, dumb question:
> Are you really getting negative numbers in the second column?
>
> 526,766 -136 mem2reg # PHI nodes inserted
>
> http://llvm.org/docs/doxygen/html/PromoteMemoryToRegister_8cpp_source.html
> (Search for NumPHIInsert).
>
> I don't see how it could be negative unless this wrapped around?
>
>
> On Thu, Aug 25, 2016 at 9:49 AM, David Callahan <dcallahan at fb.com> wrote:
>
>> I did gathered aggregate statistics reported by “-stats” over the ~400
>> test files.
>>
>> The following table summarizes the impact. The first column is the
>>
>> sum where the new analysis is enabled, the second column is the
>>
>> delta from baseline where no CFL alias analysis is performed. I am not
>>
>> experienced enough to know which of these are “good” or “bad” indicators.
>>
>> —david
>>
>>
>>
>> 72,250 685 SLP # vector instructions generated
>>
>> 1,256,401 566 adce # instructions removed
>>
>> 67,020,774 13,835,126 basicaa # times a GEP is decomposed
>>
>> 11,154 26 basicaa # times the limit to decompose GEPs
>> is reached
>>
>> 153,613 324 bdce # instructions removed (unused)
>>
>> 198,495 2 bdce # instructions trivialized (dead
>> bits)
>>
>> 298,621 0 cfl-od-aa Maximum compressed graph
>>
>> 58,462,719 0 cfl-od-aa Number Search Steps
>>
>> 48,401 0 cfl-od-aa # NoAlias results absed on address
>> roots
>>
>> 61,936 0 cfl-od-aa # NoAlias results on compressed
>> search path
>>
>> 3,768,131 0 cfl-od-aa # NoAlias results on fast path
>>
>> 47,016,909 0 cfl-od-aa # calls to query()
>>
>> 43,172,261 0 cfl-od-aa # instructions analyzed
>>
>> 10,515,257 0 cfl-od-aa # times there was no graph node for
>> a value
>>
>> 9,895,755 0 cfl-od-aa Total size of compressed graphs
>> (edges)
>>
>> 2,797 2 correlated-value-propagation # comparisons
>> propagated
>>
>> 66,515 -126 correlated-value-propagation # phis propagated
>>
>> 912 3 correlated-value-propagation # sdiv converted to
>> udiv
>>
>> 13,527 501 dse # other instrs removed
>>
>> 40,973 416 dse # stores deleted
>>
>> 126 -2 early-cse # compare instructions CVP'd
>>
>> 1,824,703 -138 early-cse # instructions CSE'd
>>
>> 1,875,417 87 early-cse # instructions simplified or DCE'd
>>
>> 62,505 1 functionattrs # arguments marked nocapture
>>
>> 29,979 1 functionattrs # arguments marked readonly
>>
>> 42,648 37 globaldce # functions removed
>>
>> 40,498 10 globaldce # global variables removed
>>
>> 4,368 35 gvn # blocks merged
>>
>> 21,961 26 gvn # equalities propagated
>>
>> 29,434 45 gvn # instructions PRE'd
>>
>> 631,597 3,307 gvn # instructions deleted
>>
>> 217,618 494 gvn # instructions simplified
>>
>> 51,089 634 gvn # loads PRE'd
>>
>> 135,568 1,526 gvn # loads deleted
>>
>> 2,197 4 indvars # IV comparisons eliminated
>>
>> 826 8 indvars # congruent IVs eliminated
>>
>> 2,538 4 indvars # exit values replaced
>>
>> 1,856 1 indvars # loop exit tests replaced
>>
>> 5,740,738 8 inline # caller-callers analyzed
>>
>> 1,629,169 3 inline # functions deleted because all
>> callers found
>>
>> 3,563,497 2 inline # functions inlined
>>
>> 10,879,125 86 inline-cost # call sites analyzed
>>
>> 34,766 5 instcombine # constant folds
>>
>> 3,979,078 2,004 instcombine # dead inst eliminated
>>
>> 6,323 2 instcombine # dead stores eliminated
>>
>> 1,522 4 instcombine # factorizations
>>
>> 254,146 66 instcombine # instructions sunk
>>
>> 10,427,131 1,749 instcombine # insts combined
>>
>> 57,943 -205 instcombine # reassociations
>>
>> 1,072 1 instsimplify # expansions
>>
>> 135,129 1 instsimplify # reassociations
>>
>> 121,777 246 instsimplify # redundant instructions removed
>>
>> 27,612 -12 jump-threading # branch blocks duplicated to
>> eliminate phi
>>
>> 76,000 197 jump-threading # jumps threaded
>>
>> 4,991 8 jump-threading # terminators folded
>>
>> 869,838 1,370 lcssa # live out of a loop variables
>>
>> 345,329 433 licm # instructions hoisted out of loop
>>
>> 702 -27 licm # instructions sunk out of loop
>>
>> 19,520 192 licm # load insts hoisted or sunk
>>
>> 202 37 licm # memory locations promoted to
>> registers
>>
>> 467,244 246 local # unreachable basic blocks removed
>>
>> 1,586 34 loop-delete # loops deleted
>>
>> 84 27 loop-idiom # memcpy's formed from loop
>> load+stores
>>
>> 752 7 loop-idiom # memset's formed from loop stores
>>
>> 63,364 -8 loop-rotate # loops rotated
>>
>> 4,602 1 loop-simplify # nested loops split out
>>
>> 1,244,741 472 loop-simplify # pre-header or exit blocks inserted
>>
>> 2,847 2 loop-unroll # loops completely unrolled
>>
>> 9,668 -29 loop-unroll # loops unrolled (completely or
>> otherwise)
>>
>> 5,799 -35 loop-unroll # loops unrolled with run-time trip
>> counts
>>
>> 3,863 25 loop-unswitch # branches unswitched
>>
>> 1,054,060 1,482 loop-unswitch Total number of instructions
>> analyzed
>>
>> 109,279 -3 loop-vectorize # loops analyzed for vectorization
>>
>> 526,766 -136 mem2reg # PHI nodes inserted
>>
>> 4,150,078 -3 mem2reg # alloca's promoted with a single
>> store
>>
>> 4,567 6 memcpyopt # memcpy instructions deleted
>>
>> 96 1 memcpyopt # memcpys converted to memset
>>
>> 1,074 173 memcpyopt # memmoves converted to memcpy
>>
>> 39,584 6 memcpyopt # memsets inferred
>>
>> 179,629 2,475 memdep # block queries that were completely
>> cached
>>
>> 1,020 -3 memdep # cached, but dirty, non-local ptr
>> responses
>>
>> 9,108,504 146,792 memdep # fully cached non-local ptr
>> responses
>>
>> 11,678,674 92,225 memdep # uncached non-local ptr responses
>>
>> 399,802 1,832 memory-builtins # arguments with unsolved size
>> and offset
>>
>> 10,844 -24,169 memory-builtins # load instructions with unsolved
>> size and offset
>>
>> 188,181 54 reassociate # insts reassociated
>>
>> 87,009 -82 scalar-evolution # loops with predictable loop
>> counts
>>
>> 402,724 71 scalar-evolution # loops without predictable loop
>> counts
>>
>> 133,310 72 sccp # basic blocks unreachable
>>
>> 275,949 263 sccp # instructions removed
>>
>> 2,056,414 723 simplifycfg # blocks simplified
>>
>> 5,292 -36 simplifycfg # common instructions sunk down to
>> the end block
>>
>> 15,110 1 simplifycfg # speculative executed instructions
>>
>> 43,068 -2 sroa Maximum number of uses of a partition
>>
>> 11,754,901 -180 sroa # alloca partition uses rewritten
>>
>> 4,623,115 -11 sroa # alloca partitions formed
>>
>> 5,927,727 -11 sroa # allocas analyzed for replacement
>>
>> 4,576,406 -5 sroa # allocas promoted to SSA values
>>
>> 13,770,636 -227 sroa # instructions deleted
>>
>> 3,797 -1 strip-dead-prototypes # dead prototypes removed
>>
>>
>>
>>
>>
>>
>>
>> From: Daniel Berlin <dberlin at dberlin.org>
>> Date: Thursday, August 25, 2016 at 9:06 AM
>> To: David Callahan <dcallahan at fb.com>
>> Cc: George Burgess IV <george.burgess.iv at gmail.com>, LLVM Dev Mailing
>> list <llvm-dev at lists.llvm.org>
>> Subject: Re: [llvm-dev] CFLAA
>>
>> Hey David,
>> I'll take a look at the patch :)
>> Sounds like fun work.
>>
>> As George says, improving AA significantly will almost always cause
>> significant performance regressions at first, in almost any compiler.
>>
>> Compilers knobs, passes, usually get tuned for x amount of freedom, and
>> if you give them 10x, they start moving things too far, vectorizing too
>> much, spilling, etc.
>>
>> This was definitely the case for GCC, where adding a precise
>> interprocedural field-sensitive analysis initially regressed performance by
>> a few percent on average.
>>
>> I know it was also the case for XLC at IBM, etc.
>>
>> Like anything else, just gotta figure out what passes are going nuts, and
>> rework them to have better heuristics/etc.
>> The end result is performance improvements, but the path takes a bit of
>> time.
>>
>> If you need a way to see whether your analysis has actually done an okay
>> job in the meantime, usually a good way to see if you are doing well or not
>> is to see how many loads/stores get eliminated or moved by various passes
>> before and after.
>>
>> If the number is significantly higher, great.
>> If the number is significantly lower, something has likely gone wrong :)
>>
>>
>> On Thu, Aug 25, 2016 at 8:11 AM, David Callahan via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> (Adding “LLVM Dev”)
>>>
>>> My variant is up as https://reviews.llvm.org/D23876
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__reviews.llvm.org_D23876&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=w5NvhJ0O9-ynWwh32R64KxDnRJN4Mv9OxUgD44L1GSI&e=>
>>> —david
>>>
>>>
>>> From: George Burgess IV <george.burgess.iv at gmail.com>
>>> Date: Wednesday, August 24, 2016 at 3:17 PM
>>> To: David Callahan <dcallahan at fb.com>
>>> Subject: Re: CFLAA
>>>
>>> Hi!
>>>
>>> > I see there is on going work with alias analysis and it appears the
>>> prior CFLAA has been abandoned.
>>>
>>> There was quite a bit of refactoring done, yeah. The original CFLAA is
>>> now called CFLSteens, and graph construction was moved to its own bit. We
>>> also have CFLAnders, which is based more heavily on the paper by Zheng and
>>> Rugina (e.g. no stratifiedsets magic).
>>>
>>> > I have a variant of it where I reworked how compression was done to be
>>> less conservative, reworked the interprocedural to do simulated but bounded
>>> inlining, and added code to do on-demand testing of CFL paths on both
>>> compressed and full graphs.
>>>
>>> Awesome!
>>>
>>> > Happy to share the patch with you if you are interested as well as
>>> some data collected
>>>
>>> Yes, please. Would you mind if I CC'ed llvm-dev on this thread (and a
>>> few people specifically, who also might find this interesting)?
>>>
>>> > However I was not able to see any performance improvements in the
>>> code. In fact on a various benchmarks there were noticeable regressions in
>>> measured performance of the generated code. Have you noticed any similar
>>> problems?
>>>
>>> I know that a number of people people in the community expressed
>>> concerns about how other passes will perform with better AA results (e.g.
>>> If LICM becomes more aggressive, register pressure may increase, which may
>>> cause us to spill when we haven't before, etc). So, such a problem isn't
>>> unthinkable. :)
>>>
>>> On Wed, Aug 24, 2016 at 2:56 PM, David Callahan <dcallahan at fb.com>
>>> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>>
>>>>
>>>> I see there is on going work with alias analysis and it appears the
>>>> prior CFLAA has been abandoned.
>>>>
>>>>
>>>>
>>>> I have a variant of it where I reworked how compression was done to be
>>>> less conservative, reworked the interprocedural to do simulated but bounded
>>>> inlining, and added code to do on-demand testing of CFL paths on both
>>>> compressed and full graphs.
>>>>
>>>>
>>>>
>>>> I reached a point where the ahead-of-time compression was linear but
>>>> still very accurate compared to on-demand path search and there were
>>>> noticeable improvements in the alias analysis results and impacted
>>>> transformations. Happy to share the patch with you if you are interested
>>>> as well as some data collected.
>>>>
>>>>
>>>>
>>>> However I was not able to see any performance improvements in the code.
>>>> In fact on a various benchmarks there were noticeable regressions in
>>>> measured performance of the generated code. Have you noticed any similar
>>>> problems?
>>>>
>>>>
>>>>
>>>> --david
>>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> <https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_llvm-2Ddev&d=DQMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=lFyiPUrFdOHdaobP7i4hoA&m=3IIr_u9iBJMmiJs5esz2CusHub4rwjMYvjBstOaOQTQ&s=tNsOAenrwSMjlSuk3LT6kVtwhCkamHx4Et1smBmoCXQ&e=>
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160825/25e19284/attachment-0001.html>
More information about the llvm-dev
mailing list