[cfe-dev] [Analyzer] Attempting to speed up static analysis

Wed Aug 26 20:59:24 PDT 2015

Hi,

Thank you Ted for this great writeup on the analyzer internals! I have
noticed one possible algorithmic performance improvement. See my comment
inline.

On 26 August 2015 at 19:53, Ted Kremenek via cfe-dev <cfe-dev at lists.llvm.org
> wrote:

> Hi Ben,
>
> Thank you for taking a crack at looking at how to speed up the analyzer.
> The last time this was done was years ago, and I’m glad to see some fresh
> eyes taking a look at it.
>
> I think to answer your questions I think it is worth stepping back and
> understanding a bit better how the pieces of the analyzer work together,
> and how they accomplish things as a whole.  While the profile data is
> illuminating, I don’t think it provides a total picture of what is going on.
>
> To put things in perspective, the static analyzer employs a path-sensitive
> dataflow analysis, or analogously, a symbolic execution.  In the worst case
> this is a super exponential algorithm.  Analysis is essentially driven by a
> worklist algorithm with a finite number of items that can be removed from
> the worklist, which is essentially bounds the amount of work.  That still
> means the analyzer can take a long time, but it is guaranteed to terminate.
>
> Because the algorithm is super exponential, in practice the real
> performance wins are usually algorithmic.  Essentially, they can fall into
> two categories:
>
> 1. Don’t repeat the same work.  The main example of this in the analyzer
> right now is to perform memoization so that the same path is not analyzed
> more than once.  The FoldingSet code that you see show up in the profile is
> crucial to this memoization.  It’s not surprising that it shows up a lot,
> but keep in mind that it is being used to do intelligent caching so that we
> don’t get super exponential complexity in practice.
>
> 2. Do the same work more intelligently in a way that, algorithmically,
> takes less time.  One example of this is that currently the analyzer
> unrolls loops 3 times, but there are potential ways to analyze some loops
> without unrolling.  You thus can get the same analysis quality with less
> work, and increase scalability.
>
> Speeding up the raw data structures in isolation can sometimes be a win,
> but usually this results in a constant factor win at best.  That really
> doesn’t chip away at the scalability problem.  From your email, it sounds
> like you’re initially interested in reducing the current analysis time with
> keeping the analyzer doing the same amount of work.  That’s great, but also
> keep in mind there is a bunch of analysis that is not done right now
> because the analyzer doesn’t scale.  Finding ways to truly make the
> analyzer scale to larger problems requires much deeper changes to the
> analyzer, but will likely result in far more profound performance wins in
> the end.
>
> Another crucial thing to take into account when profiling the analyzer is
> memory usage.  Some of the current tradeoffs were reached because keeping
> the memory usage down is a critical part of making the analyzer perform.
> There’s an immense amount of parallelism in the analyzer’s analysis; the
> most basic one that is used now is that multiple translation units can be
> analyzed in parallel, which means N number of clang processes doing static
> analysis (where N is the number of cores the machine can saturate).  If the
> memory usage is too high, the machine will start to thrash.  More
> intelligent scheduling of analyzer processes is also another way to
> increase the analyzer’s scalability in practice, and this is not a task
> anyone has taken on.  There’s also an immense amount of parallelism to be
> exploited when analyzing a single translation unit.  That can be done in
> two ways: parallelize the current worklist algorithm, which will be tricky
> (but not impossible) since many of the data structures used are not
> thread-safe, or move more towards a summary-based algorithm approach the a
> few of us having been chatting about for years (which would be the basis of
> doing whole program, interprocedural analysis).  Both are non-trivial
> efforts, but are likely to achieve HUGE wins in performance and scalability
> because they algorithmically scale up the performance of the analyzer.
>
> I’ll respond to your other points inline, since they have important
> context.
>
>
> On Aug 26, 2015, at 2:34 PM, Ben Craig via cfe-dev <cfe-dev at lists.llvm.org>
> wrote:
>
> I am trying to speed up the static analyzer.  I also haven’t really worked
> on LLVM or Clang much in the past, so some of this may be old news, or
> naïve, or just plain wrong.  Regardless, here are some of the things I’ve
> discovered while profiling and experimenting in the code base:
>
> 1.       Getting consistent static analysis times is difficult.  This has
> been making it difficult for me to quantify the change in performance as I
> make changes.  If anyone has suggestions, or even a standard benchmark to
> use, then it would help me out a lot.
> 2.       For my toy test case, roughly 10% of the CPU’s time is spent in
> the checkers (inclusive time / time including children).  This seems low.
> 3.       Roughly 20% (inclusive) is spent in the Immutable family of data
> structures.
> 4.       Roughly 25% (inclusive) is spent in the llvm::FoldingSet code.
> Much of this is memory / cache bound.
>
>
> Neither 3 or 4 is surprising.  This is basically the meat-and-potatoes of
> the analyzer engine.  Making these faster would obviously speed up the
> analyzer.
>
> 5.       In my toy test case, ExplodedGraph::getNode returns a new node
> ~99% of the time.  The majority of the old nodes found are purge nodes, but
> there are a few non-purge nodes that get returned.
>
> With all this in mind, it seems reasonable for me to focus on the data
> structures behind the static analyzer, rather than on the checkers.
>
>
> To be honest, I’m not convinced this is the best investment of your time.
> I don’t want to discourage you, but aren’t going to get the huge wins you
> allude to in the following sentence:
>
>   I suspect that there are some algorithmic changes hiding that could
> improve the speed by orders of magnitude, if I only understood the
> underlying code and their assumptions better.
>
>
> That said, speeding up the core data structures could be instructive… but
> you must do it with GREAT care.  The data structures chosen have specific
> semantic properties which make them viable.  I’ll comment on them more
> below.  But I do think there are wins to be made here, but a key
> requirement is that the memory usage of the analyzer doesn’t greatly
> increase.  In fact, I’d be willing to slow the analyzer down slightly if we
> could save more memory.  Using less memory makes the analyzer more viable
> on more complicated projects where we tend to hit the maximum analysis time.
>
>
> For example, with ExplodedGraph::Node, I think I am averaging 3 cache
> misses per ExplodedGraph::getNode call in order to support deduplication,
> even though less than 1% of lookups are going to be deduplicated.  Does
> anyone on this list know if there are some reasonable criteria to know in
> advance if a node needs to be stored for later deduplication?  If we know
> that a node won’t / doesn’t need to be reduplicated, then we could insert
> the node blindly into the hash map without looking at all the colliding
> nodes.
>
>
> Yes, there are a bunch of heuristics you can employ, but I think you’ll
> need to do a lot more profiling to be sure.
>
> Essentially the de-duplication happens when paths can possibly merge
> together.  I can see this in three places:
>
> (1) At confluence points in the CFG, where multiple branches merge
> together at one basic block.  At that point, multiple paths may have the
> same ProgramState, and thus one of them can “cache out”.  Profiling would
> show this in practice.
>
> (2) After nodes generated by a checker.  Checkers can generate new
> ProgramStates that after reaching some critical point in the analysis have
> information removed or otherwise reach an equilibrium, and thus
> ExplodedNodes from different paths may suddenly have the same ProgramStates
> (and thus cache out).  This is only a theory, however, in how much of an
> impact this really has in practice.  Profiling would show this in practice.
>
> (3) After a ProgramState has been cleaned out using RemoveDeadBindings.
> The ProgramState returned by RemoveDeadBindings may have a bunch of data
> cleansed from it that made it more unique from other states.  At that
> point, the path is more likely to cache out.  Again a theory, and one that
> can be verified by more profiling.
>
> When profiling, you’ll also want to analyze a variety of codebases, as the
> characteristics could be very different.
>
>
> The other big area that I’m confused by is the use and purpose of the
> ImmutableMap typedef Environment::BindingsTy, which maps EnvironmentEntry
> keys to SVal values.  5% of my time is getting spent in this instantiations
> ImmutableMap::Factory::add, largely because of a call to getCanonicalTree.
> Why is ImmutableMap being used instead of alternative associative types?
> Of particular note is that even iteration is expensive because ImmutableMap
> nodes can’t know their parent.  There are some other ImmutableMaps being
> used, but I’m hoping that explanations of the EnvironmentEntry->SVal
> instantiation will help me understand the other instantiations.
>
>
> Stepping back, ProgramStates are immutable values.  The symbolic
> simulation works by keeping all ProgramStates along a path, and generated a
> reachable state graph (the ExplodedGraph).  Since ProgramStates are likely
> to differ only slightly from each other, we use a functional data
> structure, ImmutableMap, that represents a mapping using a functional AVL
> tree.  When you add a value to an ImmutableMap, you really just create a
> new ImmutableMap and the original stays intact.   The trick is that
> ImmutableMaps *share* large amounts of their map representation (subtrees
> of the AVL tree) between each other, which represents a significant memory
> savings.  Alternatively, we could use a reference-counted std::map or
> something, and when we needed to add a value to the map to create a new
> ProgramState would could copy the entire map.  That would likely be very
> wasteful in practice.  Also, ImmutableMap uses a BumpAllocator, which is
> very fast for allocations.  This is the reason we use ImmutableMap.
>

So ImmutableMap right now uses an AVL tree, which has O(log(n)) complexity
for lookup, insertion and deletion. Finger search trees, however, can be
implemented as immutable data structures, and they have amortized O(log(n))
lookup and amortized O(1) insertion and deletion (source:
https://en.wikipedia.org/wiki/Finger_search_tree). If inserting a value
into the map is the most frequent operation this could be a huge win.
Ben, if I am not mistaken, this might be worth to investigate.

>
> But… your profiling data hits on something I saw before.
>  getCanonicalTree() gets called when we unique the maps.  This uniquing is
> critical for de-duplication of ProgramStates, but we don’t always need to
> de-duplicate ProgramStates unless we think there is a strong possibility we
> will want to merge two paths together.  That said, recall I mentioned that
> saving memory was also important.  We also want to de-duplicate
> ImmutableMaps, since they are used everywhere, for memory savings.  Again,
> I think profiling data could be the key here, and we can be in the position
> where we de-duplicate some ImmutableMaps we use less often than others
> based depending on what the profiling data says.  But the profiling data
> must include memory usage as a factor; otherwise it would be easy to
> optimize these CPU numbers but tank memory usage.
>
> Another thing to explore is that for smaller maps maybe we don’t want an
> ImmutableMap at all, but just something like an ImmutableList to record a
> bunch of key value pairs.  Once we de-duplicate states, we could
> canonicalize around use ImmutableMaps when creating a canonicalized
> ProgramState, or we can switch to ImmutableMap once the list gets too big.
> For example, Environment uses an ImmutableMap, but there is an access
> pattern to values in an Environment.  Typically we just want the
> subexpressions of the current expression.  If we are working with a binary
> operator or anything with multiple subexpressions an ImmutableMap might be
> good, but if there is only one subexpression sometimes an ImmutableMap is
> overkill.  Or perhaps we can create a data structure that is a mixture of
> both ImmutableLists and ImmutableMaps, chaining them together. Why would we
> do this?  It’s less work to manage a small list and an immutable map, but
> it would be an increase in implementation complexity (that could possibly
> be hidden by the right abstractions).
>
>
> Question summary:
> 1.       Is there a good, existing benchmark for static analysis that I
> should be using?
>
>
> In the past, I’ve often looked at:
>
> - sqlite3.c
> - postgresql
>
> These were good stable C baselines.  I’ve also analyzed the Python
> codebase in the past, which also exhibited some of the more algorithmic
> issues in the analyzer (analyzing the interpreter code in Python has to
> deal with a bunch of branches and switch statements).  Sqlite3 is nice
> since it is one file.
>
> LLVM/Clang is also a good codebase for C++.  I think we should also pick
> some other C++ codebases.
>
>
> 2.       Is there some reasonable criteria to know in advance if an
> ExplodedNode needs to be stored for later reduplication?
>
>
> See my comment above.
>
> 3.       Why is ImmutableMap<EnvironmentEntry, SVal> being used instead
> of std::map, std::hash_map, or even llvmFoldingSet?
>
>
> Hopefully I’ve answered this question.
>
> Cheers,
> Ted
>
>
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
> Foundation Collaborative Project
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__lists.llvm.org_cgi-2Dbin_mailman_listinfo_cfe-2Ddev&d=BQIGaQ&c=eEvniauFctOgLOKGJOplqw&r=UVc407_CCx3FapxjS2xZ9jo4Q91upSGpJHRF8fPPYVY&m=rDsVjZ5h8hLZXMQlZz7pvL5iklwjHNwyE9dfwASlyQQ&s=evEZF87HnXhB6gk0zZ-ejVmUiqWR-Qe5GWiFYtMqUBk&e=
>
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20150826/9ebc3344/attachment.html>