[cfe-commits] [Patch 4 of 5] Proposed fix for PR12850

Mon Sep 3 09:41:43 PDT 2012

On Saturday, August 25, 2012 3:51 AM, Richard Smith wrote:
> Hi Andy,
>
> Editorializing somewhat, I think it's unfortunate that people expect
> constexpr evaluation to be asymptotically more efficient than the
> runtime evaluation of the same functions, and I'm a little concerned
> that this caching will lead people to write code which has surprising
> poor performance in the cases where a constexpr function gets invoked
> dynamically. But... g++ already implements a similar optimization, so
> we'll probably need this to handle code written for it.
>
> Have you measured the impact of this cache on "well-written" constexpr
> functions (those not requiring memoization for efficiency)? What's the
> memory usage and performance impact on, say,
> test/SemaCXX/constexpr-turing.cpp, with a more complex Turing machine?
> Can you provide some figures for typical and worst-case slowdowns?

Ok, I've done some performance testing and here are some rough values.

There are two clang test-cases which are interesting with regard to the
memoisation feature, namely constexpr-turing and constexpr-nqueens.  I
imagine that you would consider these to be "well written" -- it seems,
at any rate, that you are the author, so I am guessing so! :o)

With constexpr-turing, we have a test-case that is unable to make use
of the cached evaluations.  In this case, I see a 5% slow-down in
compilation, but also a moderate additional ~200Kb memory consumption.

With constexpr-nqueens, we have a much more positive result in terms
of a 19% speed-up in compilation, but at a cost of ~13Mb of additional
memory.

In my own code, I have seen anything in the range of 2% slow-down up to
25% speed-up -- this is, of course, only for "well written" constexpr
functions since I get a 2000% to 5000% speed-up for the rest, at least
for those I was willing to test without caching (the others being too
slow)!

In terms of additional compiler memory, I've seen up to 100Mb increase
but this is really down to constexpr algorithms in use.

As a comparison, constexpr-torture consumes an additional ~10Mb, but
finishes in ~1.5 seconds as opposed to upwards of 3 hours (the point
where I terminated it).

We could still take the route of an attribute/f-switch pair to
override the behaviour of the cache.  My suggestion, if you feel the
performance changes are too costly, would be a -fconstexpr-cache on/
off switch for global use and a __attribute__((constexpr-cache on/
off)) for local use.

> There is a nasty corner case where the value of a global changes
> between successive calls to a constexpr function (this happens if a
> variable is read after it's zero-initialized but before it's
> dynamically initialized, from a constexpr function called within its
> own initializer). Your patch doesn't seem to handle that (though
> perhaps I just missed it).

I fixed the code to support your corner case.  I'm still a little bit
surprised that it isn't outlawed, but I've seen your other posts on the
net talking about what's permitted and not, so I'm more than willing to
accept your better understanding.  For myself, I'll class it as "don't
try this at home"!

> Have you considered using llvm::FoldingSet for the cache? That would
> be particularly useful if we ever want to allow arbitrary literal
> types as non-type template arguments, or other such cases which might
> require reducing an APValue to a FoldingSetNodeID.

I believe this means supporting FoldingSetTrait for APValues, which is
not currently done -- I don't have the time to do this at present.  Can
it be left to another day?

> There seems to be a fair amount of APValue copying going on in this
> change. For large APValues, that's not cheap. Perhaps you could
> allocate the cache entry up-front, and evaluate the function arguments
> and return value directly into its APValue slots, or swap them in?

Done -- this took the constexpr-nqueens test-case from ~11% to ~19%
speed-up, for example.

> Does this have an impact on the constexpr recursion limit? For instance,
> if f(500) recurses to depth 500, and g(500) recurses to depth 500 then
> calls f(500), will the result of calling g(500) depend on whether I've
> called f(500) already in this translation unit? That seems like a highly
> unfortunate property, but it's consistent with how template
> instantiations behave at least.

Yes, there is an impact on recursion limit, but not in a negative way
(in my opinion!), meaning that if f(500) is already cached, then calling
g(500) doesn't run to a recursion depth of 1000, only 501, since f(500)
is picked out of the cache and not recursively evaluated.

> Thanks for looking into this!

No problems.  I'll try and split out the revised patches and post them
shortly...  Any further things before I do so?

Cheers
Andy