[cfe-commits] [Patch 4 of 5] Proposed fix for PR12850

Richard Smith richard at metafoo.co.uk
Tue Sep 4 18:21:10 PDT 2012


On Mon, Sep 3, 2012 at 9:41 AM, Andy Gibbs <andyg1001 at hotmail.co.uk> wrote:

> On Saturday, August 25, 2012 3:51 AM, Richard Smith wrote:
> > Have you measured the impact of this cache on "well-written" constexpr
> > functions (those not requiring memoization for efficiency)? What's the
> > memory usage and performance impact on, say,
> > test/SemaCXX/constexpr-turing.cpp, with a more complex Turing machine?
> > Can you provide some figures for typical and worst-case slowdowns?
>
> Ok, I've done some performance testing and here are some rough values.
>
> There are two clang test-cases which are interesting with regard to the
> memoisation feature, namely constexpr-turing and constexpr-nqueens.  I
> imagine that you would consider these to be "well written" -- it seems,
> at any rate, that you are the author, so I am guessing so! :o)
>

They have the property of not needing memoization in order to be efficient,
which is the one I was looking for.


> With constexpr-turing, we have a test-case that is unable to make use
> of the cached evaluations.  In this case, I see a 5% slow-down in
> compilation, but also a moderate additional ~200Kb memory consumption.
>

That's about 2KB per step, which seems rather a lot. Almost every function
call in this testcase is passing references to temporaries as arguments.
Perhaps that would be a reasonable criterion under which to disable the
caching -- would we realistically ever get cache hits with temporaries as
arguments? In practice, I think this would mean dropping your local cache.

Some part of the problem here is that APValue is a terrible format for
long-term storage of constant values. I think we could get around a 5x
reduction in the memory usage of APValues by replacing the representation
with something denser. (I had a patch for this out for review ~9 months
ago, but it'll take quite a bit of work to get it to apply cleanly to
trunk...)


> With constexpr-nqueens, we have a much more positive result in terms
> of a 19% speed-up in compilation, but at a cost of ~13Mb of additional
> memory.
>

That's a worryingly high memory overhead; again, this testcase passes lots
of temporaries as arguments.

Do you know where the speedup comes from? There should be very few repeated
constexpr function evaluations in this testcase. Are we evaluating some of
the top-level expressions (such as q8's initializer or the static_assert
condition) multiple times? I've observed in the past some pretty
pathological behavior where we evaluate non-type template arguments
repeatedly.

(In short, I wonder whether there's another bug here which this caching is
working around.)


> In my own code, I have seen anything in the range of 2% slow-down up to
> 25% speed-up -- this is, of course, only for "well written" constexpr
> functions since I get a 2000% to 5000% speed-up for the rest, at least
> for those I was willing to test without caching (the others being too
> slow)!
>
> In terms of additional compiler memory, I've seen up to 100Mb increase
> but this is really down to constexpr algorithms in use.
>
[...]

> We could still take the route of an attribute/f-switch pair to
> override the behaviour of the cache.  My suggestion, if you feel the
> performance changes are too costly, would be a -fconstexpr-cache on/
> off switch for global use and a __attribute__((constexpr-cache on/
> off)) for local use.


I would strongly prefer to not add user-visible knobs here. If we can add a
few simple rules for when to cache and when not to, I think that would be
much better.


> > There is a nasty corner case where the value of a global changes
> > between successive calls to a constexpr function (this happens if a
> > variable is read after it's zero-initialized but before it's
> > dynamically initialized, from a constexpr function called within its
> > own initializer). Your patch doesn't seem to handle that (though
> > perhaps I just missed it).
>
> I fixed the code to support your corner case.  I'm still a little bit
> surprised that it isn't outlawed, but I've seen your other posts on the
> net talking about what's permitted and not, so I'm more than willing to
> accept your better understanding.  For myself, I'll class it as "don't
> try this at home"!


It's definitely in that category, and the standard itself isn't perfectly
clear. Independent of what the standard says, this matters for expressions
which aren't constant expressions but which we constant-fold anyway: there,
we must correctly model the semantics whereby zero-initialization precedes
dynamic initialization.


> > Have you considered using llvm::FoldingSet for the cache? That would
> > be particularly useful if we ever want to allow arbitrary literal
> > types as non-type template arguments, or other such cases which might
> > require reducing an APValue to a FoldingSetNodeID.
>
> I believe this means supporting FoldingSetTrait for APValues, which is
> not currently done -- I don't have the time to do this at present.  Can
> it be left to another day?


I think this would significantly simplify your implementation (you wouldn't
need both a hash function and an equality comparison for APValue). You are
also still performing some unnecessary APValue copies, because your cache
is a DenseMap with APValues in its key and value types. It would be better
to allocate your cache entries with the ASTContext's BumpPtrAllocator,
which is easy to do if you can switch to a FoldingSet.


> > There seems to be a fair amount of APValue copying going on in this
> > change. For large APValues, that's not cheap. Perhaps you could
> > allocate the cache entry up-front, and evaluate the function arguments
> > and return value directly into its APValue slots, or swap them in?
>
> Done -- this took the constexpr-nqueens test-case from ~11% to ~19%
> speed-up, for example.
>
> > Does this have an impact on the constexpr recursion limit? For instance,
> > if f(500) recurses to depth 500, and g(500) recurses to depth 500 then
> > calls f(500), will the result of calling g(500) depend on whether I've
> > called f(500) already in this translation unit? That seems like a highly
> > unfortunate property, but it's consistent with how template
> > instantiations behave at least.
>
> Yes, there is an impact on recursion limit, but not in a negative way
> (in my opinion!), meaning that if f(500) is already cached, then calling
> g(500) doesn't run to a recursion depth of 1000, only 501, since f(500)
> is picked out of the cache and not recursively evaluated.


This still seems like it could lead to a flaky compilation experience (we
hit the limit if we include a.h then b.h, but not if we include them in the
other order). This problem already exists for templates, but I suspect
it'll be worse for constexpr, since I would expect deep recursion to occur
more often. Essentially, I would like the caching to be *entirely*
undetectable (other than in performance).


> > Thanks for looking into this!
>
> No problems.  I'll try and split out the revised patches and post them
> shortly...  Any further things before I do so?


Well, some more comments on the new patches:

Firstly, there's not really any point in splitting up the patches along
file boundaries as you have done. There's value in splitting up patches
when they can be decomposed into pieces which can stand alone (in
particular, the pieces should have their own tests and should be suitable
for committing separately) -- that makes reviewing easier.

Please don't use a static data member for your global cache. The global
cache should live on the ASTContext.

+      APValue FillerA, FillerB;
+      if (A.hasArrayFiller()) FillerA = A.getArrayFiller();
+      if (B.hasArrayFiller()) FillerB = B.getArrayFiller();

Use APValue *FillerA; etc. here, to avoid copying the APValue.

+    case APValue::Struct: {
+      unsigned bases = A.getStructNumBases();

This variable (and others) should be capitalized, per the coding standard.

+            if (SA->getLength()        != SB->getLength()        ||
+                SA->getCharByteWidth() != SB->getCharByteWidth() ||
+                SA->getBytes()         != SB->getBytes())
+              return false;

You can't use the contents of a string literal to determine equality for
APValue::LValue, since different string literals with the same contents do
not always produce the same result (the function could test the address).

I don't like changing the diagnostic consumer in order to implement your
cache tests. I would prefer you found another way to do this -- perhaps
adding a #pragma to produce a diagnostic containing this information?

Thanks!
Richard