[PATCH] D16204: ValueTracking: Use fixed array for assumption exclude set in Query; NFC

Matthias Braun via llvm-commits llvm-commits at lists.llvm.org
Thu Jan 14 17:58:13 PST 2016


MatzeB added a comment.

In http://reviews.llvm.org/D16204#327541, @reames wrote:

> In http://reviews.llvm.org/D16204#327523, @MatzeB wrote:
>
> > In http://reviews.llvm.org/D16204#327512, @reames wrote:
> >
> > > In http://reviews.llvm.org/D16204#327492, @MatzeB wrote:
> > >
> > > > In http://reviews.llvm.org/D16204#327491, @reames wrote:
> > > >
> > > > > Given MaxDepth is less than the Small size specified for the SmallPtrSet, I'm not sure why this is faster?  The only difference I can come up with in the new code vs the old is a) the check for isSmall and b) the fact the insert code is outlined for SmallPtrSet.
> > > > >
> > > > > I'd really doubt that (a) is the cause.  If it is (b), we should just fix the SmallPtrSet impl to inline the common case of the insertion logic.
> > > > >
> > > > > Please try measuring with a tweaked version of SmallPtrSet.  Unless you can show that the hand rolled code is still much faster, I'm hesitant to take this change.
> > > >
> > > >
> > > > In the majority of the cases the set isn't used at all, I didn't see any performance differences when tweaking the insertion/query logic.
> > >
> > >
> > > I have to admit, this really surprises me.
> >
> >
> > 1. I did not modify any ADT headers, I only experimented within ValueTracking.cpp. With tuning the insertion/query logic I meant working on the Query structure inside ValueTracking.cpp, and if you read the code you need to have an llvm.assume intrinsic or 1 specific comparison for the set to be used so unsurprisingly that does not happen often.
>
>
> Ok, I think we're talking past each other.  I specifically asked you to run an experiment with a slightly altered implementation of SmallPtrSet to help isolate the cause of the performance improvement.  I thought you had said you'd done so, now it sounds like you have not.  I would like you to run that experiment or otherwise argue why the outlined insertion implementation isn't the primary cost you're saving here.


I will do that another time, as it is not relevant here: I added a counter and, as suspected, found 0 queries or insertions into the set across all my testcases, so we nearly never insert into or query the set. All we have to worry about in the normal case is the construction and size of the set.
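The counting experiment described above can be reproduced with a small wrapper; a standalone sketch (the real measurement would use LLVM's STATISTIC macro inside ValueTracking.cpp, and the CountingSet name is purely illustrative) might look like:

```cpp
#include <cassert>
#include <set>

// Illustrative version of the counting experiment: wrap the exclude set's
// insert/count operations and bump counters.  If both counters stay near
// zero over a compile, construction cost dominates, not insert/query cost.
struct CountingSet {
  std::set<const void *> Impl;
  unsigned Inserts = 0, Queries = 0;

  void insert(const void *V) {
    ++Inserts;          // count every insertion into the set
    Impl.insert(V);
  }
  bool count(const void *V) {
    ++Queries;          // count every membership query
    return Impl.count(V) != 0;
  }
};
```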

> > computeKnownBits() is indeed called a lot and if you compare the initialisation of a SmallPtrSet:
> >
> >       : SmallArray(SmallStorage), CurArray(SmallStorage), CurArraySize(SmallSize)
> >
> >     // ...
> >
> >     if (!isSmall() && NumElements*4 < CurArraySize && CurArraySize > 32)
> >       return shrink_and_clear();
> >
> >     // Fill the array with empty markers.
> >     memset(CurArray, -1, CurArraySize*sizeof(void*));
> >     NumElements = 0;
> >     NumTombstones = 0;
> >
> > compared to:
> >
> >       : NumExcluded(0)
> >
> > You also get additional const void **SmallArray;, const void **CurArray;, unsigned CurArraySize;, unsigned NumElements;, unsigned NumTombstones; members for the SmallPtrSet.
>
> In the small case, all of these should be zero initialized except for CurArray.  This should reduce to a handful of stores.  I'm not going to say it's impossible that such a minor difference matters, but I'm suspicious.  You could convince me by adding a couple of dummy fields to the Query and showing that they do have the performance change you're claiming they do.
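For reference, the fixed-array alternative being discussed can be sketched as follows. This is a simplified standalone model, not the actual patch: MaxDepth, the ExcludeSet name, and the copy-plus-one constructor are assumptions based on the description in this thread.

```cpp
#include <algorithm>
#include <cassert>

struct Value {};  // stands in for llvm::Value

// Sketch of the Query side of this change: a fixed-size array plus a count
// replaces the SmallPtrSet.  Construction is a single store (NumExcluded = 0);
// there are no pointer members to set up and no memset on clear.
struct ExcludeSet {
  static const unsigned MaxDepth = 6;  // recursion bound assumed from ValueTracking
  const Value *Excluded[MaxDepth];
  unsigned NumExcluded;

  ExcludeSet() : NumExcluded(0) {}

  // Copy the parent set and add one exclusion, mirroring how a new Query
  // wraps an existing one with a single extra excluded value.
  ExcludeSet(const ExcludeSet &Parent, const Value *V)
      : NumExcluded(Parent.NumExcluded) {
    std::copy(Parent.Excluded, Parent.Excluded + NumExcluded, Excluded);
    if (NumExcluded < MaxDepth)
      Excluded[NumExcluded++] = V;
  }

  bool isExcluded(const Value *V) const {
    // Linear scan is fine: the array is tiny and almost always empty.
    return std::find(Excluded, Excluded + NumExcluded, V) !=
           Excluded + NumExcluded;
  }
};
```

Since the set is almost never populated (per the counter experiment above), the linear scan in isExcluded is a non-issue, and the default constructor touches a single word instead of several members.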


You are right to be sceptical: I did some experiments and found that doing some memsetting, or adding additional members to the Query structure, has no measurable performance impact. However, adding an unused SmallPtrSet<void*, 8> Dummy; member does trigger the performance regression, so something more complicated appears to be going on here...

> > I don't know if this is really the instructions or the stack space saving and I am hoping it's not just cache effects because the stack alignment changed. But I did take care to measure a release clang, on a machine with most daemons killed, ASLR disabled, 3 different longer testcases (combine.c from 403.gcc, sqlite3 amalgamation, webkit javascript interpreter in a single file) run multiple times, so I am confident the measurements themselves are real.
>
> This is good information.  I'll believe you the difference is real; I'm just trying to understand why.  Once we know why, we can decide if this is the right fix and whether there are other ways we could continue to improve the performance of this code.
>
> > I am not sure what you mean by Stack Discipline, can you elaborate (or give me a link to an explanation of the concept)?
>
> What I meant in this case is that we appear to be using a Query object to wrap an existing one with a single extra exclusion added to it.  This is conceptually a stack of exclusions along with a single unchanging Query.  We could explicitly model this as a Query without exclusions and an explicit exclusion stack that we push/pop from on each recursive call.  We could even introduce an ExclusionStackHelper to do the push/pop for us implicitly.
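The push/pop idea above could be realized with a small RAII helper. A minimal sketch, assuming a shared exclusion stack on the Query (the ExclusionStackHelper name comes from the review; everything else here is illustrative, not from the actual patch):

```cpp
#include <vector>

struct Value {};  // stands in for llvm::Value

struct Query {
  // One shared stack, mutated in place instead of copied per wrapped Query.
  std::vector<const Value *> ExclusionStack;

  bool isExcluded(const Value *V) const {
    for (const Value *E : ExclusionStack)
      if (E == V)
        return true;
    return false;
  }
};

// RAII guard: pushes an exclusion on entry to a recursive call and pops it
// automatically when the scope is left, so the stack stays balanced even on
// early returns.
class ExclusionStackHelper {
  Query &Q;
public:
  ExclusionStackHelper(Query &Q, const Value *V) : Q(Q) {
    Q.ExclusionStack.push_back(V);
  }
  ~ExclusionStackHelper() {
    Q.ExclusionStack.pop_back();
  }
};
```

A recursive caller would create the helper at the top of its scope; the destructor guarantees the pop, which is the main advantage over manual push/pop pairs.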


In fact, I had the code changed to exactly this at one point. But since the code paths that actually create new Query objects are nearly never hit (see above), I saw no performance difference.


Repository:
  rL LLVM

http://reviews.llvm.org/D16204





More information about the llvm-commits mailing list