[PATCH] D16204: ValueTracking: Use fixed array for assumption exclude set in Query; NFC

Thu Jan 14 17:20:53 PST 2016

MatzeB added a comment.

In http://reviews.llvm.org/D16204#327512, @reames wrote:

> In http://reviews.llvm.org/D16204#327492, @MatzeB wrote:
>
> > In http://reviews.llvm.org/D16204#327491, @reames wrote:
> >
> > > Given MaxDepth is less than the Small size specified for the SmallPtrSet, I'm not sure why this is faster?  The only difference I can come up with in the new code vs the old is a) the check for isSmall and b) the fact the insert code is outlined for SmallPtrSet.
> > >
> > > I'd really doubt that (a) is the cause.  If it is (b), we should just fix the SmallPtrSet impl to inline the common case of the insertion logic.
> > >
> > > Please try measuring with a tweaked version of SmallPtrSet.  Unless you can show that the hand rolled code is still much faster, I'm hesitant to take this change.
> >
> >
> > In the majority of the cases the set isn't used at all, I didn't see any performance differences when tweaking the insertion/query logic.
>
>
> Wow, that was a fast turn around.  How'd you managed to build something after touching an ADT header that fast much less performance test it?  I want to know your secret.  :)
>
> I have to admit, this really surprises me.
>
> > The gains appear to come from the simplified initialization (and maybe reduced stack usage?).
>
>
> Can you back up this claim?  I'm open to being convinced this is true, but it would imply that we're creating *a lot* of Query objects.  Do you believe that to be true?  If so, we could explore using a stack discipline for the exclusions rather than passing around copies of the full Query.

1. I did not modify any ADT headers, I only experimented within ValueTracking.cpp. With tuning the insertion/query logic I meant working on the Query structure inside ValueTracking.cpp, and if you read the code you need to have an llvm.assume intrinsic or 1 specific comparison for the set to be used so unsurprisingly that does not happen often.

computeKnownBits() is indeed called a lot and if you compare the initialisation of a SmallPtrSet:

   : SmallArray(SmallStorage), CurArray(SmallStorage), CurArraySize(SmallSize)
  // ...
       if (!isSmall() && NumElements*4 < CurArraySize && CurArraySize > 32)
        return shrink_and_clear();

      // Fill the array with empty markers.
      memset(CurArray, -1, CurArraySize*sizeof(void*));
      NumElements = 0;
      NumTombstones = 0;

compared to:

  : NumExcluded(0)

You also get additional const void **SmallArray;, const void **CurArray;, unsigned CurArraySize;, unsigned NumElements;, unsigned NumTombstones; members for the SmallPtrSet.

I don't know if this is really the instructions or the stack space saving and I am hoping it's not just cache effects because the stack alignment changed. But I did take care to measure a release clang, on a machine with most daemons killed, ASLR disabled, 3 different longer testcases (combine.c from 403.gcc, sqlite3 amalgamation, webkit javascript interpreter in a single file) run multiple times, so I am confident the measurements themself are real.

Repository:
  rL LLVM

http://reviews.llvm.org/D16204