[Openmp-dev] Some smaller patches.

Tue Feb 17 16:13:22 PST 2015

----- Original Message -----
> From: "Jonathan L Peyton" <jonathan.l.peyton at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: openmp-dev at dcs-maillist2.engr.illinois.edu, openmp-commits at dcs-maillist2.engr.illinois.edu
> Sent: Tuesday, February 17, 2015 4:43:25 PM
> Subject: RE: Some smaller patches.
> 
> > Are these now going to be always on by default? Can we add a build
> > flag to disable them?
> They are on by default in the build.pl (Makefile) build system.  I
> left them out of the CMake build system because I figured people
> could add them in if they wanted to via CFLAGS, etc.
> There isn't a real convenient way to add build flags to the build.pl
> system because it does not contain a configuration stage.  So I
> honestly don't know what to do about it.
> Also, sorry about not putting this one on Phabricator.

Okay, fair enough. We should add an option to the CMake build to make this easy. I suppose that, from the LLVM project perspective, we'll never really advocate using the build.pl system. We'll simply say it is there for the purpose of matching, to the extent possible, the builds that Intel distributes. So LGTM (and add some option to the cmake builds to match).

> 
> > For one thing, this comment explaining what is going on should
> > really be in the code too. Second, I don't understand why this is
> > desirable, can you please explain.
> Ok to the comment going in the code.  Basically you are right.  We
> want the threads to all be offset differently from page alignment.
> For example (assuming 64 byte cache lines):
> Thread0 offset 0 from its base page
> Thread1 offset 64 bytes from its base page
> Thread2 offset 128 bytes from its base page
> Thread3 offset 192 bytes from its base page
> Thread4 ...
> 
> It is useful for Intel(R) Many Integrated Core Architecture (lots of
> threads), but had no negative effects for regular Linux builds
> either.  The goal is to help reduce cache conflicts/cache thrashing
> of local stack data of individual threads where the local stack data
> is offset by the same amount relative to a page for each thread.
> 
> #pragma omp parallel
> {
>    // lots of threads here ~ 240 threads on MIC
>    double update_me;
>    for(...) {
>       read update_me;
>        ...
>       write update_me;
>    }
> }
> 
> If update_me is offset by the same amount for each thread's stack,
> then a possible cache thrashing condition could occur.  It's remote,
> but possible.  And again, it only really affects MIC.  Since it
> didn't hurt performance in our own testing, we left it in there for
> Linux.
> 
> > #if KMP_OS_LINUX || KMP_OS_FREEBSD
> >   if ( __kmp_stkoffset > 0 && gtid > 0 ) {
> >         padding = alloca( gtid * __kmp_stkoffset );
> >     }
> > #endif
> >
> > this padding variable is dead, and I'd hope that the compiler
> > removes it.
> Yes, it sure does... I've now confirmed that.  Somehow, the:
>             stack_size += gtid * __kmp_stkoffset;
>                 status = pthread_attr_setstacksize( & thread_attr,
>                 stack_size );
> lines triggers the offset.  I've tested the offset values by reading
> out %rsp.
> 
> I believe what the code is trying to do is perform the offset with
> the alloca rather than the setstacksize() call.
> The calling tree should look like:
> 1) [master thread] Master thread calls __kmp_create_worker()
> 2) [master thread] __kmp_create_worker() sets the stacksize and calls
> pthread_create()
> 3) pthread_create() is called with __kmp_launch_worker() as the
> function to perform
> 4) [worker thread] __kmp_launch_worker() performs alloca() to offset
> current thread's stack (or it's supposed to);
> 5) [worker thread] enter loop waiting for work, but all worker
> thread's stack sizes are supposed to be identical even though the
> offset was performed.
> In other words:
>   When thread 1 is pthread_created, its stacksize = __kmp_stksize +
>   1*64, then alloca(1*64) shortens the stacksize back to
>   kmp_stksize;
>   When thread 2 is pthread_created, its stacksize = __kmp_stksize +
>   2*64, then alloca(2*64) shortens the stacksize back to
>   kmp_stksize;
>   When thread 3 is pthread_created, its stacksize = __kmp_stksize +
>   3*64, then alloca(3*64) shortens the stacksize back to
>   kmp_stksize;
> .... and so on.

Ah, okay. Now I understand. This really needs to be well explained by the comments in the code -- the text from this e-mail would be great ;)

> 
> We need to fix this obviously.
> Whats the best way to enforce the alloca() to take place?  Is there a
> common trick to do this?

Yes, you pass the pointer to some external function. Since the compiler cannot prove that the buffer is unused, it must keep it.

 -Hal

> 
> -- Johnny
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Sunday, February 15, 2015 11:55 AM
> To: Peyton, Jonathan L
> Cc: openmp-dev at dcs-maillist2.engr.illinois.edu;
> openmp-commits at dcs-maillist2.engr.illinois.edu
> Subject: Re: Some smaller patches.
> 
> ----- Original Message -----
> > From: "Jonathan L Peyton" <jonathan.l.peyton at intel.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: openmp-dev at dcs-maillist2.engr.illinois.edu,
> > openmp-commits at dcs-maillist2.engr.illinois.edu
> > Sent: Friday, February 13, 2015 6:02:53 PM
> > Subject: Some smaller patches.
> > 
> > Hal,
> > 
> > I have some small patches here. I put the bigger ones on
> > Phabricator.
> > If you want all of these on Phabricator I’ll start doing that
> > instead.
> > 
> > 
> > 
> > 1) security_flags.patch – added some flags for security on Linux
> > and
> > Mac link stages.
> 
> Are these now going to be always on by default? Can we add a build
> flag to disable them?
> 
> P.S. Even though this patch is small, it is also a good candidate for
> Phabricator because it is hard, just from the patch itself, what
> build targets are being modified.
> 
> > 
> > 2) stack_offset.patch – changes default stack offset for threads on
> > non-Mac architectures to a CACHE_LINE. This puts threads at
> > different
> > offsets from a page during creation.
> 
> For one thing, this comment explaining what is going on should really
> be in the code too. Second, I don't understand why this is
> desirable, can you please explain.
> 
> Also, some of the uses of this variable seem questionable
> (KMP_DEFAULT_STKOFFSET is used to set __kmp_stkoffset), and we have
> this in z_Linux_util.c:
> 
> #if KMP_OS_LINUX || KMP_OS_FREEBSD
>     if ( __kmp_stkoffset > 0 && gtid > 0 ) {
>         padding = alloca( gtid * __kmp_stkoffset );
>     }
> #endif
> 
> this padding variable is dead, and I'd hope that the compiler removes
> it.
> 
> And then, as you imply, this is used to adjust the thread's stack
> size:
> 
>             /* Set stack size for this thread now. */
>             stack_size += gtid * __kmp_stkoffset; ...
>                 status = pthread_attr_setstacksize( & thread_attr,
>                 stack_size );
> 
> any while I understand that this might also affect the starting
> offset of subsequent thread's stacks, I don't see what would be true
> (I assume that the OS will create each thread's stack so that the
> start of the stack is really page aligned). Again, why you're
> multiplying by gtid is not explained. Are you trying to reduce false
> sharing?
> 
> In short, this needs more comments (at least).
> 
> > 
> > 3) omp_flush_fix.patch – removes unused varargs from #pragma omp
> > flush
> > api function.
> > 
> 
> LGTM.
> 
> Thanks again,
> Hal
> 
> > 
> > 
> > -- Johnny
> > 
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory