[Openmp-commits] Some smaller patches.

Tue Feb 17 14:43:25 PST 2015

> Are these now going to be always on by default? Can we add a build flag to disable them?
They are on by default in the build.pl (Makefile) build system.  I left them out of the CMake build system because I figured people could add them in if they wanted to via CFLAGS, etc.
There isn't a really convenient way to add build flags to the build.pl system because it has no configuration stage.  So I honestly don't know what to do about it.
Also, sorry about not putting this one on Phabricator.

> For one thing, this comment explaining what is going on should really be in the code too. Second, I don't understand why this is desirable, can you please explain.
Ok to the comment going in the code.  Basically you are right.  We want the threads to all be offset differently from page alignment. For example (assuming 64 byte cache lines):
Thread0 offset 0 from its base page
Thread1 offset 64 bytes from its base page
Thread2 offset 128 bytes from its base page
Thread3 offset 192 bytes from its base page
Thread4 ...

It is useful for Intel(R) Many Integrated Core Architecture (lots of threads), but it had no negative effects for regular Linux builds either.  The goal is to reduce cache conflicts/cache thrashing on the local stack data of individual threads, which can happen when that data sits at the same offset relative to a page in every thread.
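
A minimal sketch of the offset scheme described above, assuming 64-byte cache lines and 4096-byte pages. The names SKETCH_CACHE_LINE, SKETCH_PAGE_BYTES, and stack_offset_in_page are hypothetical and are not taken from the runtime:

```c
#include <stddef.h>

/* Assumed values, for illustration only. */
enum { SKETCH_CACHE_LINE = 64, SKETCH_PAGE_BYTES = 4096 };

/* Offset of thread gtid's stack start from its base page, in bytes.
   Each thread is shifted by one more cache line than the previous one;
   the offset wraps once gtid * SKETCH_CACHE_LINE reaches a full page. */
size_t stack_offset_in_page(int gtid)
{
    return ((size_t)gtid * SKETCH_CACHE_LINE) % SKETCH_PAGE_BYTES;
}
```

With these assumed sizes there are 4096 / 64 = 64 distinct offsets, so thread 64 lands on the same page offset as thread 0.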

#pragma omp parallel
{
   // lots of threads here ~ 240 threads on MIC
   double update_me;
   for(...) {
      read update_me;
       ...
      write update_me;
   }
}

If update_me is offset by the same amount for each thread's stack, then a possible cache thrashing condition could occur.  It's remote, but possible.  And again, it only really affects MIC.  Since it didn't hurt performance in our own testing, we left it in there for Linux.
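
To make the conflict concrete, here is a rough sketch assuming a hypothetical cache with 64-byte lines and 64 sets, so that the set index is determined entirely by the offset within a 4096-byte page. The cache geometry and the names below are assumptions for illustration, not a description of any particular MIC cache:

```c
#include <stdint.h>

/* Assumed cache geometry, for illustration only. */
enum { SKETCH_LINE_BYTES = 64, SKETCH_NUM_SETS = 64 };

/* Cache set that an address maps to under the assumed geometry. */
unsigned cache_set_index(uintptr_t addr)
{
    return (unsigned)((addr / SKETCH_LINE_BYTES) % SKETCH_NUM_SETS);
}
```

Under these assumptions, two page-aligned stacks that hold update_me at the same page offset give the same set index, while shifting one stack by a cache line moves its copy into the next set.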

> #if KMP_OS_LINUX || KMP_OS_FREEBSD
>   if ( __kmp_stkoffset > 0 && gtid > 0 ) {
>         padding = alloca( gtid * __kmp_stkoffset );
>     }
> #endif
>
> this padding variable is dead, and I'd hope that the compiler removes it.
Yes, it sure does... I've now confirmed that.  Somehow, the:
            stack_size += gtid * __kmp_stkoffset;
                status = pthread_attr_setstacksize( & thread_attr, stack_size );
lines trigger the offset.  I've tested the offset values by reading out %rsp.

I believe what the code is trying to do is perform the offset with the alloca rather than the setstacksize() call.
The calling tree should look like:
1) [master thread] Master thread calls __kmp_create_worker()
2) [master thread] __kmp_create_worker() sets the stacksize and calls pthread_create()
3) pthread_create() is called with __kmp_launch_worker() as the function to perform
4) [worker thread] __kmp_launch_worker() performs alloca() to offset current thread's stack (or it's supposed to);
5) [worker thread] enters a loop waiting for work; all worker threads' stack sizes are supposed to be identical even though the offset was performed.
In other words:
  When thread 1 is pthread_created, its stacksize = __kmp_stksize + 1*64, then alloca(1*64) shortens the stacksize back to __kmp_stksize;
  When thread 2 is pthread_created, its stacksize = __kmp_stksize + 2*64, then alloca(2*64) shortens the stacksize back to __kmp_stksize;
  When thread 3 is pthread_created, its stacksize = __kmp_stksize + 3*64, then alloca(3*64) shortens the stacksize back to __kmp_stksize;
.... and so on.
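
The bookkeeping above can be sketched as plain arithmetic. The function names are hypothetical; stksize and stkoffset stand in for __kmp_stksize and __kmp_stkoffset:

```c
#include <stddef.h>

/* Stack size requested for thread gtid before pthread_create(). */
size_t requested_stack_size(size_t stksize, size_t stkoffset, int gtid)
{
    return stksize + (size_t)gtid * stkoffset;
}

/* Stack left for the worker once alloca(gtid * stkoffset) has run. */
size_t usable_stack_size(size_t stksize, size_t stkoffset, int gtid)
{
    size_t padding = (size_t)gtid * stkoffset;
    return requested_stack_size(stksize, stkoffset, gtid) - padding;
}
```

For any gtid the usable size comes back to stksize, which is the intent of pairing the larger setstacksize() request with the alloca().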

We obviously need to fix this.
What's the best way to force the alloca() to take place?  Is there a common trick to do this?
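
One common approach, sketched here under assumptions, is to store the alloca() result into a volatile pointer so the compiler cannot treat the allocation as dead. pad_worker_stack is a hypothetical stand-in, not the runtime's code, and whether this is sufficient should be checked against the generated assembly:

```c
#include <alloca.h>   /* glibc; on some systems alloca() is declared in <stdlib.h> */
#include <stddef.h>

/* Hypothetical stand-in for the padding step in the worker's entry function.
   Returns 1 if padding was allocated, 0 otherwise. */
int pad_worker_stack(int gtid, size_t stkoffset)
{
    /* The pointer itself is volatile, so the store below is an observable
       side effect and the compiler has to produce the alloca() result. */
    void *volatile padding = NULL;

    if (stkoffset > 0 && gtid > 0) {
        padding = alloca((size_t)gtid * stkoffset);
    }
    /* In a real worker, the thread's work loop would run here, while the
       padding is still on the stack; it is released when this function
       returns. */
    return padding != NULL;
}
```

The alloca() has to live in the worker's own entry function for the shift to persist, since the space is given back as soon as the calling function returns.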

-- Johnny

-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov] 
Sent: Sunday, February 15, 2015 11:55 AM
To: Peyton, Jonathan L
Cc: openmp-dev at dcs-maillist2.engr.illinois.edu; openmp-commits at dcs-maillist2.engr.illinois.edu
Subject: Re: Some smaller patches.

----- Original Message -----
> From: "Jonathan L Peyton" <jonathan.l.peyton at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: openmp-dev at dcs-maillist2.engr.illinois.edu, 
> openmp-commits at dcs-maillist2.engr.illinois.edu
> Sent: Friday, February 13, 2015 6:02:53 PM
> Subject: Some smaller patches.
> 
> Hal,
> 
> I have some small patches here. I put the bigger ones on Phabricator.
> If you want all of these on Phabricator I’ll start doing that instead.
> 
> 
> 
> 1) security_flags.patch – added some flags for security on Linux and 
> Mac link stages.

Are these now going to be always on by default? Can we add a build flag to disable them?

P.S. Even though this patch is small, it is also a good candidate for Phabricator because it is hard to tell, just from the patch itself, which build targets are being modified.

> 
> 2) stack_offset.patch – changes default stack offset for threads on 
> non-Mac architectures to a CACHE_LINE. This puts threads at different 
> offsets from a page during creation.

For one thing, this comment explaining what is going on should really be in the code too. Second, I don't understand why this is desirable, can you please explain.

Also, some of the uses of this variable seem questionable (KMP_DEFAULT_STKOFFSET is used to set __kmp_stkoffset), and we have this in z_Linux_util.c:

#if KMP_OS_LINUX || KMP_OS_FREEBSD
    if ( __kmp_stkoffset > 0 && gtid > 0 ) {
        padding = alloca( gtid * __kmp_stkoffset );
    }
#endif

this padding variable is dead, and I'd hope that the compiler removes it.

And then, as you imply, this is used to adjust the thread's stack size:

            /* Set stack size for this thread now. */
            stack_size += gtid * __kmp_stkoffset; ...
                status = pthread_attr_setstacksize( & thread_attr, stack_size );

and while I understand that this might also affect the starting offset of subsequent threads' stacks, I don't see why that would be true (I assume that the OS will create each thread's stack so that the start of the stack is really page aligned). Again, why you're multiplying by gtid is not explained. Are you trying to reduce false sharing?

In short, this needs more comments (at least).

> 
> 3) omp_flush_fix.patch – removes unused varargs from #pragma omp flush 
> api function.
> 

LGTM.

Thanks again,
Hal

> 
> 
> -- Johnny
> 
> 

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory