[Openmp-commits] Some smaller patches.

Wed Feb 18 01:37:38 PST 2015

>> We need to fix this obviously.
>> Whats the best way to enforce the alloca() to take place?  Is there a 
>> common trick to do this?

> Yes, you pass the pointer to some external function. Since the compiler cannot prove that the buffer
> is unused, it must keep it.

I generally cheat worse than that and use something like this

// Ensure that there is a reference to the variable passed in which the compiler cannot discard.
static void reference(void * value)
{
    __asm__ volatile ("# reference"::"r"(value));
}

though it relies on GCC style inline asm, so won't work when compiling for Windows.
OTOH, it is immune to -ipo compilation which could, potentially, remove the call (and thus the reference).

Another possibility (maybe the best), would be to 

1) declare "padding" as int volatile *
2) cast the alloca result to (int volatile *)
3) do a *padding = 0;

The compiler should never remove a store into a volatile type, so it can't remove the reference to the alloca-ed space, and therefore can't decide the space is dead. (At least, that's the theory).

-- Jim

James Cownie <james.h.cownie at intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438

-----Original Message-----
From: openmp-commits-bounces at cs.uiuc.edu [mailto:openmp-commits-bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
Sent: Wednesday, February 18, 2015 12:13 AM
To: Peyton, Jonathan L
Cc: openmp-commits at dcs-maillist2.engr.illinois.edu; openmp-dev at dcs-maillist2.engr.illinois.edu
Subject: Re: [Openmp-commits] Some smaller patches.

----- Original Message -----
> From: "Jonathan L Peyton" <jonathan.l.peyton at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: openmp-dev at dcs-maillist2.engr.illinois.edu, 
> openmp-commits at dcs-maillist2.engr.illinois.edu
> Sent: Tuesday, February 17, 2015 4:43:25 PM
> Subject: RE: Some smaller patches.
> 
> > Are these now going to be always on by default? Can we add a build 
> > flag to disable them?
> They are on by default in the build.pl (Makefile) build system.  I 
> left them out of the CMake build system because I figured people could 
> add them in if they wanted to via CFLAGS, etc.
> There isn't a real convenient way to add build flags to the build.pl 
> system because it does not contain a configuration stage.  So I 
> honestly don't know what to do about it.
> Also, sorry about not putting this one on Phabricator.

Okay, fair enough. We should add an option to the CMake build to make this easy. I suppose that, from the LLVM project perspective, we'll never really advocate using the build.pl system. We'll simply say it is there for the purpose of matching, to the extent possible, the builds that Intel distributes. So LGTM (and add some option to the cmake builds to match).

> 
> > For one thing, this comment explaining what is going on should 
> > really be in the code too. Second, I don't understand why this is 
> > desirable, can you please explain.
> Ok to the comment going in the code.  Basically you are right.  We 
> want the threads to all be offset differently from page alignment.
> For example (assuming 64 byte cache lines):
> Thread0 offset 0 from its base page
> Thread1 offset 64 bytes from its base page
> Thread2 offset 128 bytes from its base page
> Thread3 offset 192 bytes from its base page
> Thread4 ...
> 
> It is useful for Intel(R) Many Integrated Core Architecture (lots of 
> threads), but had no negative effects for regular Linux builds either.  
> The goal is to help reduce cache conflicts/cache thrashing of local 
> stack data of individual threads where the local stack data is offset 
> by the same amount relative to a page for each thread.
> 
> #pragma omp parallel
> {
>    // lots of threads here ~ 240 threads on MIC
>    double update_me;
>    for(...) {
>       read update_me;
>        ...
>       write update_me;
>    }
> }
> 
> If update_me is offset by the same amount for each thread's stack, 
> then a possible cache thrashing condition could occur.  It's remote, 
> but possible.  And again, it only really affects MIC.  Since it didn't 
> hurt performance in our own testing, we left it in there for Linux.
> 
> > #if KMP_OS_LINUX || KMP_OS_FREEBSD
> >   if ( __kmp_stkoffset > 0 && gtid > 0 ) {
> >         padding = alloca( gtid * __kmp_stkoffset );
> >     }
> > #endif
> >
> > this padding variable is dead, and I'd hope that the compiler 
> > removes it.
> Yes, it sure does... I've now confirmed that.  Somehow, the:
>             stack_size += gtid * __kmp_stkoffset;
>                 status = pthread_attr_setstacksize( & thread_attr,
>                 stack_size );
> lines triggers the offset.  I've tested the offset values by reading 
> out %rsp.
> 
> I believe what the code is trying to do is perform the offset with the 
> alloca rather than the setstacksize() call.
> The calling tree should look like:
> 1) [master thread] Master thread calls __kmp_create_worker()
> 2) [master thread] __kmp_create_worker() sets the stacksize and calls
> pthread_create()
> 3) pthread_create() is called with __kmp_launch_worker() as the 
> function to perform
> 4) [worker thread] __kmp_launch_worker() performs alloca() to offset 
> current thread's stack (or it's supposed to);
> 5) [worker thread] enter loop waiting for work, but all worker 
> thread's stack sizes are supposed to be identical even though the 
> offset was performed.
> In other words:
>   When thread 1 is pthread_created, its stacksize = __kmp_stksize +
>   1*64, then alloca(1*64) shortens the stacksize back to
>   kmp_stksize;
>   When thread 2 is pthread_created, its stacksize = __kmp_stksize +
>   2*64, then alloca(2*64) shortens the stacksize back to
>   kmp_stksize;
>   When thread 3 is pthread_created, its stacksize = __kmp_stksize +
>   3*64, then alloca(3*64) shortens the stacksize back to
>   kmp_stksize;
> .... and so on.

Ah, okay. Now I understand. This really needs to be well explained by the comments in the code -- the text from this e-mail would be great ;)

> 
> We need to fix this obviously.
> Whats the best way to enforce the alloca() to take place?  Is there a 
> common trick to do this?

Yes, you pass the pointer to some external function. Since the compiler cannot prove that the buffer is unused, it must keep it.

 -Hal

> 
> -- Johnny
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Sunday, February 15, 2015 11:55 AM
> To: Peyton, Jonathan L
> Cc: openmp-dev at dcs-maillist2.engr.illinois.edu;
> openmp-commits at dcs-maillist2.engr.illinois.edu
> Subject: Re: Some smaller patches.
> 
> ----- Original Message -----
> > From: "Jonathan L Peyton" <jonathan.l.peyton at intel.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: openmp-dev at dcs-maillist2.engr.illinois.edu,
> > openmp-commits at dcs-maillist2.engr.illinois.edu
> > Sent: Friday, February 13, 2015 6:02:53 PM
> > Subject: Some smaller patches.
> > 
> > Hal,
> > 
> > I have some small patches here. I put the bigger ones on 
> > Phabricator.
> > If you want all of these on Phabricator I’ll start doing that 
> > instead.
> > 
> > 
> > 
> > 1) security_flags.patch – added some flags for security on Linux and 
> > Mac link stages.
> 
> Are these now going to be always on by default? Can we add a build 
> flag to disable them?
> 
> P.S. Even though this patch is small, it is also a good candidate for 
> Phabricator because it is hard, just from the patch itself, what build 
> targets are being modified.
> 
> > 
> > 2) stack_offset.patch – changes default stack offset for threads on 
> > non-Mac architectures to a CACHE_LINE. This puts threads at 
> > different offsets from a page during creation.
> 
> For one thing, this comment explaining what is going on should really 
> be in the code too. Second, I don't understand why this is desirable, 
> can you please explain.
> 
> Also, some of the uses of this variable seem questionable 
> (KMP_DEFAULT_STKOFFSET is used to set __kmp_stkoffset), and we have 
> this in z_Linux_util.c:
> 
> #if KMP_OS_LINUX || KMP_OS_FREEBSD
>     if ( __kmp_stkoffset > 0 && gtid > 0 ) {
>         padding = alloca( gtid * __kmp_stkoffset );
>     }
> #endif
> 
> this padding variable is dead, and I'd hope that the compiler removes 
> it.
> 
> And then, as you imply, this is used to adjust the thread's stack
> size:
> 
>             /* Set stack size for this thread now. */
>             stack_size += gtid * __kmp_stkoffset; ...
>                 status = pthread_attr_setstacksize( & thread_attr,
>                 stack_size );
> 
> any while I understand that this might also affect the starting offset 
> of subsequent thread's stacks, I don't see what would be true (I 
> assume that the OS will create each thread's stack so that the start 
> of the stack is really page aligned). Again, why you're multiplying by 
> gtid is not explained. Are you trying to reduce false sharing?
> 
> In short, this needs more comments (at least).
> 
> > 
> > 3) omp_flush_fix.patch – removes unused varargs from #pragma omp 
> > flush api function.
> > 
> 
> LGTM.
> 
> Thanks again,
> Hal
> 
> > 
> > 
> > -- Johnny
> > 
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

_______________________________________________
Openmp-commits mailing list
Openmp-commits at dcs-maillist2.engr.illinois.edu
http://lists.cs.uiuc.edu/mailman/listinfo/openmp-commits
---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.