[cfe-dev] [libc++] Working on the parallel STL algorithms

Hal Finkel via cfe-dev cfe-dev at lists.llvm.org
Tue May 16 11:17:06 PDT 2017


On 05/16/2017 11:57 AM, C Bergström wrote:
>
>
> On Wed, May 17, 2017 at 12:20 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>     On 05/16/2017 02:54 AM, C Bergström wrote:
>>
>>
>>     On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev
>>     <cfe-dev at lists.llvm.org> wrote:
>>
>>         Hi, Erik,
>>
>>         That's great!
>>
>>         Gor, Marshall, and I discussed this after some past committee
>>         meeting. We wanted to architect the implementation so that we
>>         could provide different underlying concurrency mechanisms;
>>         including:
>>
>>            a. A self-contained thread-pool-based implementation using
>>         a work-stealing scheme.
>>
>>            b. An implementation that wraps Grand Central Dispatch
>>         (for Mac and any other platforms providing libdispatch).
>>
>>            c. An implementation that uses OpenMP.
>>
>>
>>     Sorry to butt in, but I'm kinda curious how these will be
>>     substantially different under the hood
>
>     No need to be sorry; this is a good question. I think that there
>     are a few high-level goals here:
>
>      1. Provide a solution that works for everybody
>
>      2. Take advantage of compiler technology as appropriate
>
>      3. Provide useful interoperability. In practice: don't
>     oversubscribe the system.
>
>     The motivation for providing an implementation based on a libc++
>     thread pool is to satisfy (1). Your suggestion of using our OpenMP
>     runtime's low-level API directly is a good one. Personally, I
>     really like this idea. It does imply, however, that organizations
>     that distribute libc++ will also end up distributing libomp. If
>     libomp has matured (in the open-source sense) to the point where
>     this is a suitable solution, then we should do this. As I recall,
>     however, we still have at least several organizations that ship
>     Clang/LLVM/libc++-based toolchains that don't ship libomp, and I
>     don't know how generally comfortable people will be with this
>     dependency.
>
>
> If "people" aren't comfortable with llvm-openmp then kick it out as a 
> project. I use it and I know other projects that use it just fine. I 
> can maybe claim the title of OpenMP hater and yet I don't know any 
> legitimate reason against having this as a dependency. It's a portable 
> parallel runtime that exposes an API and works.. I hope someone does 
> speak up about specific concerns if they exist.
>
>
>     That having been said, to point (2), using the OpenMP compiler
>     directives is superior to calling the low-level API directly.
>     OpenMP directives do translate into API calls, as you point out,
>     but they also provide optimization hints to the compiler (e.g.
>     about lack of loop-carried dependencies). Over the next couple of
>     years, I expect to see a lot more in the compiler optimization
>     capabilities around OpenMP (and perhaps other parallelism)
>     directives (parallel-region fusion, etc.). OpenMP also provides a
>     standard way to access many of the relevant vectorization hints,
>     and taking advantage of this is useful for compiling with Clang
>     and also other compilers.
>
>
> If projects can't even ship the llvm-openmp runtime then I have a very 
> strong concern about bootstrap dependencies which may start relying on 
> external tools.
>
> Further, I'm not sure I understand your point here. The directives 
> wouldn't be in the end-user code, but on the STL implementation side. 
> Wouldn't that implementation stuff be fixed and an abstract layer 
> exposed to the end user? It almost sounds like you're expressing the 
> benefits of OMP here and not the parallel STL side. (Hmm.. in the 
> distance I hear.. "/premature optimization/ is the root of /all evil/".)

That's correct. The OpenMP pragmas would be an implementation detail. 
However, we'd design this so that the lambda that gets passed into the 
algorithm can be inlined into the code that has the compiler directives, 
thus reaping the benefit of OpenMP's compiler integration.
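
To make this concrete, here's a minimal sketch of the kind of backend
I have in mind. All of the names below are hypothetical, not part of
any actual libc++ design:

  // Hypothetical OpenMP-backed building block for std::for_each.
  // The point is that __f is a template parameter, so the user's
  // lambda is visible, and inlinable, at the point of the directive.
  #include <cstddef>

  template <class _RandomIt, class _Func>
  void __parallel_for_each(_RandomIt __first, _RandomIt __last, _Func __f) {
    const std::ptrdiff_t __n = __last - __first;
    // The pragma lowers to runtime calls, but it also tells the
    // compiler there are no loop-carried dependencies, so the inlined
    // body of __f can be vectorized.
    #pragma omp parallel for simd
    for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
      __f(__first[__i]);
  }

Because the lambda passed to, say, std::for_each(std::execution::par_unseq,
...) is inlined into that loop, the directive's optimization hints apply
to the user's code, not just to the library's dispatch machinery.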

>
> Once llvm OpenMP can do things like handle nested parallelism and a 
> few more advanced things properly all this might be fun (We can go 
> down a big list if anyone wants to digress)

This is why I said we might consider using taskloop ;) -- There are 
other ways of handling nesting as well (colleagues of mine work on one: 
http://www.bolt-omp.org/), but we should probably have a separate thread 
on OpenMP and nesting to discuss this aspect of things.
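
For reference, a rough sketch of the taskloop flavor (again,
hypothetical names, continuing the sketch above):

  // taskloop creates tasks rather than binding the loop to a fixed
  // team of threads, so a nested invocation adds tasks to the
  // existing pool instead of spawning a second team. A real
  // implementation would likely check omp_in_parallel() before
  // opening a new parallel region.
  #include <cstddef>

  template <class _RandomIt, class _Func>
  void __parallel_for_each_taskloop(_RandomIt __first, _RandomIt __last,
                                    _Func __f) {
    const std::ptrdiff_t __n = __last - __first;
    #pragma omp parallel
    #pragma omp single
    {
      #pragma omp taskloop
      for (std::ptrdiff_t __i = 0; __i < __n; ++__i)
        __f(__first[__i]);
    }
  }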

>
>     Regarding why you'd use GCD on Mac, and similarly why it is
>     important for many users to use OpenMP underneath, it is
>     important, to the extent possible, to use the same underlying
>     thread pool as other things in the application. This is to avoid
>     over-subscription and other issues associated with conflicting
>     threading runtimes. If parts of the application are already using
>     GCD, then we probably want to do this too (or at least not compete
>     with it). Otherwise, OpenMP's runtime is probably better ;)
>
>
> Again this detail isn't visible to the end user? We pick an 
> implementation that makes sense. If other applications use GCD and we 
> use OpenMP, and multiple thread-heavy applications are running, then 
> over-subscription would be a kernel issue, not a userland one. I don't 
> see how you can always avoid that situation and creating two 
> implementations to try kinda seems funny. btw GCD is a marketing term 
> and libdispatch is really what I'm talking about here. It's been quite 
> a while since I hands on worked with it, but I wonder how much the API 
> overlaps with similar interfaces to llvm-openmp. If the interfaces are 
> similar and the "cost" in terms of complexity is low, who cares, but I 
> don't remember that being the case. (side note: I worked on an older 
> version of libdispatch and ported it to Solaris. I also played around and 
> benchmarked OMP tasks lowering directly down to libdispatch calls 
> across multiple platforms. At the time our runtime always beat it in 
> performance. Maybe newer versions of libdispatch are better)

The detail is invisible to the user at the source-code level. Obviously 
they might notice if we're oversubscribing the system. Yes, on many 
systems the kernel can manage oversubscription, but that does not mean 
it will perform well. As I'm sure you understand, because of cache 
locality and many other effects, just running a bunch of threads and 
letting the kernel switch them is often much slower than running a 
smaller number of threads and having them pull from a task queue. There 
are exceptions worth mentioning, however, such as when the threads are 
mostly themselves blocked on I/O.
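
To illustrate the contrast, here's a toy fixed-size pool (purely for
illustration; not the proposed implementation, and a real one would
use work stealing rather than a single locked queue):

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  // N workers pull from one queue, so the kernel only ever schedules
  // N threads no matter how many tasks are submitted, and each
  // worker's cache stays warm across tasks.
  class ThreadPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;

  public:
    explicit ThreadPool(unsigned n = std::thread::hardware_concurrency()) {
      for (unsigned i = 0; i < n; ++i)
        workers_.emplace_back([this] {
          for (;;) {
            std::function<void()> task;
            {
              std::unique_lock<std::mutex> lock(m_);
              cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
              if (done_ && tasks_.empty())
                return;
              task = std::move(tasks_.front());
              tasks_.pop();
            }
            task(); // Run outside the lock.
          }
        });
    }

    void submit(std::function<void()> f) {
      {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push(std::move(f));
      }
      cv_.notify_one();
    }

    ~ThreadPool() {
      {
        std::lock_guard<std::mutex> lock(m_);
        done_ = true;
      }
      cv_.notify_all();
      for (auto &w : workers_)
        w.join();
    }
  };

Spawning one std::thread per task instead means a stack and a kernel
scheduling entity per task, and the resulting context switches shred
exactly the cache locality mentioned above.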

>
> I'm not trying to be combative, but your points just don't make 
> sense... (I take the blame and must be missing something)
> -----------------
> All this aside - I'm happy to help if needed - GPU (NVIDIA or AMD) and 
> or llvm-openmp direct runtime api implementation. I've been involved 
> with sorta similar projects (C++AMP) and based on that experience may 
> be able to help avoid some gotchas.

Sounds great.

  -Hal

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
