[cfe-dev] [libc++] Working on the parallel STL algorithms
Hal Finkel via cfe-dev
cfe-dev at lists.llvm.org
Tue May 16 09:20:32 PDT 2017
On 05/16/2017 02:54 AM, C Bergström wrote:
> On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
> Hi, Erik,
> That's great!
> Gor, Marshall, and I discussed this after some past committee
> meeting. We wanted to architect the implementation so that we
> could provide different underlying concurrency mechanisms, including:
> a. A self-contained thread-pool-based implementation using a
> work-stealing scheme.
> b. An implementation that wraps Grand Central Dispatch (for Mac
> and any other platforms providing libdispatch).
> c. An implementation that uses OpenMP.
> Sorry to butt in, but I'm kinda curious how these will be
> substantially different under the hood
No need to be sorry; this is a good question. I think that there are a
few high-level goals here:
1. Provide a solution that works for everybody
2. Take advantage of compiler technology as appropriate
3. Provide useful interoperability. In practice: don't oversubscribe
The motivation for providing an implementation based on a libc++ thread
pool is to satisfy (1). Your suggestion of using our OpenMP runtime's
low-level API directly is a good one. Personally, I really like this
idea. It does imply, however, that organizations that distribute libc++
will also end up distributing libomp. If libomp has matured (in the
open-source sense) to the point where this is a suitable solution, then
we should do this. As I recall, however, we still have at least several
organizations that ship Clang/LLVM/libc++-based toolchains that don't
ship libomp, and I don't know how generally comfortable people will be
with this dependency.
That having been said, to point (2), using the OpenMP compiler
directives is superior to calling the low-level API directly. OpenMP
directives do translate into API calls, as you point out, but they also
provide optimization hints to the compiler (e.g., about the absence of
loop-carried dependencies). Over the next couple of years, I expect to
see a lot more in the compiler optimization capabilities around OpenMP
(and perhaps other parallelism) directives (parallel-region fusion,
etc.). OpenMP also provides a standard way to access many of the
relevant vectorization hints, and taking advantage of this is useful for
compiling with Clang and also other compilers.
Regarding why you'd use GCD on Mac, and similarly why it is important
for many users to use OpenMP underneath: to the extent possible, it is
important to use the same underlying thread pool as other things in the
application. This avoids oversubscription and other issues
associated with conflicting threading runtimes. If parts of the
application are already using GCD, then we probably want to use GCD too
(or at least not compete with it). Otherwise, OpenMP's runtime is
probably better ;)
> "OpenMP" is a pretty vague term and I'm curious what that means in
> terms of actual directives used. All non-accelerator OpenMP
> implementations lower down to threading currently. (Even if you use
> tasks it still ends up being a thread)
I had in mind basic host-level OpenMP directives (i.e. OpenMP 3 style
plus simd directives for vectorization, although using taskloop is a
good thing to consider as well). I don't think we can transparently use
OpenMP accelerator directives in their current state because we can't
identify the memory dependencies. When OpenMP grows some way to deal
with accelerators in a global address space (e.g. the new NVIDIA UVM
technology), then we should be able to use that too. In the shorter
term, CUDA+UVM will be an option here as well. Given that Clang can
function as a CUDA compiler, this is definitely worth exploring.
> GCD (libdispatch) is essentially a task based execution model, but
> again on non-OSX platforms lowers to threads. (I have a doubt that GCD
> offers any performance benefit over native threads or Intel OMP
> runtime on OSX.)
> How would the above offer any benefit over a native thread pool? Would
> you be just duplicating code which is already working?
> I'm no OMP advocate, but I'd find it significantly more sane to target
> the Intel OMP runtime API directly.
> * Production ready
> * Portable across CPU (Intel, ARM, Power8)
> * Likely provides the interface needed for parallelism
> * Single approach
> * Already part of the llvm infrastructure without external dependencies.
> I don't know how well the API will map to accelerators, but for
> something quick and easy it's likely to be the easiest.
> Bryce I think even mentioned he had used it before with positive results?
> In contrast the other approaches will loosely couple things to
> external dependencies and be more difficult to debug and support long
> term. It will introduce additional build dependencies which will
> likely add barriers to others contributing.
> I'm not writing the code and just trying to offer another pragmatic
> point of view..
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory