[cfe-dev] [libc++] Working on the parallel STL algorithms

Tue May 16 09:20:32 PDT 2017

On 05/16/2017 02:54 AM, C Bergström wrote:
>
>
> On Tue, May 16, 2017 at 2:50 PM, Hal Finkel via cfe-dev 
> <cfe-dev at lists.llvm.org> wrote:
>
>     Hi, Erik,
>
>     That's great!
>
>     Gor, Marshall, and I discussed this after some past committee
>     meeting. We wanted to architect the implementation so that we
>     could provide different underlying concurrency mechanisms, including:
>
>        a. A self-contained thread-pool-based implementation using a
>     work-stealing scheme.
>
>        b. An implementation that wraps Grand Central Dispatch (for Mac
>     and any other platforms providing libdispatch).
>
>        c. An implementation that uses OpenMP.
>
>
> Sorry to butt in, but I'm kinda curious how these will be 
> substantially different under the hood

No need to be sorry; this is a good question. I think that there are a 
few high-level goals here:

  1. Provide a solution that works for everybody

  2. Take advantage of compiler technology as appropriate

  3. Provide useful interoperability. In practice: don't oversubscribe 
the system.

The motivation for providing an implementation based on a libc++ thread 
pool is to satisfy (1). Your suggestion of using our OpenMP runtime's 
low-level API directly is a good one. Personally, I really like this 
idea. It does imply, however, that organizations that distribute libc++ 
will also end up distributing libomp. If libomp has matured (in the 
open-source sense) to the point where this is a suitable solution, then 
we should do this. As I recall, however, we still have at least several 
organizations that ship Clang/LLVM/libc++-based toolchains that don't 
ship libomp, and I don't know how generally comfortable people will be 
with this dependency.
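
To make the layering concrete, here is a minimal sketch of how an 
algorithm could be written once against a swappable backend. All names 
here (thread_backend, parallel_for, par_for_each) are hypothetical and 
are not libc++ interfaces. The backend shown uses plain std::thread 
with static chunking, not the work-stealing thread pool discussed 
above; a GCD- or OpenMP-based backend would expose the same signature.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical backend: runs body(i) for every i in [0, n).
// Assumption: static chunking over std::thread, standing in for a
// real work-stealing pool.
struct thread_backend {
    template <class Body>
    static void parallel_for(std::size_t n, Body body) {
        unsigned hw = std::thread::hardware_concurrency();
        std::size_t workers = std::min<std::size_t>(hw ? hw : 2, n);
        if (workers <= 1) {
            // Too little work (or a single core): run inline.
            for (std::size_t i = 0; i < n; ++i) body(i);
            return;
        }
        std::size_t chunk = (n + workers - 1) / workers;
        std::vector<std::thread> threads;
        for (std::size_t w = 0; w < workers; ++w) {
            std::size_t begin = w * chunk;
            std::size_t end = std::min(n, begin + chunk);
            if (begin >= end) break;
            threads.emplace_back([=] {
                for (std::size_t i = begin; i < end; ++i) body(i);
            });
        }
        for (auto& t : threads) t.join();
    }
};

// Hypothetical algorithm layer: written once, parameterized on the
// backend, so the concurrency mechanism can be chosen per platform.
template <class Backend, class T, class F>
void par_for_each(std::vector<T>& v, F f) {
    Backend::parallel_for(v.size(),
                          [&v, f](std::size_t i) { f(v[i]); });
}
```

A call such as par_for_each<thread_backend>(v, f) would then pick the 
self-contained backend; the algorithm code itself would not change if a 
different backend type were substituted.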

That having been said, to point (2), using the OpenMP compiler 
directives is superior to calling the low-level API directly. OpenMP 
directives do translate into API calls, as you point out, but they also 
provide optimization hints to the compiler (e.g. about the lack of 
loop-carried dependencies). Over the next couple of years, I expect to 
see the compiler gain many more optimization capabilities around OpenMP 
(and perhaps other parallelism) directives (parallel-region fusion, 
etc.). OpenMP also provides a standard way to express many of the 
relevant vectorization hints, and taking advantage of this is useful 
when compiling with Clang and with other compilers.
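
As an illustration of the directive-based approach, the sketch below 
marks a loop with the combined "parallel for simd" construct (OpenMP 
4.0). The function name axpy and its signature are made up for this 
example. The directive tells the compiler both that the iterations may 
be distributed across threads and that they may be vectorized, which is 
information a hand-written runtime call would not carry. When compiled 
without OpenMP support, the pragma is ignored and the loop runs 
serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] = a * x[i] + y[i].
// Assumes x has at least as many elements as y.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    const double* xp = x.data();
    double* yp = y.data();
    const long n = static_cast<long>(y.size());
    // Iterations are independent, so they may be split across threads
    // and vectorized within each thread.
    #pragma omp parallel for simd
    for (long i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```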

Regarding why you'd use GCD on Mac, and similarly why it is important 
for many users to use OpenMP underneath: to the extent possible, we 
should use the same underlying thread pool as the rest of the 
application. This avoids over-subscription and other issues 
associated with conflicting threading runtimes. If parts of the 
application are already using GCD, then we probably want to use it too 
(or at least not compete with it). Otherwise, OpenMP's runtime is 
probably better ;)

>
> "OpenMP" is a pretty vague term and I'm curious what that means in 
> terms of actual directives used. All non-accelerator OpenMP 
> implementations lower down to threading currently. (Even if you use 
> tasks it still ends up being a thread)

I had in mind basic host-level OpenMP directives (i.e. OpenMP 3 style 
plus simd directives for vectorization, although using taskloop is a 
good thing to consider as well). I don't think we can transparently use 
OpenMP accelerator directives in their current state because we can't 
identify the memory dependencies. When OpenMP grows some way to deal 
with accelerators in a global address space (e.g. the new NVIDIA UVM 
technology), then we should be able to use that too. CUDA+UVM will be an 
option in the shorter term here as well, however. Given that Clang can 
function as a CUDA compiler, this is definitely worth exploring.
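
For the taskloop option mentioned above, a minimal host-only sketch 
might look like the following. The function name scale_taskloop is 
hypothetical. The taskloop construct (OpenMP 4.5) must be encountered 
by a single thread inside a parallel region, hence the nested 
parallel/single pair; the runtime then splits the loop into tasks that 
the team's threads pick up. Without OpenMP support the pragmas are 
ignored and the loop runs serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: multiply every element of v by a.
void scale_taskloop(std::vector<double>& v, double a) {
    double* p = v.data();
    const long n = static_cast<long>(v.size());
    // One thread creates the tasks; the whole team executes them.
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop
    for (long i = 0; i < n; ++i)
        p[i] *= a;
}
```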

Thanks again,
Hal

>
> GCD (libdispatch) is essentially a task based execution model, but 
> again on non-OSX platforms lowers to threads. (I have a doubt that GCD 
> offers any performance benefit over native threads or Intel OMP 
> runtime on OSX.)
>
> How would the above offer any benefit over a native thread pool? Would 
> you be just duplicating code which is already working?
> --------------
> I'm no OMP advocate, but I'd find it significantly more sane to target 
> the Intel OMP runtime API directly.
> * Production ready
> * Portable across CPU (Intel, ARM, Power8)
> * Likely provides the interface needed for parallelism
> * Single approach
> * Already part of the llvm infrastructure without external dependencies.
>
> I don't know how well the API will map to accelerators, but for 
> something quick and easy it's likely to be the easiest.
>
> Bryce I think even mentioned he had used it before with positive results?
>
> In contrast the other approaches will loosely couple things to 
> external dependencies and be more difficult to debug and support long 
> term. It will introduce additional build dependencies which will 
> likely add barriers to others contributing.
>
> I'm not writing the code and just trying to offer another pragmatic 
> point of view..
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
