[llvm-dev] PSA: Parallel STL algorithms available in LLVM

Fri May 12 01:05:20 PDT 2017

On Fri, May 12, 2017 at 3:52 PM, Bryce Lelbach via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Alright, I thought about this today and now I have a less terse response.
>
> High-level summary on nested parallelism:
>
> > On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com
> >
> > wrote:
> >>
> >> The spec doesn't seem to say anything about recursive calls to parallel
> >> functions.  In my mind that means the functions must support it, since
> it
> >> doesn't explicitly say it does not need to support it.
>
> * ^ That sounds correct. Conforming C++17 and Parallelism TS
> implementations should support nested parallel algorithms; I am fairly
> confident about this.
> * My implementation (HPX) supports nested parallel algorithms
> implicitly (due to the design of our system - we have a M:N user-space
> tasking system) - at least for our CPU-centric executors. So, I've
> never thought about this before.
> * I am concerned that nested parallel algorithms will prove to be a
> big implementation burden for GPU and accelerator architectures.
> * I don't think having some executors support this and some not
> support this makes sense. Executors represent the /where/ of
> execution; on what resources (be they software abstractions, like
> thread pools, or a concrete hardware resource) should work be
> executed. Execution policies describe the /how/ of execution; e.g.
> what parallelism is allowable and what the contract is between element
> access functions (aka users) and the parallel algorithms runtime. I
> could imagine an execution policy that specifies UB for nested
> parallelism.
> * I don't think we've discussed this on the committee in the 1.5 years
> I've been involved with the standardization of the parallel
> algorithms. That time period covers the TS process and the IS process
> - I don't remember any NB comments about this on either. I checked the
> minutes briefly, as well. Perhaps I am very wrong and it was discussed
> earlier in the process before I arrived, or I am remembering
> incorrectly; but it is entirely possible this issue was never
> considered directly.
>
> I took a look at your initial work on llvm/Support/Parallel.h - it
> looks like a good start!
>
> However, I want to gently urge caution and then take this thread on a
> bit of a tangent, because I suspect this work will start leading
> towards whatever goes into the libc++ C++17 parallel algorithms
> implementation. Even if the intention is to have a separate,
> stand-alone implementation in LLVM Support, I think that
> implementation should share some interfaces and design philosophy with
> whatever goes into libc++. I think there is a very important step that
> needs to be taken before substantial work is done on libc++'s parallel
> algorithms.
>
> Let me explain by analogy: A few years ago, my research group was
> tasked with delivering an OpenMP compatibility story for our software
> stack. We decided the best approach was for us to implement the
> interface of the Intel OpenMP runtime library because this would allow
> us to "hijack" that interface, which is targeted by multiple compilers
> (Clang, ICPC, and one DoE research compiler thingy). Much to my
> surprise, this effort actually worked pretty well, with very few of
> the nightmarish stories that you might expect when gluing multiple
> large projects together.
>
> Why was this effort successful?
>
> * The Intel OpenMP interface was fairly stable.
> * It was well documented (relative to other internal interface layers).
> * It was designed to support this "hijacking" model.
> * The interface was not overly restrictive; there were not an
> excessive number of places where we had to force square pegs through
> round holes.
>
> I feel strongly that parallel algorithms implementation should take a
> similar approach, where the parallel runtime layer is encapsulated,
> well defined, and separate from the "frontend" (aka the algorithms
> themselves). One option, of course, would be to use executors for
> this. However, executors will likely not enable the "hijacking" model
> without code changes and recompilation. Even with executors, we will
> always have the "implicit default executor" that notionally exists
> today in C++17 and the Parallelism TS. I'd really like end-users and
> people who are shipping LLVM toolchains to be able to switch the
> parallel runtime layer by simply linking against a different library
> that implements some agreed-upon interface.
>
> I've spoken with both the libc++ and libstdc++ devs at past committee
> meetings about this; I think they have had similar thoughts.
>
> The challenge is to come up with something that meet everyone's
> requirements:
>
> * Everyone will want clear encapsulation between the parallel runtime
> engine and the algorithms implementation - including a well-defined
> and fairly stable (both API and ABI stable) interface between the two.
> * The default implementation will need to work for hundreds of
> thousands of users; it has to be robust and MUST avoid inflicting any
> unnecessary runtime costs (e.g. no one should pay for what they aren't
> using).
> * High-performance users will want a very different type of
> implementation - they'll want a more aggressive runtime layer that
> will assume large amounts of parallel work will be executed and will
> try to amortize, not minimize costs.
> * Some users will need implementations with very strict real-time or
> latency constraints.
> * Some users will want to plug in special implementations for
> heterogeneous systems.
>
> At the end of the day, crazy people like me will make our runtimes
> work with whatever interface is provided. However, I think it would be
> very beneficial to try and come up with a design before major
> implementation work is done, and get feedback and guidance from
> interested parties.
>
> The potential benefit of this approach: one day in the future, we
> might have multiple different backend implementations of the parallel
> algorithms that work with multiple different compiler frameworks, all
> of which interoperate through this agreed-upon C++17 parallel
> algorithms runtime layer. I think we'll thank ourselves later for
> sketching out a concrete, overall design first.
>
> So... the question is, are others interested in this approach and will
> to help? As one of the culprits of parallel algorithms in C++17, I've
> suspected that I'd end up contributing some sweat and blood to
> libc++'s design and implementation, but it is not a one-person
> project.
>
> TL;DR - We should encapsulate the parallelism engine from the
> algorithms themselves in libc++ and stick a flexible and robust
> interface layer (distinct from executors and completely implicit)
> between them. I tentatively volunteer to collaborate with
> Marshall/Eric on this, but would need other interested parties to
> help.
>

I'd be happy to quietly follow the discussion and chime in on the GPU or
accelerator side of things if needed.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170512/57046716/attachment.html>