[llvm-dev] PSA: Parallel STL algorithms available in LLVM

Fri May 12 00:52:30 PDT 2017

Alright, I thought about this today and now I have a less terse response.

High-level summary on nested parallelism:

> On Wed, May 10, 2017 at 8:08 PM Scott Smith <scott.smith at purestorage.com>
> wrote:
>>
>> The spec doesn't seem to say anything about recursive calls to parallel
>> functions.  In my mind that means the functions must support it, since it
>> doesn't explicitly say it does not need to support it.

* ^ That sounds correct. Conforming C++17 and Parallelism TS
implementations should support nested parallel algorithms; I am fairly
confident about this.
* My implementation (HPX) supports nested parallel algorithms
implicitly (due to the design of our system - we have a M:N user-space
tasking system) - at least for our CPU-centric executors. So, I've
never thought about this before.
* I am concerned that nested parallel algorithms will prove to be a
big implementation burden for GPU and accelerator architectures.
* I don't think having some executors support this and some not
support this makes sense. Executors represent the /where/ of
execution; on what resources (be they software abstractions, like
thread pools, or a concrete hardware resource) should work be
executed. Execution policies describe the /how/ of execution; e.g.
what parallelism is allowable and what the contract is between element
access functions (aka users) and the parallel algorithms runtime. I
could imagine an execution policy that specifies UB for nested
parallelism.
* I don't think we've discussed this on the committee in the 1.5 years
I've been involved with the standardization of the parallel
algorithms. That time period covers the TS process and the IS process
- I don't remember any NB comments about this on either. I checked the
minutes briefly, as well. Perhaps I am very wrong and it was discussed
earlier in the process before I arrived, or I am remembering
incorrectly; but it is entirely possible this issue was never
considered directly.

I took a look at your initial work on llvm/Support/Parallel.h - it
looks like a good start!

However, I want to gently urge caution and then take this thread on a
bit of a tangent, because I suspect this work will start leading
towards whatever goes into the libc++ C++17 parallel algorithms
implementation. Even if the intention is to have a separate,
stand-alone implementation in LLVM Support, I think that
implementation should share some interfaces and design philosophy with
whatever goes into libc++. I think there is a very important step that
needs to be taken before substantial work is done on libc++'s parallel
algorithms.

Let me explain by analogy: A few years ago, my research group was
tasked with delivering an OpenMP compatibility story for our software
stack. We decided the best approach was for us to implement the
interface of the Intel OpenMP runtime library because this would allow
us to "hijack" that interface, which is targeted by multiple compilers
(Clang, ICPC, and one DoE research compiler thingy). Much to my
surprise, this effort actually worked pretty well, with very few of
the nightmarish stories that you might expect when gluing multiple
large projects together.

Why was this effort successful?

* The Intel OpenMP interface was fairly stable.
* It was well documented (relative to other internal interface layers).
* It was designed to support this "hijacking" model.
* The interface was not overly restrictive; there were not an
excessive number of places where we had to force square pegs through
round holes.

I feel strongly that parallel algorithms implementation should take a
similar approach, where the parallel runtime layer is encapsulated,
well defined, and separate from the "frontend" (aka the algorithms
themselves). One option, of course, would be to use executors for
this. However, executors will likely not enable the "hijacking" model
without code changes and recompilation. Even with executors, we will
always have the "implicit default executor" that notionally exists
today in C++17 and the Parallelism TS. I'd really like end-users and
people who are shipping LLVM toolchains to be able to switch the
parallel runtime layer by simply linking against a different library
that implements some agreed-upon interface.

I've spoken with both the libc++ and libstdc++ devs at past committee
meetings about this; I think they have had similar thoughts.

The challenge is to come up with something that meet everyone's requirements:

* Everyone will want clear encapsulation between the parallel runtime
engine and the algorithms implementation - including a well-defined
and fairly stable (both API and ABI stable) interface between the two.
* The default implementation will need to work for hundreds of
thousands of users; it has to be robust and MUST avoid inflicting any
unnecessary runtime costs (e.g. no one should pay for what they aren't
using).
* High-performance users will want a very different type of
implementation - they'll want a more aggressive runtime layer that
will assume large amounts of parallel work will be executed and will
try to amortize, not minimize costs.
* Some users will need implementations with very strict real-time or
latency constraints.
* Some users will want to plug in special implementations for
heterogeneous systems.

At the end of the day, crazy people like me will make our runtimes
work with whatever interface is provided. However, I think it would be
very beneficial to try and come up with a design before major
implementation work is done, and get feedback and guidance from
interested parties.

The potential benefit of this approach: one day in the future, we
might have multiple different backend implementations of the parallel
algorithms that work with multiple different compiler frameworks, all
of which interoperate through this agreed-upon C++17 parallel
algorithms runtime layer. I think we'll thank ourselves later for
sketching out a concrete, overall design first.

So... the question is, are others interested in this approach and will
to help? As one of the culprits of parallel algorithms in C++17, I've
suspected that I'd end up contributing some sweat and blood to
libc++'s design and implementation, but it is not a one-person
project.

TL;DR - We should encapsulate the parallelism engine from the
algorithms themselves in libc++ and stick a flexible and robust
interface layer (distinct from executors and completely implicit)
between them. I tentatively volunteer to collaborate with
Marshall/Eric on this, but would need other interested parties to
help.

-- 
Bryce Adelstein Lelbach aka wash
Lawrence Berkeley National Laboratory
ISO C++ Committee Member
CppCon and C++Now Program Chair

Compiler ICE Hunter
--