<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 12, 2017 at 3:52 PM, Bryce Lelbach via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Alright, I thought about this today and now I have a less terse response.<br>

<br>

High-level summary on nested parallelism:<br>

<span class=""><br>

> On Wed, May 10, 2017 at 8:08 PM Scott Smith <<a href="mailto:scott.smith@purestorage.com">scott.smith@purestorage.com</a>><br>

> wrote:<br>

>><br>

>> The spec doesn't seem to say anything about recursive calls to parallel<br>

>> functions.  In my mind that means the functions must support it, since it<br>

>> doesn't explicitly say it does not need to support it.<br>

<br>

</span>* ^ That sounds correct. Conforming C++17 and Parallelism TS<br>

implementations should support nested parallel algorithms; I am fairly<br>

confident about this.<br>

* My implementation (HPX) supports nested parallel algorithms<br>

implicitly (due to the design of our system - we have a M:N user-space<br>

tasking system) - at least for our CPU-centric executors. So, I've<br>

never thought about this before.<br>

* I am concerned that nested parallel algorithms will prove to be a<br>

big implementation burden for GPU and accelerator architectures.<br>

* I don't think having some executors support this and some not<br>

support this makes sense. Executors represent the /where/ of<br>

execution; on what resources (be they software abstractions, like<br>

thread pools, or a concrete hardware resource) should work be<br>

executed. Execution policies describe the /how/ of execution; e.g.<br>

what parallelism is allowable and what the contract is between element<br>

access functions (aka users) and the parallel algorithms runtime. I<br>

could imagine an execution policy that specifies UB for nested<br>

parallelism.<br>

* I don't think we've discussed this on the committee in the 1.5 years<br>

I've been involved with the standardization of the parallel<br>

algorithms. That time period covers the TS process and the IS process<br>

- I don't remember any NB comments about this on either. I checked the<br>

minutes briefly, as well. Perhaps I am very wrong and it was discussed<br>

earlier in the process before I arrived, or I am remembering<br>

incorrectly; but it is entirely possible this issue was never<br>

considered directly.<br>

<br>

I took a look at your initial work on llvm/Support/Parallel.h - it<br>

looks like a good start!<br>

<br>

However, I want to gently urge caution and then take this thread on a<br>

bit of a tangent, because I suspect this work will start leading<br>

towards whatever goes into the libc++ C++17 parallel algorithms<br>

implementation. Even if the intention is to have a separate,<br>

stand-alone implementation in LLVM Support, I think that<br>

implementation should share some interfaces and design philosophy with<br>

whatever goes into libc++. I think there is a very important step that<br>

needs to be taken before substantial work is done on libc++'s parallel<br>

algorithms.<br>

<br>

Let me explain by analogy: A few years ago, my research group was<br>

tasked with delivering an OpenMP compatibility story for our software<br>

stack. We decided the best approach was for us to implement the<br>

interface of the Intel OpenMP runtime library because this would allow<br>

us to "hijack" that interface, which is targeted by multiple compilers<br>

(Clang, ICPC, and one DoE research compiler thingy). Much to my<br>

surprise, this effort actually worked pretty well, with very few of<br>

the nightmarish stories that you might expect when gluing multiple<br>

large projects together.<br>

<br>

Why was this effort successful?<br>

<br>

* The Intel OpenMP interface was fairly stable.<br>

* It was well documented (relative to other internal interface layers).<br>

* It was designed to support this "hijacking" model.<br>

* The interface was not overly restrictive; there were not an<br>

excessive number of places where we had to force square pegs through<br>

round holes.<br>

<br>

I feel strongly that parallel algorithms implementation should take a<br>

similar approach, where the parallel runtime layer is encapsulated,<br>

well defined, and separate from the "frontend" (aka the algorithms<br>

themselves). One option, of course, would be to use executors for<br>

this. However, executors will likely not enable the "hijacking" model<br>

without code changes and recompilation. Even with executors, we will<br>

always have the "implicit default executor" that notionally exists<br>

today in C++17 and the Parallelism TS. I'd really like end-users and<br>

people who are shipping LLVM toolchains to be able to switch the<br>

parallel runtime layer by simply linking against a different library<br>

that implements some agreed-upon interface.<br>

<br>

I've spoken with both the libc++ and libstdc++ devs at past committee<br>

meetings about this; I think they have had similar thoughts.<br>

<br>

The challenge is to come up with something that meet everyone's requirements:<br>

<br>

* Everyone will want clear encapsulation between the parallel runtime<br>

engine and the algorithms implementation - including a well-defined<br>

and fairly stable (both API and ABI stable) interface between the two.<br>

* The default implementation will need to work for hundreds of<br>

thousands of users; it has to be robust and MUST avoid inflicting any<br>

unnecessary runtime costs (e.g. no one should pay for what they aren't<br>

using).<br>

* High-performance users will want a very different type of<br>

implementation - they'll want a more aggressive runtime layer that<br>

will assume large amounts of parallel work will be executed and will<br>

try to amortize, not minimize costs.<br>

* Some users will need implementations with very strict real-time or<br>

latency constraints.<br>

* Some users will want to plug in special implementations for<br>

heterogeneous systems.<br>

<br>

At the end of the day, crazy people like me will make our runtimes<br>

work with whatever interface is provided. However, I think it would be<br>

very beneficial to try and come up with a design before major<br>

implementation work is done, and get feedback and guidance from<br>

interested parties.<br>

<br>

The potential benefit of this approach: one day in the future, we<br>

might have multiple different backend implementations of the parallel<br>

algorithms that work with multiple different compiler frameworks, all<br>

of which interoperate through this agreed-upon C++17 parallel<br>

algorithms runtime layer. I think we'll thank ourselves later for<br>

sketching out a concrete, overall design first.<br>

<br>

So... the question is, are others interested in this approach and will<br>

to help? As one of the culprits of parallel algorithms in C++17, I've<br>

suspected that I'd end up contributing some sweat and blood to<br>

libc++'s design and implementation, but it is not a one-person<br>

project.<br>

<br>

TL;DR - We should encapsulate the parallelism engine from the<br>

algorithms themselves in libc++ and stick a flexible and robust<br>

interface layer (distinct from executors and completely implicit)<br>

between them. I tentatively volunteer to collaborate with<br>

Marshall/Eric on this, but would need other interested parties to<br>

help.<br></blockquote><div><br></div><div>I'd be happy to quietly follow the discussion and chime in on the GPU or accelerator side of things if needed.<br></div><div><br></div></div></div></div>