<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 05/16/2017 11:57 AM, C Bergström
wrote:<br>
</div>
<blockquote
cite="mid:CAOnawYoW4KLedZFT5tb6VVVDw0Yr8Lp1wk5v4uixzJ8zuefAEw@mail.gmail.com"
type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Wed, May 17, 2017 at 12:20 AM, Hal
Finkel <span dir="ltr"><<a moz-do-not-send="true"
target="_blank" href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>></span>
wrote:<br>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px
solid rgb(204,204,204);padding-left:1ex"
class="gmail_quote">
<div bgcolor="#FFFFFF"><span class="gmail-">
<div
class="gmail-m_3541403897252453532moz-cite-prefix">On
05/16/2017 02:54 AM, C Bergström wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, May 16, 2017 at
2:50 PM, Hal Finkel via cfe-dev <span
dir="ltr"><<a moz-do-not-send="true"
target="_blank"
href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>></span>
wrote:<br>
<blockquote style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex"
class="gmail_quote">
<div bgcolor="#FFFFFF">
<p>Hi, Erik,</p>
<p>That's great!<br>
</p>
<p>Gor, Marshall, and I discussed this
after some past committee meeting. We
wanted to architect the implementation
so that we could provide different
underlying concurrency mechanisms,
including:</p>
<p> a. A self-contained
thread-pool-based implementation using a
work-stealing scheme.</p>
<p> b. An implementation that wraps
Grand Central Dispatch (for Mac and any
other platforms providing libdispatch).</p>
<p> c. An implementation that uses
OpenMP.</p>
</div>
</blockquote>
<div><br>
</div>
<div>Sorry to butt in, but I'm kinda curious
how these will be substantially different
under the hood<br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
</span> No need to be sorry; this is a good question. I
think that there are a few high-level goals here:<br>
<br>
1. Provide a solution that works for everybody<br>
<br>
2. Take advantage of compiler technology as appropriate<br>
<br>
3. Provide useful interoperability. In practice: don't
oversubscribe the system.<br>
<br>
The motivation for providing an implementation based on
a libc++ thread pool is to satisfy (1). Your suggestion
of using our OpenMP runtime's low-level API directly is
a good one. Personally, I really like this idea. It does
imply, however, that organizations that distribute
libc++ will also end up distributing libomp. If libomp
has matured (in the open-source sense) to the point
where this is a suitable solution, then we should do
this. As I recall, however, we still have at least
several organizations that ship Clang/LLVM/libc++-based
toolchains that don't ship libomp, and I don't know how
generally comfortable people will be with this
dependency.<br>
</div>
</blockquote>
<div><br>
</div>
<div>If "people" aren't comfortable with llvm-openmp then
kick it out as a project. I use it and I know other
projects that use it just fine. I can maybe claim the
title of OpenMP hater and yet I don't know any legitimate
reason against having this as a dependency. It's a
portable parallel runtime that exposes an API and works..
I hope someone does speak up about specific concerns if
they exist.<br>
</div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px
solid rgb(204,204,204);padding-left:1ex"
class="gmail_quote">
<div bgcolor="#FFFFFF"> <br>
That having been said, to point (2), using the OpenMP
compiler directives is superior to calling the low-level
API directly. OpenMP directives do translate into API
calls, as you point out, but they also provide
optimization hints to the compiler (e.g. about lack of
loop-carried dependencies). Over the next couple of
years, I expect to see a lot more in the compiler
optimization capabilities around OpenMP (and perhaps
other parallelism) directives (parallel-region fusion,
etc.). OpenMP also provides a standard way to access
many of the relevant vectorization hints, and taking
advantage of this is useful for compiling with Clang and
also other compilers.<br>
</div>
</blockquote>
<div><br>
</div>
<div>If projects can't even ship the llvm-openmp runtime, then
I have a very strong concern about bootstrap dependencies
that may start relying on external tools.<br>
<br>
</div>
<div>Further, I'm not sure I understand your point here. The
directives wouldn't be in the end-user code, but would be
on the STL implementation side. Wouldn't that
implementation stuff be fixed and an abstract layer
exposed to the end user? It almost sounds like you're
expressing the benefits of OMP here and not the parallel
STL side. (Hmm.. in the distance I hear.. "<span
class="gmail-st"><em>premature optimization</em> is the
root of <em>all evil")</em></span></div>
</div>
</div>
</div>
</blockquote>
<br>
That's correct. The OpenMP pragmas would be an implementation
detail. However, we'd design this so that the lambda that gets
passed into the algorithm can be inlined into the code that has the
compiler directives, thus reaping the benefit of OpenMP's compiler
integration.<br>
<br>
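To make this concrete, here is a rough sketch of the shape I have in
mind (the name __pstl_for_each is just a placeholder, not an actual
libc++ interface): because the algorithm is a template, the user's
lambda is a template argument, so the compiler sees its body inside
the loop that carries the OpenMP pragma and can inline it there.<br>
<pre>
// Rough sketch only; __pstl_for_each is a placeholder name, not a real
// libc++ entry point. The key point is that Func is a template parameter,
// so the user's lambda body is visible to the compiler inside the
// OpenMP-annotated loop and can be inlined into it.
#include <cstddef>

template <class RandomIt, class Func>
void __pstl_for_each(RandomIt first, RandomIt last, Func f) {
  const std::ptrdiff_t n = last - first;
  // The directive is an implementation detail of the library; the user
  // never sees it, but the compiler can use it both to parallelize and
  // as a hint about the absence of loop-carried dependencies.
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; ++i)
    f(first[i]);
}

// Usage: __pstl_for_each(v.begin(), v.end(), [](int &x) { x *= 2; });
// The lambda's body ends up inlined directly into the parallel loop.
</pre>
<br>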
<blockquote
cite="mid:CAOnawYoW4KLedZFT5tb6VVVDw0Yr8Lp1wk5v4uixzJ8zuefAEw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div>Once llvm OpenMP can do things like handle nested
parallelism and a few more advanced things properly, all
this might be fun. (We can go down a big list if anyone
wants to digress.)<br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
This is why I said we might consider using taskloop ;) -- There are
other ways of handling nesting as well (colleagues of mine work on
one: <a class="moz-txt-link-freetext" href="http://www.bolt-omp.org/">http://www.bolt-omp.org/</a>), but we should probably have a
separate thread on OpenMP and nesting to discuss this aspect of
things.<br>
<br>
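For reference, the kind of thing I mean (again, only a sketch with a
placeholder name, and the clauses would need tuning): with taskloop,
the iterations become OpenMP tasks bound to the enclosing team rather
than a second team of threads, which composes much more gracefully
when the caller is itself already inside a parallel region.<br>
<pre>
// Sketch only; the function name is a placeholder and the grainsize is
// arbitrary. The iterations are carved into tasks for the existing team
// instead of trying to spawn another nested parallel region. (Called from
// serial code, you would typically wrap this in
// "#pragma omp parallel" / "#pragma omp single".)
#include <cstddef>

template <class RandomIt, class Func>
void __pstl_for_each_tasks(RandomIt first, RandomIt last, Func f) {
  const std::ptrdiff_t n = last - first;
  #pragma omp taskloop grainsize(1024)
  for (std::ptrdiff_t i = 0; i < n; ++i)
    f(first[i]);
}
</pre>
<br>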
<blockquote
cite="mid:CAOnawYoW4KLedZFT5tb6VVVDw0Yr8Lp1wk5v4uixzJ8zuefAEw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div> </div>
<blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px
solid rgb(204,204,204);padding-left:1ex"
class="gmail_quote">
<div bgcolor="#FFFFFF"> <br>
Regarding why you'd use GCD on Mac, and similarly why it
is important for many users to use OpenMP underneath: it
is important, to the extent possible, to use the same
underlying thread pool as other things in the
application. This is to avoid over-subscription and
other issues associated with conflicting threading
runtimes. If parts of the application are already using
GCD, then we probably want to use it too (or at least
not compete with it). Otherwise, OpenMP's runtime is
probably better ;)<span class="gmail-"><br>
</span></div>
</blockquote>
<div><br>
</div>
<div>Again, this detail isn't visible to the end user? We
pick an implementation that makes sense. If other
applications use GCD and we use OpenMP, and multiple
thread-heavy applications are running, over-subscription
would be a kernel issue and not a userland one. I don't see
how you can always avoid that situation, and creating two
implementations to try kinda seems funny. By the way, GCD
is a marketing term; libdispatch is really what I'm talking
about here. It's been quite a while since I worked hands-on
with it, but I wonder how much the API overlaps with
similar interfaces to llvm-openmp. If the interfaces are
similar and the "cost" in terms of complexity is low, who
cares, but I don't remember that being the case. (Side
note: I worked on an older version of libdispatch and
ported it to Solaris. I also played around and benchmarked
OMP tasks lowering directly down to libdispatch calls
across multiple platforms. At the time our runtime always
beat it in performance. Maybe newer versions of
libdispatch are better.)<br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
The detail is invisible to the user at the source-code level.
Obviously they might notice if we're oversubscribing the system.
Yes, on many systems the kernel can manage oversubscription, but
that does not mean it will perform well. As I'm sure you understand,
because of cache locality and many other effects, just running a
bunch of threads and letting the kernel switch them is often much
slower than running a smaller number of threads and having them pull
from a task queue. There are exceptions worth mentioning, however,
such as when the threads themselves are mostly blocked on I/O.<br>
<br>
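To illustrate the difference with a toy example (this is not the
proposed libc++ implementation, just the general pattern): a fixed
number of workers pulling tasks from a shared queue keeps roughly one
software thread per hardware thread, instead of handing the kernel a
large pile of threads to time-slice.<br>
<pre>
// Toy illustration of the "small number of threads pulling from a task
// queue" pattern; not the proposed libc++ implementation.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class FixedPool {
public:
  explicit FixedPool(unsigned n = std::thread::hardware_concurrency()) {
    if (n == 0) n = 1;
    for (unsigned i = 0; i < n; ++i)
      workers_.emplace_back([this] { run(); });
  }
  ~FixedPool() {
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_all();
    for (auto &t : workers_) t.join();
  }
  void submit(std::function<void()> task) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(task)); }
    cv_.notify_one();
  }
private:
  void run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !q_.empty(); });
        if (done_ && q_.empty()) return;
        task = std::move(q_.front());
        q_.pop();
      }
      task(); // Work runs on one of a bounded set of threads.
    }
  }
  std::vector<std::thread> workers_;
  std::queue<std::function<void()>> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool done_ = false;
};
</pre>
<br>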
<blockquote
cite="mid:CAOnawYoW4KLedZFT5tb6VVVDw0Yr8Lp1wk5v4uixzJ8zuefAEw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div>I'm not trying to be combative, but your points just
don't make sense... (I take the blame and must be
missing something.)<br>
-----------------<br>
</div>
<div>All this aside, I'm happy to help if needed with GPU
(NVIDIA or AMD) and/or llvm-openmp direct runtime API
implementation. I've been involved with sorta similar
projects (C++AMP) and, based on that experience, may be
able to help avoid some gotchas.<br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
Sounds great.<br>
<br>
-Hal<br>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>