[Openmp-dev] [cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Tue Mar 29 09:16:41 PDT 2016

Sergos,

> Do you plan to introduce a cross-task dependence, as is done in other
> models? Say you started a data copy in one stream and want to start two
> computation tasks dependent on that data in parallel. Can you assign these
> tasks to different streams? How can you ensure the data is copied before
> the tasks are started?

Cross-stream dependencies would have to be handled on the host side. For
example, suppose we have streams s1 and s2, data transfer task d1, and
compute tasks t1 and t2 that use the data from d1. The host would enqueue
d1 on s1 and then create an "event" e1 (similar to a CUDA runtime library
event) and enqueue e1 followed by t1 on s1. Then the host would synchronize
on the completion of e1 and enqueue t2 on s2 when e1 completed.
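
To make the ordering concrete, here is a minimal sketch of that sequence in plain C++. It does not use the real SE API: the Stream class, RecordEvent, and RunCrossStreamExample are hypothetical stand-ins (a stream is modeled as a FIFO work queue drained by one host thread, and an event as a future), and the "device data" is just a host vector.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a stream: tasks run in FIFO order on one thread.
class Stream {
public:
  Stream() : worker_([this] { Run(); }) {}
  ~Stream() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // Remaining tasks are drained before the thread exits.
  }
  void Enqueue(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }
  // Enqueues a marker; the returned future becomes ready once every task
  // enqueued on this stream before the marker has finished.
  std::shared_future<void> RecordEvent() {
    auto promise = std::make_shared<std::promise<void>>();
    std::shared_future<void> event = promise->get_future().share();
    Enqueue([promise] { promise->set_value(); });
    return event;
  }

private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // Shut down only when the queue is empty.
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;  // Declared last so the other members exist first.
};

// Returns {result of t1, result of t2}.
std::pair<int, int> RunCrossStreamExample() {
  std::vector<int> data;  // Stands in for device memory filled by d1.
  int sum = 0, product = 0;
  {
    Stream s1, s2;
    s1.Enqueue([&] { data = {1, 2, 3, 4}; });        // d1 on s1
    std::shared_future<void> e1 = s1.RecordEvent();  // e1 on s1, after d1
    s1.Enqueue([&] {                                 // t1 on s1, after e1
      assert(!data.empty());
      sum = std::accumulate(data.begin(), data.end(), 0);
    });
    e1.wait();                                       // Host synchronizes on e1.
    s2.Enqueue([&] {                                 // t2 on s2, only after d1
      product = std::accumulate(data.begin(), data.end(), 1,
                                std::multiplies<int>());
    });
  }  // Both streams finish their queued work when they are destroyed.
  return {sum, product};
}
```

Calling RunCrossStreamExample() from main should give the pair (10, 24). The point of the sketch is that t1 needs no extra synchronization because it shares s1 with d1, while t2 is kept off s2 until the host has seen e1 complete. On some toolchains the build may need -pthread.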

> So, introducing clang emission of SE calls should remove the need for the
> file interfaces?

The file interfaces will still be used for the less commonly used
"standalone" SE mode.

> Also, I have a concern that for the compiler it would be reasonable to
> introduce a plain C interface, since a user can use SE in a non-C++ program.
> At least it might be required in order to reuse SE in other parallel models.

SE is meant as a C++-only interface, so we never intended to make a C
interface for it, but we would definitely be open to a modern C interface
living alongside the C++ interface.

> As for being GPU-centric: is it a decision to follow in the future, or do
> you envision some uniform device support? Say, if a device doesn't have a
> two-level 'block and dim' hierarchy as in your interface, what should be in
> the plugin implementation then?

Right now the kernel launch method is the part of the plugin interface that
explicitly references the thread block hierarchy. I could see this part of
the interface changing in the future to accept a more generic "thread team
dimensions" argument in order to support other platforms. I'm not sure at
this stage what that would look like, but it would be useful to hear what
developers of potential SE platforms think.

-Jason

On Tue, Mar 29, 2016 at 7:08 AM Sergey Ostanevich <sergos.gnu at gmail.com>
wrote:

> Jason,
>
>
>> If I understand your interpretation of streams, it does not match my
>> understanding. SE follows the CUDA meaning of "stream". I think of a stream
>> as a "work queue" and each device can have several active streams. Memory
>> space on the device does not belong to any stream, so any stream can access
>> it. The thing that does belong to the stream is the "task" of copying the
>> data from one place to another (or other tasks such as running a kernel).
>>
>
> Do you plan to introduce a cross-task dependence, as is done in other
> models? Say you started a data copy in one stream and want to start two
> computation tasks dependent on that data in parallel. Can you assign these
> tasks to different streams? How can you ensure the data is copied before
> the tasks are started?
>
>
>> Yes, I think the in-memory model is much nicer, but requires compiler
>> support. SE has modes with and without compiler support and so it can
>> handle storing kernels in files as well as in memory. You are right that
>> using files requires users to change build files; that's part of the reason
>> we want clang to be able to emit SE calls. That way the kernel can be
>> stored in memory and the user won't have to think much about it.
>>
>
> So, introducing clang emission of SE calls should remove the need for the
> file interfaces?
> Also, I have a concern that for the compiler it would be reasonable to
> introduce a plain C interface, since a user can use SE in a non-C++ program.
> At least it might be required in order to reuse SE in other parallel models.
>
> As for being GPU-centric: is it a decision to follow in the future, or do
> you envision some uniform device support? Say, if a device doesn't have a
> two-level 'block and dim' hierarchy as in your interface, what should be in
> the plugin implementation then?
>
> Sergos
>
> -Jason
>>
>> On Mon, Mar 28, 2016 at 2:31 PM Sergey Ostanevich <sergos.gnu at gmail.com>
>> wrote:
>>
>>> Jason,
>>>
>>> Did I get it right that the SE interfaces are bound to the stream that is
>>> passed as an argument? As far as I can see, the stream is an abstraction of
>>> the target, so is a data transfer for a particular stream limited to that
>>> stream? In the libomptarget implementation, data once offloaded can be
>>> reused in all offload entries without additional data transfer. Is that
>>> possible in the SE approach?
>>>
>>> Regarding storing kernels in memory or in files: the design was
>>> originally to provide offload entries within the same object file as the
>>> host code. This is intended to ease adoption of the heterogeneous approach:
>>> there should be no changes to build scripts. The resulting
>>> executable/library obtained from the build should be self-contained, and
>>> the user will have no extra problems with target objects/files being
>>> available at runtime.
>>>
>>> Sergos.
>>>
>>>
>>> On Mon, Mar 28, 2016 at 9:47 PM, Jason Henline via cfe-dev <
>>> cfe-dev at lists.llvm.org> wrote:
>>>
>>>> Alexandre,
>>>>
>>>> Thanks for shedding further light on the way OpenMP handles
>>>> dependencies between tasks. I'm sorry for leaving that out of my document;
>>>> it was just because I didn't know much about the way OpenMP handled its
>>>> workflows.
>>>>
>>>> On Mon, Mar 28, 2016 at 11:43 AM Jason Henline <jhen at google.com> wrote:
>>>>
>>>>> Hi Carlo,
>>>>>
>>>>> Thanks for helping to clarify this point about libomptarget vs
>>>>> liboffload; I have been getting confused about it myself. I think the open
>>>>> question concerns libomptarget not liboffload (others can correct me if I
>>>>> have misunderstood). My analysis from looking through the code was that
>>>>> libomptarget had some similarities with the platform support in SE, so I
>>>>> just wanted to consider how those two libraries compared. I didn't do a
>>>>> comparison with liboffload.
>>>>>
>>>>> On Mon, Mar 28, 2016 at 11:11 AM Carlo Bertolli <cbertol at us.ibm.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Reading through the comments: both Chris and Chandler referred to
>>>>>> liboffload, while I thought the subject of conversation was libomptarget
>>>>>> and SE.
>>>>>> I am being picky about names because liboffload is a library
>>>>>> available as part of omp (llvm's openmp runtime library) that, I believe,
>>>>>> only targets Intel Xeon Phi.
>>>>>>
>>>>>> Did you mean liboffload or libomptarget?
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> -- Carlo
>>>>>>
>>>>>>
>>>>>> From: Alexandre Eichenberger via Openmp-dev <
>>>>>> openmp-dev at lists.llvm.org>
>>>>>> To: jhen at google.com
>>>>>> Cc: llvm-dev at lists.llvm.org, cfe-dev at lists.llvm.org,
>>>>>> openmp-dev at lists.llvm.org
>>>>>> Date: 03/28/2016 01:44 PM
>>>>>>
>>>>>>
>>>>>> Subject: Re: [Openmp-dev] [cfe-dev] RFC: Proposing an LLVM
>>>>>> subproject for parallelism runtime and support libraries
>>>>>>
>>>>>> Sent by: "Openmp-dev" <openmp-dev-bounces at lists.llvm.org>
>>>>>> ------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Jason,
>>>>>>
>>>>>> I concur with your decision since OMP and StreamExecutor
>>>>>> fundamentally differ in how dependences between consecutive tasks are
>>>>>> expressed. OMP uses task dependences to express ordering constraints between
>>>>>> tasks that execute on the host and/or on a particular device. Obviously, a
>>>>>> stream is a DAG but with very specific constraints (one linear ordering per
>>>>>> stream), whereas DAGs generated by OMP dependences are arbitrary DAGs. This
>>>>>> is not a judgment statement, as in many ways streams are much more friendly
>>>>>> to GPUs; it is just that the OMP and StreamExecutor "language
>>>>>> experts" settled on different language expressivity/efficiency data points.
>>>>>>
>>>>>> I read your blog on the similarities and differences with great
>>>>>> interest. I may venture to add another overlooked difference: OMP maps
>>>>>> objects with reference counts (e.g. the first time an object is mapped, its
>>>>>> ref count is zero, and the alloc on device and memory copy will occur;
>>>>>> further nested maps will not generate any alloc and/or communication). In
>>>>>> summary, OMP primarily uses a dictionary of mapped variables to manage
>>>>>> allocation and data transfer, whereas StreamExecutor appears to
>>>>>> explicitly allocate and move data.
>>>>>>
>>>>>> Thanks for your work on this, much appreciated
>>>>>>
>>>>>> Alexandre
>>>>>>
>>>>>>
>>>>>> -----------------------------------------------------------------------------------------------------
>>>>>> Alexandre Eichenberger, Master Inventor, Advanced Compiler
>>>>>> Technologies
>>>>>> - research: compiler optimization (OpenMP, multithreading, SIMD)
>>>>>> - info: alexe at us.ibm.com http://www.research.ibm.com/people/a/alexe
>>>>>> - phone: 914-945-1812 (work) 914-312-3618 (cell)
>>>>>>
>>>>>>
>>>>>> ----- Original message -----
>>>>>> From: Jason Henline via Openmp-dev <openmp-dev at lists.llvm.org>
>>>>>> Sent by: "Openmp-dev" <openmp-dev-bounces at lists.llvm.org>
>>>>>> To: Andrey Bokhanko <andreybokhanko at gmail.com>, Chandler Carruth <
>>>>>> chandlerc at google.com>
>>>>>> Cc: llvm-dev <llvm-dev at lists.llvm.org>, cfe-dev <
>>>>>> cfe-dev at lists.llvm.org>, "openmp-dev at lists.llvm.org" <
>>>>>> openmp-dev at lists.llvm.org>
>>>>>> Subject: Re: [Openmp-dev] [cfe-dev] RFC: Proposing an LLVM subproject
>>>>>> for parallelism runtime and support libraries
>>>>>> Date: Mon, Mar 28, 2016 12:38 PM
>>>>>>
>>>>>> I did a more thorough read through liboffload and wrote up a more
>>>>>> detailed doc describing how StreamExecutor platforms relate to libomptarget
>>>>>> RTL interfaces. The doc also describes why the lack of support for streams
>>>>>> in libomptarget makes it impossible to implement some of the most important
>>>>>> StreamExecutor platforms in terms of libomptarget (
>>>>>> https://github.com/henline/streamexecutordoc/blob/master/se_and_openmp.rst).
>>>>>> When I was originally optimistic about using liboffload to implement
>>>>>> StreamExecutor platforms, I was not aware of this issue with streams.
>>>>>> Thanks to Carlo Bertolli for bringing this to my attention.
>>>>>>
>>>>>> After having looked in detail at the liboffload code, it sounds like
>>>>>> the best thing to do at this point is to keep StreamExecutor and liboffload
>>>>>> separate, but to leave the door open to implement future StreamExecutor
>>>>>> platforms in terms of liboffload. From the recent messages on this subject
>>>>>> from Carlo and Andrey it seems like there is a general consensus on this,
>>>>>> so I would like to move forward with the StreamExecutor project in this
>>>>>> spirit.
>>>>>>
>>>>>> On Tue, Mar 15, 2016 at 5:09 PM Jason Henline <jhen at google.com> wrote:
>>>>>>
>>>>>>    I created a GitHub repo that contains the documentation I have
>>>>>>    been creating for StreamExecutor.
>>>>>>    https://github.com/henline/streamexecutordoc
>>>>>>
>>>>>>    It contains the design docs from the original email in this
>>>>>>    thread, and it contains a new doc I just made that gives a more detailed
>>>>>>    sketch of the StreamExecutor platform plugin interface. This shows which
>>>>>>    methods must be implemented to support a new platform in StreamExecutor, or
>>>>>>    to provide a new implementation for an existing platform (e.g. using
>>>>>>    liboffload to implement the CUDA platform).
>>>>>>
>>>>>>    I wrote up this doc in response to a lot of good questions I am
>>>>>>    getting about the details of how StreamExecutor might work with the code
>>>>>>    OpenMP already has in place.
>>>>>>
>>>>>>    Best Regards,
>>>>>>    -Jason
>>>>>>
>>>>>>    On Tue, Mar 15, 2016 at 12:28 PM Andrey Bokhanko <
>>>>>>    andreybokhanko at gmail.com> wrote:
>>>>>>    Hello Chandler,
>>>>>>
>>>>>>    On Tue, Mar 15, 2016 at 1:44 PM, Chandler Carruth via Openmp-dev <
>>>>>>    openmp-dev at lists.llvm.org> wrote:
>>>>>>       It seems like if the OpenMP folks want to add a liboffload
>>>>>>       plugin to StreamExecutor, that would be an awesome additional platform, but
>>>>>>       I don't see why we need to force the coupling here.
>>>>>>
>>>>>>    Let me give you a reason: while the user-facing sides of
>>>>>>    StreamExecutor and OpenMP are quite different (and each warrants its place
>>>>>>    under the sun!), SE's internal offloading interface and liboffload are
>>>>>>    doing exactly the same thing. Why would we want to duplicate code? As
>>>>>>    previous replies demonstrated, SE can't serve OpenMP's needs, while the
>>>>>>    liboffload API seems to be general enough to serve SE well (though this
>>>>>>    has to be verified, of course -- as I understand, Jason is going to do this).
>>>>>>
>>>>>>    Sure, there is no "must have need" to couple SE and liboffload,
>>>>>>    but this sounds like a solid software engineering decision to me. Or,
>>>>>>    quoting Jason, who said this much better than me:
>>>>>>
>>>>>>    > Although OpenMP and StreamExecutor support different programming
>>>>>>    > models, some of the work they perform under the hood will likely
>>>>>>    > be very similar. By sharing code and domain expertise, both
>>>>>>    > projects will be improved and strengthened as their capabilities
>>>>>>    > are expanded. The StreamExecutor community looks forward to much
>>>>>>    > collaboration and discussion with OpenMP about the best places
>>>>>>    > and ways to cooperate.
>>>>>>
>>>>>>    I hope to see you tomorrow!
>>>>>>
>>>>>>    Yours,
>>>>>>    Andrey
>>>>>>    =====
>>>>>>    Software Engineer
>>>>>>    Intel Compiler Team
>>>>>>
>>>>>> _______________________________________________
>>>>>> Openmp-dev mailing list
>>>>>> Openmp-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> _______________________________________________
>>>> cfe-dev mailing list
>>>> cfe-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>
>>>>