[cfe-dev] RFC: clacc: translating OpenACC to OpenMP in clang

Fri Dec 8 10:02:59 PST 2017

On Fri, Dec 8, 2017 at 9:00 AM, Joel E. Denny <jdenny.ornl at gmail.com> wrote:
>
> On Fri, Dec 8, 2017 at 11:32 AM, Jeff Hammond <jeff.science at gmail.com> wrote:
>>
>>
>>
>> On Fri, Dec 8, 2017 at 7:51 AM, Joel E. Denny <jdenny.ornl at gmail.com> wrote:
>> >
>> > Hi Jeff, Hal,
>> >
>> > Thanks for your feedback.  My comments are inline below.
>> >
>> > On Tue, Dec 5, 2017 at 6:43 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> >>
>> >> On 12/05/2017 05:11 PM, Jeff Hammond via cfe-dev wrote:
>> >>
>> >> All of the usage of OpenACC outside of benchmarks/research that I know about is done in Fortran.
>> >
>> > I agree that it's easier to find real apps that use OpenACC in Fortran than those that use OpenACC in C/C++.  However, the latter certainly exist.  For example:
>>
>> Two of the three examples you cite are primarily Fortran and use OpenACC exclusively in Fortran subroutines.
>
>
> Are you saying that the occurrences of "pragma acc" in Nek5000 and NekCEM are unused?
>

The instances of "pragma acc" in those - it's the same code in both
projects - are either (1) only causing host-device data synchronization or
(2) commented out.

It's unclear to me what actually happens in the code as currently written.
The OpenACC C/C++ code does no more than copy data to/from the device.  I
didn't trace the entire code execution, but I can't tell whether any code
touches the device data that OpenACC is updating.  If it is updated, it is
updated by Fortran OpenACC code somewhere else in the source tree.
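
For concreteness, the active directives in jl/gs.c follow roughly the
pattern sketched below.  This is a minimal illustration, not code from
Nek5000: the function name exchange_on_host and the host-side loop are
hypothetical stand-ins.  The point is that the C side only synchronizes a
buffer around host work and never opens a compute region.  Compiled without
OpenACC, the pragmas are simply ignored.

```c
/* Hypothetical stand-in for an exchange step: the C code only moves data
   between host and device; any device computation would live elsewhere. */
void exchange_on_host(double *buf, int n, int acc)
{
    int i;

    /* Refresh the host copy from the device, but only when acc is nonzero. */
    #pragma acc update host(buf[0:n]) if(acc)

    /* Host-side work on the buffer (assumed here; e.g. packing for MPI). */
    for (i = 0; i < n; ++i)
        buf[i] += 1.0;

    /* Push the modified host copy back to the device, again only if acc. */
    #pragma acc update device(buf[0:n]) if(acc)
}
```

In an OpenMP port, the closest counterparts would presumably be
"#pragma omp target update from(...)" and "to(...)", which also accept an
if clause.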

What does the OpenACC standard say about interoperability of
compilers+runtimes, as would be required if one used Clang OpenACC for
C/C++ and Fortran OpenACC implemented by PGI, Cray, or GCC?  OpenMP
definitely does not support this, even if a subset of usage may work when
one uses the same runtime library with different compilers.

/tmp/Nek5000$ git grep "pragma acc"
jl/gs.c:#pragma acc update host(sendbuf[0:unit_size*bufSize/2]) if(acc)
jl/gs.c:#pragma acc update device(buf[0:unit_size*bufSize/2]) if(acc)
jl/gs.c:#pragma acc update host(sendbuf[0:unit_size*bufSize/2]) if(acc)
jl/gs.c:#pragma acc update device(buf[0:unit_size*bufSize/2]) if(acc)
jl/gs.c:#pragma acc exit data delete(map0,map1)
jl/gs.c:#pragma acc update host(buf[0:unit_size*bufSize]) if(acc)
jl/gs.c:#pragma acc update device(buf[0:unit_size*bufSize]) if(acc)
jl/gs.c:#pragma acc update host(buf[0:unit_size*bufSize]) if(acc)
jl/gs.c:#pragma acc update device(buf[0:unit_size*bufSize]) if(acc)
jl/gs.c://#pragma acc enter data copyin(stage[0].scatter_map[0:stage[0].s_size],stage[0].scatter_mapf[0:stage[0].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage[i].scatter_map[i:stage[i].s_size],stage[i].scatter_mapf[i:stage[i].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage[i].gather_map[i:stage[i].g_size],stage[i].gather_mapf[i:stage[i].g_nt])
jl/gs.c://#pragma acc enter data copyin(stage[i].scatter_map[i:stage[i].s_size],stage[i].scatter_mapf[i:stage[i].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage[i].gather_map[i:stage[i].g_size],stage[i].gather_mapf[i:stage[i].g_nt])
jl/gs.c://#pragma acc enter data copyin(stage2[0].scatter_map[0:stage2[0].s_size],stage2[0].scatter_mapf[0:stage2[0].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage2[i].scatter_map[i:stage2[i].s_size],stage2[i].scatter_mapf[i:stage2[i].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage2[i].gather_map[i:stage2[i].g_size],stage2[i].gather_mapf[i:stage2[i].g_nt])
jl/gs.c://#pragma acc enter data copyin(stage2[i].scatter_map[i:stage2[i].s_size],stage2[i].scatter_mapf[i:stage2[i].s_nt])
jl/gs.c://#pragma acc enter data copyin(stage2[i].gather_map[i:stage2[i].g_size],stage2[i].gather_mapf[i:stage2[i].g_nt])
jl/gs.c:#pragma acc exit data delete(map,mapf)
jl/gs.c:#pragma acc exit data delete(map,mapf)
jl/gs.c:#pragma acc exit data delete(map,mapf)
jl/gs.c:#pragma acc exit data delete(map,mapf)
jl/gs.c:#pragma acc update host(buf[0:vn*unit_size*bufSize]) if(acc)
jl/gs.c:#pragma acc update device(buf[0:vn*unit_size*bufSize]) if(acc)
jl/gs.c:  //#pragma acc exit data delete(ard->map_to_buf[0],ard->map_to_buf[1],ard->map_from_buf[0],ard->map_from_buf[1])
jl/gs.c:  //#pragma acc enter data copyin(ard->map_to_buf[0][0:ard->mt_size[0]],ard->map_from_buf[0][0:ard->mf_size[0]],ard->map_to_buf_f[0][0:ard->mt_nt[0]],ard->map_from_buf_f[0][0:ard->mf_nt[0]],ard->map_to_buf[1][0:ard->mt_size[1]],ard->map_from_buf[1][0:ard->mf_size[1]],ard->map_to_buf_f[1][0:ard->mt_nt[1]],ard->map_from_buf_f[1][0:ard->mf_nt[1]])
jl/gs.c:#pragma acc update host(a[0:n])
jl/gs.c:#pragma acc update host(a[0:n])
jl/gs.c:#pragma acc exit data delete(bufPtr)
jl/gs.c:#pragma acc enter data create(bufPtr[0:vn*gs_dom_size[dom]*gsh->r.buffer_size])
jl/gs.c:#pragma acc exit data delete(bufPtr)
jl/gs.c:#pragma acc enter data create(bufPtr[0:vn*gs_dom_size[dom]*gsh->r.buffer_size])
jl/gs.c:#pragma acc exit data delete(map_local0,map_local1,flagged_primaries)
jl/gs.c:#pragma acc enter data pcopyin(map[0:*m_size],mapf2[0:2*mf_temp])
jl/gs_acc.c://#pragma acc data present(buf[0:l])
jl/gs_acc.c://#pragma acc host_data use_device(buf)
jl/gs_acc.c://#pragma acc data present(buf[0:l])
jl/gs_acc.c://#pragma acc host_data use_device(buf)
jl/gs_acc.c:  //#pragma acc enter data copyin(t_mapf[0:t_m_nt*2],mapf[0:m_nt*2],snd_mapf[0:snd_m_nt*2],rcv_mapf[0:rcv_m_nt*2],fp_mapf[0:fp_m_nt*2], t_map[0:t_m_size],map[0:m_size],fp_map[0:fp_m_size],snd_map[0:snd_m_size],rcv_map[0:rcv_m_size])
jl/gs_acc.c:  //#pragma acc enter data copyin(t_mapf[0:t_m_nt*2],mapf[0:m_nt*2],snd_mapf[0:snd_m_nt*2],rcv_mapf[0:rcv_m_nt*2],fp_mapf[0:fp_m_nt*2], t_map[0:t_m_size],map[0:m_size],fp_map[0:fp_m_size],snd_map[0:snd_m_size],rcv_map[0:rcv_m_size])
jl/gs_acc.c://#pragma acc enter data pcopyin(t_mapf[0:t_m_nt*2],mapf[0:m_nt*2],snd_mapf[0:snd_m_nt*2],rcv_mapf[0:rcv_m_nt*2],fp_mapf[0:fp_m_nt*2], t_map[0:t_m_size],map[0:m_size],fp_map[0:fp_m_size],snd_map[0:snd_m_size],rcv_map[0:rcv_m_size])
jl/gs_acc.c://#pragma acc data present(u[0:uds],mapf[0:m_nt*2],snd_mapf[0:snd_m_nt*2],rcv_mapf[0:rcv_m_nt*2],fp_mapf[0:fp_m_nt*2],t_map[0:t_m_size],map[0:m_size],fp_map[0:fp_m_size],snd_map[0:snd_m_size],rcv_map[0:rcv_m_size])
jl/gs_acc.c://#pragma acc data create(sbuf[0:bl],rbuf[0:bl]) if(bl!=0)
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],map[0:m_size],mapf[0:m_nt*2]) private(i,j,t) async(k+1)
jl/gs_acc.c://#pragma acc loop seq
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],fp_map[0:fp_m_size],fp_mapf[0:fp_m_nt*2]) private(i,j) async(k+1)
jl/gs_acc.c://#pragma acc loop seq
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],fp_map[0:fp_m_size]) private(i,k)
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],snd_map[0:snd_m_size],snd_mapf[0:snd_m_nt*2],sbuf[0:bl]) private(i,j,t) async(k+1)
jl/gs_acc.c://#pragma acc loop seq
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],snd_map[0:snd_m_size],sbuf[0:bl]) private(i,j,k)
jl/gs_acc.c://#pragma acc update host(sbuf[0:bl]) async(vn+2)
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c://#pragma acc update device(rbuf[0:bl]) async(vn+2)
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],rcv_map[0:rcv_m_size],rcv_mapf[0:rcv_m_nt*2],rbuf[0:bl]) private(i,j,t) async(k+1)
jl/gs_acc.c://#pragma acc loop seq
jl/gs_acc.c://#pragma acc wait
jl/gs_acc.c:    //#pragma acc parallel loop gang vector present(u[0:uds],rcv_map[0:rcv_m_size],rbuf[0:bl]) private(i,j,k)
jl/gs_acc.c://#pragma acc parallel loop gang vector present(u[0:uds],t_map[0:t_m_size],t_mapf[0:t_m_nt*2]) private(i,j,t) async(k+1)
jl/gs_acc.c://#pragma acc loop seq
jl/gs_acc.c://#pragma acc wait

>>
>>
>> > http://mrfil.github.io/PowerGrid/
>>
>> /tmp/PowerGrid$ git grep -il "pragma acc"
>> PowerGrid/Gfft.hpp
>> PowerGrid/Gnufft.hpp
>> PowerGrid/ftCpu.hpp
>> PowerGrid/gridding.hpp
>> PowerGrid/griddingSupport.hpp
>>
>> From http://mrfil.github.io/PowerGrid/docs/Installation:
>>
>> We have experience with PGC++ 15.7 from NVIDIA/The Portland Group as the version we have used most extensively. There is a free license available as part of the OpenACC Toolkit for academic users.
>>
>> GCC 6.1 has OpenACC support but has not yet been tested by the developers, we welcome reports of anyone trying to compile with it. We hope to support it alongside PGI compilers in the near future.
>>
>> For those lucky enough to have access to Cray supercomputers, the Cray compiler does support OpenACC, but we have not tried to build with it. Because the Cray compilers are not available on desktops, workstations, or non-Cray branded clusters, we have not dedicated resources to testing PowerGrid on it.
>>
>> So these folks support OpenACC, but haven't bothered to try the GCC implementation in the 1+ year that it's been available.  How likely are they to use Clang's?
>
>
> I cannot answer that. Perhaps they were waiting for GCC support to mature?

Or maybe they aren't interested in using OpenACC compiler support outside
of PGI.

What I'm really getting at here is who is going to use OpenACC support in
Clang, particularly if there is no compatible Fortran OpenACC compiler?  In
addition to justifying the code maintenance effort, users who are not
developers are essential for implementation hardening.

Best,

Jeff

> Thanks.
>
> Joel
>
>>
>> > https://nek5000.mcs.anl.gov/ (look at the openacc branch in github)
>>
>> (on the openacc branch)
>>
>> /tmp/Nek5000$ git grep -il "\$acc "
>> core/acc.f
>> core/comm_mpi.f
>> core/gmres.f
>> core/hmholtz.f
>> core/hsmg.f
>> core/math.f
>> core/navier1.f
>> core/navier4.f
>> core/plan4.f
>> core/prepost.f
>> core/subs2.f
>>
>> >
>> > https://nekcem.mcs.anl.gov/
>>
>> (on master)
>> /tmp/svn$ git grep -il "\$acc"
>> branches/maxwell-experimental/src/cem_dg.F
>> branches/maxwell-experimental/src/dssum2.F
>> branches/maxwell-experimental/src/io.F
>> branches/maxwell-experimental/src/mat1.F
>> branches/maxwell-experimental/src/maxwell.F
>> branches/maxwell-experimental/src/maxwell_acc.F
>> branches/maxwell-experimental/src/mxm_acc.F
>> branches/trunkQu/src/quantum_csr.F
>> branches/trunkQu/src/quantum_setup.f
>> branches/trunkQu/src/quantum_time.F
>> trunk/examples/openacc_gpu=1/box.usr
>> trunk/examples/openacc_gpu=8/box.usr
>> trunk/src/acoustic.F
>> trunk/src/cem_dg2.F
>> trunk/src/complex.F
>> trunk/src/drift1.F
>> trunk/src/drift1_maud.F
>> trunk/src/drive.F
>> trunk/src/drive_maud.F
>> trunk/src/dssum2.F
>> trunk/src/hmholtz.F
>> trunk/src/io.F
>> trunk/src/mat1.F
>> trunk/src/maxwell.F
>> trunk/src/maxwell_acc.F
>> trunk/src/mg_r2204.F
>> trunk/src/mxm_acc.F
>> trunk/src/poisson.F
>> trunk/src/quantum2.F
>> www/examples/libs/phpThumb/phpthumb.functions.php
>> www/examples/phpthumb.functions.php
>>
>> >>   Can you provide a list of C/C++ applications using OpenACC today and estimate the number of users that will benefit from this feature?
>> >>
>> >>
>> >> Such lists exist, although I don't know what can be shared (and Oak Ridge likely has better lists in this regard than I do).
>> >
>> > I'll look for a better list that I can share.
>>
>> That would be helpful.
>>
>> Best,
>>
>> Jeff
>>
>>
>>
>> >> I can tell you, from my own experience, that we're seeing an increase in development using OpenACC, in both C/C++ and Fortran, over the last couple of years (essentially because the compiler technology has improved to the point where that is now a potentially-productive choice).
>> >
>> >
>> > Providing support in a production-quality, open-source compiler tool chain like LLVM will hopefully accelerate this trend.
>> >
>> > Joel
>> >
>> >>
>> >> Also, we have a strong desire to enable tooling over code bases using OpenACC. Among many other things, at some point we'll likely want the option to automatically migrate much of this code to using OpenMP. Having an OpenACC-enabled Clang, with an implementation that maps to OpenMP, is an important step in that process.
>> >>
>> >>  -Hal
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Jeff
>> >>
>> >> On Tue, Dec 5, 2017 at 11:06 AM, Joel E. Denny via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> We are working on a new project, clacc, that extends clang with OpenACC support.  Clacc's approach is to translate OpenACC (a descriptive language) to OpenMP (a prescriptive language) and thus to build on clang's existing OpenMP support.  While we plan to develop clacc to support our own research, an important goal is to contribute clacc as a production-quality component of upstream clang.
>> >>>
>> >>> We have begun implementing an early prototype of clacc.  Before we get too far into the implementation, we would like to get feedback from the LLVM community to help ensure our design would ultimately be acceptable for contribution.  For that purpose, below is an analysis of several high-level design alternatives we have considered and their various features.  We welcome any feedback.
>> >>>
>> >>> Thanks.
>> >>>
>> >>> Joel E. Denny
>> >>> Future Technologies Group
>> >>> Oak Ridge National Laboratory
>> >>>
>> >>>
>> >>> Design Alternatives
>> >>> -------------------
>> >>>
>> >>> We have considered three design alternatives for the clacc compiler:
>> >>>
>> >>> 1. acc src  --parser-->                     omp AST  --codegen-->  LLVM IR + omp rt calls
>> >>> 2. acc src  --parser-->  acc AST                     --codegen-->  LLVM IR + omp rt calls
>> >>> 3. acc src  --parser-->  acc AST  --ttx-->  omp AST  --codegen-->  LLVM IR + omp rt calls
>> >>>
>> >>> In the above diagram:
>> >>>
>> >>> * acc src = C source code containing acc constructs.
>> >>> * acc AST = a clang AST in which acc constructs are represented by
>> >>>   nodes with acc node types.  Of course, such node types do not
>> >>>   already exist in clang's implementation.
>> >>> * omp AST = a clang AST in which acc constructs have been lowered
>> >>>   to omp constructs represented by nodes with omp node types.  Of
>> >>>   course, such node types do already exist in clang's
>> >>>   implementation.
>> >>> * parser = the existing clang parser and semantic analyzer,
>> >>>   extended to handle acc constructs.
>> >>> * codegen = the existing clang backend that translates a clang AST
>> >>>   to LLVM IR, extended if necessary (depending on which design is
>> >>>   chosen) to perform codegen from acc nodes.
>> >>> * ttx (tree transformer) = a new clang component that transforms
>> >>>   acc to omp in clang ASTs.
>> >>>
>> >>> Design Features
>> >>> ---------------
>> >>>
>> >>> There are several features to consider when choosing among the designs
>> >>> in the previous section:
>> >>>
>> >>> 1. acc AST as an artifact -- Because they create acc AST nodes,
>> >>>    designs 2 and 3 best facilitate the creation of additional acc
>> >>>    source-level tools (such as pretty printers, analyzers, lint-like
>> >>>    tools, and editor extensions).  Some of these tools, such as pretty
>> >>>    printing, would be available immediately or as minor extensions of
>> >>>    tools that already exist in clang's ecosystem.
>> >>>
>> >>> 2. omp AST/source as an artifact -- Because they create omp AST
>> >>>    nodes, designs 1 and 3 best facilitate the use of source-level
>> >>>    tools to help an application developer discover how clacc has
>> >>>    mapped his acc to omp, possibly in order to debug a mapping
>> >>>    specification he has supplied.  With design 2 instead, an
>> >>>    application developer has to examine low-level LLVM IR + omp rt
>> >>>    calls.  Moreover, with designs 1 and 3, permanently migrating an
>> >>>    application's acc source to omp source can be automated.
>> >>>
>> >>> 3. omp AST for mapping implementation -- Designs 1 and 3 might
>> >>>    also make it easier for the compiler developer to reason about and
>> >>>    implement mappings from acc to omp.  That is, because acc and omp
>> >>>    syntax is so similar, implementing the translation at the level of
>> >>>    a syntactic representation is probably easier than translating to
>> >>>    LLVM IR.
>> >>>
>> >>> 4. omp AST for codegen -- Designs 1 and 3 simplify the
>> >>>    compiler implementation by enabling reuse of clang's existing omp
>> >>>    support for codegen.  In contrast, design 2 requires at least some
>> >>>    extensions to clang codegen to support acc nodes.
>> >>>
>> >>> 5. Full acc AST for mapping -- Designs 2 and 3 potentially
>> >>>    enable the compiler to analyze the entire source (as opposed to
>> >>>    just the acc construct currently being parsed) while choosing the
>> >>>    mapping to omp.  It is not clear if this feature will prove useful,
>> >>>    but it might enable more optimizations and compiler research
>> >>>    opportunities.
>> >>>
>> >>> 6. No acc node classes -- Design 1 simplifies the compiler
>> >>>    implementation by eliminating the need to implement many acc node
>> >>>    classes.  While we have so far found that implementing these
>> >>>    classes is mostly mechanical, it does take a non-trivial amount of
>> >>>    time.
>> >>>
>> >>> 7. No omp mapping -- Design 2 does not require acc to be mapped to
>> >>>    omp.  That is, it is conceivable that, for some acc constructs,
>> >>>    there will prove to be no omp syntax to capture the semantics we
>> >>>    wish to implement.  It is also conceivable that we might one day
>> >>>    want to represent some acc constructs directly as extensions to
>> >>>    LLVM IR, where some acc analyses or optimizations might be more
>> >>>    feasible to implement.  This possibility dovetails with recent
>> >>>    discussions in the LLVM community about developing LLVM IR
>> >>>    extensions for various parallel programming models.
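
To make feature 3's claim about syntactic similarity concrete, here is a
small sketch of the same loop written once with OpenACC and once with
OpenMP.  The correspondence shown (parallel loop to target teams distribute
parallel for, copyin/copy to map(to)/map(tofrom)) is only one plausible
mapping chosen for illustration; it is an assumption, not necessarily what
clacc would emit.  The function names saxpy_acc and saxpy_omp are
hypothetical.

```c
/* OpenACC form: descriptive, leaving scheduling decisions to the compiler. */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    int i;
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* One possible OpenMP form (assumed mapping): prescriptive, spelling out
   the offload and the work distribution explicitly. */
void saxpy_omp(int n, float a, const float *x, float *y)
{
    int i;
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Built without OpenACC or OpenMP enabled, both functions fall back to the
same serial loop, which makes it easy to check that a translation preserves
results on the host.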
>> >>>
>> >>> Because of features 4 and 6, design 1 is likely the fastest design to
>> >>> implement, at least at first while we focus on simple acc features and
>> >>> simple mappings to omp.  However, we have so far found no advantage
>> >>> that design 1 has but that design 3 does not have except for feature
>> >>> 6, which we see as the least important of the above features in the
>> >>> long term.
>> >>>
>> >>> The only advantage we have found that design 2 has but that design 3
>> >>> does not have is feature 7.  It should be possible to choose design 3
>> >>> as the default but, for certain acc constructs or scenarios where
>> >>> feature 7 proves important (if any), incorporate design 2.  In other
>> >>> words, if we decide not to map a particular acc construct to any omp
>> >>> construct, ttx would leave it alone, and we would extend codegen to
>> >>> handle it directly.
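
The combination described above (design 3 by default, with design 2 as a
fallback) can be sketched as a toy model.  Everything here is hypothetical:
the node kinds, the flat statement list, and the functions ttx and codegen
are illustrative stand-ins, not clang's AST classes or APIs.  The sketch
assumes ttx builds a new tree instead of mutating its input, and that it
passes through any acc node it has no omp mapping for.

```c
#include <stdlib.h>

/* Hypothetical node kinds; a real clang AST is far richer. */
typedef enum { STMT_PLAIN, ACC_PARALLEL_LOOP, ACC_UNMAPPED, OMP_TARGET_LOOP } NodeKind;

typedef struct Node {
    NodeKind kind;
    struct Node *next;   /* next statement in a flat statement list */
} Node;

/* Allocate one node; returns NULL on allocation failure. */
Node *node_new(NodeKind kind, Node *next)
{
    Node *n = malloc(sizeof *n);
    if (n) { n->kind = kind; n->next = next; }
    return n;
}

/* Free a whole statement list. */
void node_free(Node *n)
{
    while (n) { Node *next = n->next; free(n); n = next; }
}

/* ttx: return a NEW list in which mappable acc nodes become omp nodes.
   Unmapped acc nodes and plain statements are copied unchanged.
   Returns NULL for an empty input or on allocation failure. */
Node *ttx(const Node *in)
{
    Node *head = NULL, **tail = &head;
    for (; in; in = in->next) {
        NodeKind k = (in->kind == ACC_PARALLEL_LOOP) ? OMP_TARGET_LOOP : in->kind;
        *tail = node_new(k, NULL);
        if (!*tail) { node_free(head); return NULL; }
        tail = &(*tail)->next;
    }
    return head;
}

/* codegen: count nodes lowered through the existing omp path and nodes
   that would need direct acc codegen (the design 2 fallback). */
void codegen(const Node *ast, int *via_omp, int *via_acc)
{
    *via_omp = *via_acc = 0;
    for (; ast; ast = ast->next) {
        if (ast->kind == OMP_TARGET_LOOP) ++*via_omp;
        else if (ast->kind == ACC_UNMAPPED) ++*via_acc;
    }
}
```

Because ttx leaves its input untouched, the original acc tree stays
available to source-level tools after translation.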
>> >>>
>> >>> Conclusions
>> >>> -----------
>> >>>
>> >>> For the above reasons, and because design 3 offers the cleanest
>> >>> separation of concerns, we have chosen design 3 with the possibility
>> >>> of incorporating design 2 where it proves useful.
>> >>>
>> >>> Because of the immutability of clang's AST, the design of our proposed
>> >>> ttx component requires careful consideration.  To shorten this initial
>> >>> email, we have omitted those details for now, but we will be happy to
>> >>> include them as the discussion progresses.
>> >>>
>> >>> _______________________________________________
>> >>> cfe-dev mailing list
>> >>> cfe-dev at lists.llvm.org
>> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Jeff Hammond
>> >> jeff.science at gmail.com
>> >> http://jeffhammond.github.io/
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Hal Finkel
>> >> Lead, Compiler Technology and Programming Languages
>> >> Leadership Computing Facility
>> >> Argonne National Laboratory
>> >
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> jeff.science at gmail.com
>> http://jeffhammond.github.io/
>
>

--
Jeff Hammond
jeff.science at gmail.com
http://jeffhammond.github.io/