[Openmp-dev] Proposal: Resolve combined directives in parsing phase

Mon Jun 5 01:29:46 PDT 2017

> The reason is that the user would expect the same outcome/performance from both,

The user should not expect the same performance.
Consider, if you write
#pragma omp parallel for
for (...)
   ...

and expand it as described by the standard, you have (where the implicit barrier at the end of omp for is made visible)
#pragma omp parallel
{
#   pragma omp for nowait
    for (...)
        ...
#   pragma omp barrier
}

That barrier is unnecessary since it immediately precedes a join, which is itself a barrier (the master thread cannot leave until all threads arrive).
Therefore, the code that is normally generated elides the unnecessary barrier and expands into
#pragma omp parallel
{
#   pragma omp for nowait
    for (...)
        ...
}

Thus it can easily be the case that omp parallel do/for is faster than omp parallel + omp do/for.

(Of course you can fix this if you expand to parallel; do/for nowait...)
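
To make the three variants concrete, here is a minimal C sketch. The function names and the reduction are made up for illustration and do not come from any compiler's generated code; the point is only that all three forms compute the same result, while the naive split form carries one more barrier than the other two.

```c
#include <assert.h>

/* Combined directive: a single construct, so the join at the end of the
   parallel region can serve as the only synchronization point. */
long sum_combined(const int *a, int n) {
    assert(n >= 0);
    long sum = 0;
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Naive split: the worksharing loop keeps its implicit barrier, which is
   immediately followed by the barrier implied by the join. */
long sum_split(const int *a, int n) {
    assert(n >= 0);
    long sum = 0;
#pragma omp parallel
    {
#pragma omp for reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += a[i];
    }
    return sum;
}

/* Split with nowait: the loop's barrier is dropped explicitly. Reading sum
   is still safe because it happens after the join. */
long sum_split_nowait(const int *a, int n) {
    assert(n >= 0);
    long sum = 0;
#pragma omp parallel
    {
#pragma omp for nowait reduction(+ : sum)
        for (int i = 0; i < n; ++i)
            sum += a[i];
    }
    return sum;
}
```

Without OpenMP enabled the pragmas are ignored and all three functions run serially, so the sketch only shows equivalence of results; any timing difference would have to be measured with a real OpenMP build.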

-- Jim

Jim Cownie <james.h.cownie at intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers, and Runtimes)
Tel: +44 117 9071438

From: Openmp-dev [mailto:openmp-dev-bounces at lists.llvm.org] On Behalf Of Schürmann, Daniel via Openmp-dev
Sent: Friday, June 2, 2017 9:58 PM
To: Arpith C Jacob <acjacob at us.ibm.com>
Cc: openmp-dev at lists.llvm.org
Subject: Re: [Openmp-dev] Proposal: Resolve combined directives in parsing phase

Thank you for your feedback, Arpith.
I see that the nvptx codegen relies in part on combined directives and understand the possible difficulties.
However, codegen is already able to do this with the target and teams directives. Remember that the split directives are immediate child nodes in the AST.
But you are right, this change can only be made after changes to codegen.
In my opinion the code generation should in general be able to handle both cases equivalently: combined directives and the separated version.
The reason is that the user would expect the same outcome/performance from both, e.g.
-> #pragma omp target parallel for
as well as
-> #pragma omp target
   #pragma omp parallel
   #pragma omp for

as the spec states. (The only exception seems to be the combined directives that include distribute.)

Kind regards,
Daniel

From: Arpith C Jacob <acjacob at us.ibm.com>
Sent: Friday, June 2, 2017 8:24 PM
To: Schürmann, Daniel <daniel.schuermann at campus.tu-berlin.de>
Cc: openmp-dev at lists.llvm.org
Subject: Re: [Openmp-dev] Proposal: Resolve combined directives in parsing phase

Daniel,

> Cons: - Code optimizations for combined directives may be harder to implement

From the perspective of GPU code generation, treating combined directives as a special case is *immensely* important for performance. Knowing that there is *no* serial section in the target region makes it possible to generate simplified, low-overhead code that resembles typical CUDA kernels. I am worried that splitting combined directives will make it much harder to get this performance back, particularly because we must do OpenMP lowering in Clang.

> the code generation is implemented as inlined function what results in
> ignoring the semantic meaning of these directives.

Can the current code generation be altered to fix the issues that you see?
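
To illustrate the earlier point about serial sections, here is a small C sketch. The function names are hypothetical, and nothing here reflects the actual nvptx lowering; it only shows the source-level difference. In the first function the target region consists of the parallel loop alone, whereas in the second the region contains a statement that runs serially before the loop.

```c
#include <assert.h>

/* Combined form: the target region is nothing but the parallel loop, so
   there is no serial code inside it. The factor is prepared on the host. */
void scale_combined(float *x, int n, float factor) {
    assert(n >= 0);
    const float f = (factor < 0.0f) ? -factor : factor;
#pragma omp target teams distribute parallel for map(tofrom : x[0 : n])
    for (int i = 0; i < n; ++i)
        x[i] *= f;
}

/* Split form: the target region starts with a serial statement, so the
   generated code has to handle a serial-to-parallel transition on the
   device. */
void scale_split(float *x, int n, float factor) {
    assert(n >= 0);
#pragma omp target map(tofrom : x[0 : n])
    {
        /* Serial section: executed once before the parallel loop starts. */
        const float f = (factor < 0.0f) ? -factor : factor;
#pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] *= f;
    }
}
```

Both functions produce the same values; when no offload device is configured, or OpenMP is disabled, the regions simply run on the host.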

> Unrelated question:
> I don't understand the necessity of the __kmpc_fork_teams() run-time
> call as the __tgt_target_teams() implementation should be able to handle
> this case.

My understanding is that the __tgt* calls are implemented in the target offload library while the __kmpc* calls are in the OpenMP runtime. On CPUs, forking of teams is done in the __kmpc_fork_teams() call. On the GPU, the offload call __tgt_target_teams() launches a kernel with multiple teams so the __kmpc_fork_teams() is a no-op.

Regards,
Arpith


From: Daniel Schürmann via Openmp-dev <openmp-dev at lists.llvm.org>
To: <openmp-dev at lists.llvm.org>
Date: 06/02/2017 09:08 AM
Subject: [Openmp-dev] Proposal: Resolve combined directives in parsing phase
Sent by: "Openmp-dev" <openmp-dev-bounces at lists.llvm.org>

________________________________

At the moment, combined directives have their own AST representation for
type-checking and code generation. For some of the combined constructs,
the code generation is implemented as an inlined function, which results in
ignoring the semantic meaning of these directives.

This is true for e.g.
EmitOMPTargetParallelForSimdDirective
EmitOMPTargetSimdDirective
EmitOMPTeamsDistributeDirective
EmitOMPTargetTeamsDistributeDirective
EmitOMPTargetTeamsDistributeParallelForDirective
and more

One solution would be the proper codegen implementation for these
directives.
However, I would like to propose a simpler and closer-to-spec approach:
resolving combined directives in the parsing phase into nested AST nodes.

E.g. an OMPTargetTeamsDistributeDirective would be resolved into
OMPTargetDirective
    |- OMPTeamsDirective
        |- OMPDistributeDirective

where type-checking and codegen for these single directives are already
implemented.
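
At the source level, the nesting described above corresponds to the following C sketch. The function names are invented for illustration; the second function simply spells out by hand the structure that the parser would produce for the first.

```c
#include <assert.h>

/* Combined directive as the user writes it. */
void fill_combined(int *out, int n) {
    assert(n >= 0);
#pragma omp target teams distribute map(from : out[0 : n])
    for (int i = 0; i < n; ++i)
        out[i] = 2 * i;
}

/* The same loop with the directive split into its constituents, each
   one nested immediately inside the previous one. */
void fill_nested(int *out, int n) {
    assert(n >= 0);
#pragma omp target map(from : out[0 : n])
#pragma omp teams
#pragma omp distribute
    for (int i = 0; i < n; ++i)
        out[i] = 2 * i;
}
```

Both functions fill the array identically; whether they also get the same generated code is exactly the question raised by this proposal.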
The advantages are:
- Much simpler type-checking and code generation
- We match the specification stating that combined directives have the
semantic meaning of one construct immediately followed by the other
construct
- All combined directives are fully supported if their derived
constructs are supported

Potential disadvantages:
- The AST representation differs from the input. However, this is
already the case due to inserted implicit parameters.
- Code optimizations for combined directives may be harder to implement

In my opinion the benefits outweigh the disadvantages, but I may not be
aware of some implications. Please let me know your thoughts about this
idea. And tell me if I misunderstood anything that led to the
decision for the current design.

Unrelated question:
I don't understand the necessity of the __kmpc_fork_teams() run-time
call as the __tgt_target_teams() implementation should be able to handle
this case.

Daniel
