[flang-dev] Flang Technical Call : Summary of presentation on OpenMP for Flang

Thu Sep 26 07:44:04 PDT 2019

Handling of Target construct
++++++++++++++++++++++
The target construct in OpenMP is used to execute code on devices such as GPUs. For this construct, the design I am proposing uses the OpenMP IRBuilder. The handling in MLIR will be minimal and will only involve passing the information through and processing loop-specific clauses like collapse. I expect that the driver will invoke flang multiple times to handle copies of the source for the targets and the host, and that clang-offload-bundler will be called to bundle and unbundle wherever necessary.
Note: I realise that there might be other opinions regarding this; if so, please respond to this mail.

Below I summarize the various steps in the proposed F18 flow from Fortran source to LLVM IR.
1) Fortran source. Consider the sample program below with a parallel do inside a target region.
subroutine target_add(a, b, c, N)
  integer:: N
  real:: a(N)
  real:: b(N)
  real:: c(N)
  !$omp target
  !$omp parallel do
  do i=1, N
    c(i) = a(i) + b(i)
  end do
  !$omp end parallel do
  !$omp end target
end subroutine

2) Parse tree: Omitted to keep it short

3) The parse tree is lowered to a mix of OpenMP and FIR dialects in MLIR. There are operations in the OpenMP dialect for representing the target, parallel, and do constructs. The rest of the code is lowered to FIR.

omp.target {
  omp.parallel {
     omp.do {
       fir.do %i = 1 to %n : !fir.integer {
         %a_val = fir.load %a_addr[%i] : memref<nxf32>
         %b_val = fir.load %b_addr[%i] : memref<nxf32>
         %c_val = addf %a_val, %b_val : !fir.float
         fir.store %c_val, %c_addr[%i] : memref<nxf32>
       }
     }
  }
}

4) The next conversion is to a mix of OpenMP and LLVM dialects in MLIR. Here the loop is lowered to the LLVM dialect of MLIR while the OpenMP constructs are retained. This is possible since the OpenMP dialect is designed to coexist with other dialects, including FIR and LLVM.

omp.target {
  omp.parallel {
    omp.do {
      ^body:
      ....
      %a_val = llvm.load %a_addr : !llvm<"float*">
      %b_val = llvm.load %b_addr : !llvm<"float*">
      %c_val = llvm.fadd %a_val, %b_val : !llvm.float
      llvm.store %c_val, %c_addr : !llvm<"float*">
      ...
      llvm.cond_br %check, ^s_exit, ^body
      ^s_exit:
    }
  }
}

5) At the translation stage, the mix of OpenMP and LLVM dialects is converted to LLVM IR using the OpenMP IRBuilder and the existing translation library for the LLVM dialect. The OpenMP IRBuilder will be fed the basic blocks constituting the loop and asked to generate IR containing calls to the OpenMP runtime (__tgt_target, __kmpc_fork_call) for the parallel do construct in a target region.
Note: Only the host-side code is shown here.

define internal void @target_add() {
  ...
  call void @__omp_offloading_target_add()
  ...
}

define internal void @__omp_offloading_target_add() {
  ...
  call i32 @__tgt_target( ...)
  call void @__kmpc_fork_call(..., @__omp_outlined_target_add, ...)
  ...
}

define internal void @__omp_outlined_target_add() {
...
body:
...
%a_val = load float, float* %a_addr
%b_val = load float, float* %b_addr
%c_val = fadd float %a_val, %b_val
store float %c_val, float* %c_addr
...
br i1 %check, label %s_exit, label %body
...
s_exit:
}

--Kiran
________________________________
From: flang-dev <flang-dev-bounces at lists.flang-compiler.org> on behalf of Kiran Chandramohan <Kiran.Chandramohan at arm.com>
Sent: 12 September 2019 16:43
To: flang-dev at lists.flang-compiler.org <flang-dev at lists.flang-compiler.org>; flang-dev at lists.llvm.org <flang-dev at lists.llvm.org>; Eric Schweitz <eschweitz at nvidia.com>
Subject: Re: [Flang-dev] Flang Technical Call : Summary of presentation on OpenMP for Flang

This mail summarises the handling of the simd construct.

!$omp simd: The simd construct tells the compiler that the loop can be vectorised. Since vectorisation is performed by LLVM (see Note 1), the frontend passes the simd information to LLVM through metadata. Since the simd construct is handled entirely through metadata, we can skip the OpenMP IRBuilder for this construct (see Note 2).

1) Consider the following source, which has a loop that adds two arrays and stores the result in a third array. Assume that this loop is not trivially vectorisable due to aliasing issues (for example, the arrays being pointers). An omp simd construct is used to inform the compiler that this loop can be vectorised.
  !$omp simd simdlen(4)
  do i=1,n
    c(i) = a(i) + b(i)
  end do

2) The Fortran program will be parsed and represented as a parse tree. Skipping the parse tree representation to keep it short.

3) The parse tree is lowered to a mix of OpenMP and FIR dialects in MLIR. A representation of this mix is given below. The dialect has an operation omp.simd which represents OpenMP simd. It has attributes for the various constant clauses like simdlen, safelen etc. A reduction, if present in the omp simd statement, can be represented by another operation, omp.reduction. Any transformation necessary to expose reduction operations/variables (as specified in the reduction clause) can be performed in the OpenMP MLIR layer itself. The fir do loop is nested inside the simd region.

omp.simd {simdlen=4} {
   fir.do %i = 1 to %n : !fir.integer {
     %a_val = fir.load %a_addr[%i] : memref<nxf32>
     %b_val = fir.load %b_addr[%i] : memref<nxf32>
     %c_val = addf %a_val, %b_val : !fir.float
     fir.store %c_val, %c_addr[%i] : memref<nxf32>
   }
}

4) For this construct, the next step is to lower the OpenMP and FIR dialects to LLVM dialect. During this lowering, information is added via attributes to the memory instructions and loop branch instruction in the loop.
a) the memory access instructions have an attribute which denotes that they can be executed in parallel.
b) the loop branch instruction has attributes for enabling vectorisation, setting the vectorisation width and pointing to all memory access operations via the access_group which can be parallelised.

^body:
....
%a_val = llvm.load %a_addr : !llvm<"float*"> {access_group=1}
%b_val = llvm.load %b_addr : !llvm<"float*"> {access_group=1}
%c_val = llvm.fadd %a_val, %b_val : !llvm.float
llvm.store %c_val, %c_addr : !llvm<"float*"> {access_group=1}
...
llvm.cond_br %check, ^s_exit, ^body {vectorize_width=4, vectorize_enable=1,parallel_loop_accesses=1}

^s_exit:


5) The LLVM MLIR is translated to LLVM IR. In this stage, all the attributes from (4) will be translated to metadata.

body:
...
%a_val = load float, float* %a_addr, !llvm.access.group !1
%b_val = load float, float* %b_addr, !llvm.access.group !1
%c_val = fadd float %a_val, %b_val
store float %c_val, float* %c_addr, !llvm.access.group !1
...
br i1 %check, label %s_exit, label %body, !llvm.loop !2
...
s_exit:

!1 = distinct !{}
!2 = distinct !{!2, !3, !4, !5}
!3 = !{!"llvm.loop.vectorize.width", i32 4}
!4 = !{!"llvm.loop.vectorize.enable", i1 true}
!5 = !{!"llvm.loop.parallel_accesses", !1}

Note:
1) There is support for vectorization in MLIR as well. I am assuming that it is not as good as the LLVM vectoriser and hence am not using MLIR vectorization.
2) For this construct, we have chosen not to use the OpenMP IRBuilder. There is still one possible reason for using the OpenMP IRBuilder even for this simple use case: if LLVM decides to change the loop metadata, the change has to be made only in the OpenMP IRBuilder. If we do not use the IRBuilder, developers will have to change the metadata generation in both Clang and Flang. I assume that this happens rarely and hence is OK.
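
To make the mapping from the simd clauses to loop metadata in steps 4 and 5 concrete, here is a minimal Python sketch that prints the metadata nodes for a given simdlen. The function name and the numbering scheme are hypothetical and only for illustration; the metadata names themselves (llvm.loop.vectorize.width, llvm.loop.vectorize.enable, llvm.loop.parallel_accesses) are the ones used in the example above. The access group is emitted as a distinct node, which is what LLVM expects.

```python
def simd_loop_metadata(simdlen=None, first_id=1):
    """Return LLVM IR metadata lines for a loop marked with omp simd.

    Hypothetical helper for illustration only; not part of any compiler.
    """
    group = first_id          # access group shared by the loop's memory ops
    loop = first_id + 1       # self-referencing loop id node
    entries = []
    if simdlen is not None:
        # simdlen(n) maps to the preferred vectorisation width.
        entries.append(f'!{{!"llvm.loop.vectorize.width", i32 {simdlen}}}')
    entries.append('!{!"llvm.loop.vectorize.enable", i1 true}')
    entries.append(f'!{{!"llvm.loop.parallel_accesses", !{group}}}')

    ids = list(range(loop + 1, loop + 1 + len(entries)))
    refs = ", ".join(f"!{i}" for i in [loop] + ids)

    lines = [f"!{group} = distinct !{{}}",
             f"!{loop} = distinct !{{{refs}}}"]
    lines += [f"!{i} = {entry}" for i, entry in zip(ids, entries)]
    return lines


if __name__ == "__main__":
    print("\n".join(simd_loop_metadata(simdlen=4)))
```

If the metadata format changes in LLVM, a helper of this kind is the single place that would need updating, which is the trade-off discussed in Note 2.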

________________________________
From: flang-dev <flang-dev-bounces at lists.flang-compiler.org> on behalf of Kiran Chandramohan <Kiran.Chandramohan at arm.com>
Sent: 03 September 2019 23:02
To: flang-dev at lists.flang-compiler.org <flang-dev at lists.flang-compiler.org>; flang-dev at lists.llvm.org <flang-dev at lists.llvm.org>
Cc: nd <nd at arm.com>
Subject: Re: [Flang-dev] Flang Technical Call : Summary of presentation on OpenMP for Flang

A walkthrough for the collapse clause on an OpenMP loop construct is given below. This is an example where the transformation (collapse) is performed in the MLIR layer itself.

1)Fortran OpenMP code with collapse
!$omp parallel do private(j) collapse(2)
do i=lb1,ub1
  do j=lb2,ub2
    ...
    ...
  end do
end do

2) The Fortran source with OpenMP will be converted to an AST by the F18 parser. Parse tree not shown here to keep it short.

3.a) The parse tree will be lowered to a mix of FIR and OpenMP dialects. There are omp.parallel and omp.do operations in the OpenMP dialect which represent the parallel and loop constructs. The omp.do operation has an attribute "collapse" which specifies the number of loops to be collapsed.
omp.parallel {
  omp.do {collapse = 2} %i = %lb1 to %ub1 : !fir.integer {
    fir.do %j = %lb2 to %ub2 : !fir.integer {
    ...
    }
  }
}

3.b) A transformation pass in MLIR will perform the collapsing. The collapse operation will cause the omp.do loop to be coalesced with the loop immediately nested inside it. Note: MLIR already has loop coalescing among its transformation passes. We should try to make use of it.
omp.parallel {
  %ub3 =
  omp.do %i = 0 to %ub3 : !fir.integer {
    ...
  }
}
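
The index arithmetic behind the coalescing in step 3.b can be sketched in Python as follows. This is only an illustration under stated assumptions: inclusive Fortran-style bounds, unit stride, and a flat index k running from 0 to n1*n2 - 1. The function names are hypothetical and do not correspond to any MLIR pass.

```python
def collapsed_trip_count(lb1, ub1, lb2, ub2):
    """Iterations of the coalesced loop (0 if either loop is empty)."""
    n1 = max(ub1 - lb1 + 1, 0)
    n2 = max(ub2 - lb2 + 1, 0)
    return n1 * n2


def recover_indices(k, lb1, lb2, ub2):
    """Map the 0-based flat index k back to the original (i, j)."""
    n2 = ub2 - lb2 + 1
    return lb1 + k // n2, lb2 + k % n2


def collapsed_iterations(lb1, ub1, lb2, ub2):
    """Iteration space as visited by the single coalesced loop."""
    total = collapsed_trip_count(lb1, ub1, lb2, ub2)
    return [recover_indices(k, lb1, lb2, ub2) for k in range(total)]


def nested_iterations(lb1, ub1, lb2, ub2):
    """Iteration space as visited by the original two-deep nest."""
    return [(i, j)
            for i in range(lb1, ub1 + 1)
            for j in range(lb2, ub2 + 1)]


if __name__ == "__main__":
    # Both forms visit the same (i, j) pairs in the same order.
    assert collapsed_iterations(1, 3, 2, 4) == nested_iterations(1, 3, 2, 4)
    print(collapsed_trip_count(1, 3, 2, 4))
```

The single flat loop gives the runtime one larger iteration space to divide among threads, which is the point of collapse.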

4) Next conversion will be to a mix of LLVM and OpenMP dialect.
omp.parallel {
  %ub3 =
  omp.do %i = 0 to %ub3 : !llvm.integer {
    ...
  }
}

5) Finally, LLVM IR will be generated for this code. The translation to LLVM IR can make use of the OpenMP IRBuilder. LLVM IR not shown here to keep it short.

Thanks,
Kiran

________________________________
From: Kiran Chandramohan <Kiran.Chandramohan at arm.com>
Sent: 21 August 2019 13:15:04
To: Eric Schweitz (PGI) <eric.schweitz at pgroup.com>; flang-dev at lists.flang-compiler.org <flang-dev at lists.flang-compiler.org>; flang-dev at lists.llvm.org <flang-dev at lists.llvm.org>
Cc: nd <nd at arm.com>
Subject: Re: Flang Technical Call : Summary of presentation on OpenMP for Flang

Thanks, Eric, for the clarification.

Also, sharing this write up of the flow through the compiler for an OpenMP construct. The first one (Proposed Plan) is as per the presentation. The second one (Modified Plan) incorporates Eric's feedback to lower the F18 AST to a mix of OpenMP and FIR dialect.

I Proposed plan

1) Example OpenMP code
<Fortran code>
!$omp parallel
c = a + b
!$omp end parallel
<More Fortran code>

2) Parse tree (Copied relevant section from -fdebug-dump-parse-tree)
<Fortran parse tree>
| | ExecutionPartConstruct -> ExecutableConstruct -> OpenMPConstruct -> OpenMPBlockConstruct
| | | OmpBlockDirective -> Directive = Parallel
| | | OmpClauseList ->
| | | Block
| | | | ExecutionPartConstruct -> ExecutableConstruct -> ActionStmt -> AssignmentStmt
| | | | | Variable -> Designator -> DataRef -> Name = 'c'
| | | | | Expr -> Add
| | | | | | Expr -> Designator -> DataRef -> Name = 'a'
| | | | | | Expr -> Designator -> DataRef -> Name = 'b'
| | | OmpEndBlockDirective -> OmpBlockDirective -> Directive = Parallel
<More Fortran parse tree>

3) The first lowering will be to FIR dialect and the dialect has a pass-through operation for OpenMP. This operation has a nested region which contains the region of code influenced by the OpenMP directive. The contained region will have other FIR (or standard dialect) operations. 
Mlir.region(…) {
  %1 = fir.x(…)
  …
  %20 = fir.omp attribute:parallel {
    %1 = addf %2, %3 : f32
  }
  %21 = <more fir>
  …
}

4) The next lowering will be to OpenMP and LLVM dialect. The OpenMP dialect has an operation called parallel with a nested region of code. The nested region will have llvm dialect operations.
Mlir.region(…) {
  %1 = llvm.xyz(...)
  …
  %20 = omp.parallel {
    %1 = llvm.fadd %2, %3 : !llvm.float
  }
  %21 = <more llvm dialect>
  …
}

5) The next conversion will be to LLVM IR. Here the OpenMP dialect will be lowered using the OpenMP IRBuilder and the translation library of the LLVM dialect. The IRBuilder will see that there is a region under the OpenMP construct omp.parallel. It will collect all the basic blocks inside that region and then generate outlined code using those basic blocks. Suitable calls to the OpenMP runtime API will be inserted.

define @outlined_parallel_fn(...)
{
  ....
  %1 = fadd float %2, %3
  ...
}
define @xyz(…)
{
  %1 = alloca float
  ....
  call @__kmpc_fork_call(..., @outlined_parallel_fn, ...)
}
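
As an analogy for the outlining in step 5, the Python sketch below moves the body of the parallel region into its own function and hands that function to a fork-join helper. Everything here is an assumption made for illustration: fork_call is a toy stand-in and not the real __kmpc_fork_call interface, and the shared variables are modelled as a dictionary passed by reference.

```python
import threading


def fork_call(outlined_fn, num_threads, shared):
    """Toy fork-join: run outlined_fn once per thread, then wait for all."""
    threads = [threading.Thread(target=outlined_fn, args=(tid, shared))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def outlined_parallel_fn(tid, shared):
    """Body of the parallel region, outlined into its own function.

    As with a plain 'omp parallel' (no worksharing), every thread
    executes the same statement c = a + b on the shared variables.
    """
    shared["c"] = shared["a"] + shared["b"]
    shared["ran"][tid] = True


def xyz(a, b, num_threads=4):
    """Caller: sets up the shared variables and forks the region."""
    shared = {"a": a, "b": b, "c": None, "ran": [False] * num_threads}
    fork_call(outlined_parallel_fn, num_threads, shared)
    return shared


if __name__ == "__main__":
    print(xyz(1.5, 2.0)["c"])
```

The caller only needs the outlined function and a handle to the shared data, which is why the region's basic blocks can be collected and moved without the frontend knowing the runtime details.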

II Modified plan

The differences are only in steps 3 and 4. Other steps remain the same.

3) The first lowering will be to a mix of FIR dialect and OpenMP dialect. The OpenMP dialect has an operation called parallel with a nested region of code. The nested region will have FIR (and standard dialect) operations. 
Mlir.region(…) {
  %1 = fir.x(…)
  …
  %20 = omp.parallel {
    %1 = addf %2, %3 : f32
  }
  %21 = <more fir>
  …
}

4) The next lowering will be to OpenMP and LLVM dialect
Mlir.region(…) {
  %1 = llvm.xyz(...)
  …
  %20 = omp.parallel {
    %1 = llvm.fadd %2, %3 : !llvm.float
  }
  %21 = <more llvm dialect>
  …
}
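
The key property of steps 3 and 4 in the modified plan is that the conversion rewrites only the non-OpenMP operations and leaves omp.parallel and its nesting untouched. A minimal Python sketch of that idea follows. The Op class and the one-to-one rename table are hypothetical simplifications; a real MLIR conversion also rewrites types and operands.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str                                     # e.g. "omp.parallel"
    regions: list = field(default_factory=list)   # each region: list of Op


# Hypothetical renames used only for this illustration.
TO_LLVM = {
    "fir.load": "llvm.load",
    "fir.store": "llvm.store",
    "addf": "llvm.fadd",
}


def convert_to_llvm(op):
    """Rewrite FIR/standard ops to LLVM dialect ops; keep omp.* ops as-is."""
    new_name = TO_LLVM.get(op.name, op.name)
    new_regions = [[convert_to_llvm(inner) for inner in region]
                   for region in op.regions]
    return Op(new_name, new_regions)


def dialects(op):
    """Set of dialect prefixes used in op and everything nested in it."""
    prefix = op.name.split(".")[0] if "." in op.name else "std"
    found = {prefix}
    for region in op.regions:
        for inner in region:
            found |= dialects(inner)
    return found


if __name__ == "__main__":
    before = Op("omp.parallel",
                [[Op("fir.load"), Op("fir.load"), Op("addf"), Op("fir.store")]])
    after = convert_to_llvm(before)
    print(sorted(dialects(before)), "->", sorted(dialects(after)))
```

Because the OpenMP operation survives the conversion, the later translation step still sees the region boundary it needs for outlining.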

Thanks,
Kiran

________________________________
From: Eric Schweitz (PGI) <eric.schweitz at pgroup.com>
Sent: 19 August 2019 17:35:04
To: Kiran Chandramohan <Kiran.Chandramohan at arm.com>; flang-dev at lists.flang-compiler.org <flang-dev at lists.flang-compiler.org>
Subject: RE: Flang Technical Call : Summary of presentation on OpenMP for Flang

Re: And would like the FIR to have nothing to do with OpenMP.

This seems stronger than what I meant, so I’ll clarify a bit.

FIR should have no dependence on OpenMP, since it is possible to write (vanilla) Fortran programs without OpenMP.  However FIR clearly must also co-exist with an OpenMP dialect.  Stated another way, we don’t want a circular dependence between “vanilla Fortran” and OpenMP.  Since OpenMP is a directive-based meta-language that impacts the code gen of a program (whether it is Fortran or some other language), it seems quite natural and necessary that OpenMP be able to operate upon the substrate language  (whether that substrate is FIR or LLVM IR or something else). Fortunately, MLIR allows one to mix dialects, and thus each dialect can focus on the problems it’s trying to solve.

--

Eric

From: flang-dev <flang-dev-bounces at lists.flang-compiler.org> On Behalf Of Kiran Chandramohan
Sent: Friday, August 16, 2019 9:13 AM
To: flang-dev at lists.flang-compiler.org
Subject: [Flang-dev] Flang Technical Call : Summary of presentation on OpenMP for Flang

Hi,

This mail is a summary of the presentation that I gave on 31st August about supporting OpenMP in Flang (F18). Link to presentation: https://drive.google.com/open?id=1Q2Y2fIavewq9oxRDruWoi8TaItlq1g9l

The mail follows the contents of the presentation and should be read in conjunction with the slides. It also includes feedback provided during and outside the presentation. It would be great to receive further feedback.

This presentation discusses the design of how OpenMP can be represented in the IR and how it can be translated to LLVM IR. The design is intended to be modular so that other frontends (C/C++) can reuse it in the future. The design should also enable reuse of OpenMP code from Clang (LLVM IR generation, outlining etc). It should also be acceptable to the Flang, MLIR and Clang communities and, most importantly, should allow for any OpenMP optimisation desired.

The designs presented use the following two components.

a) MLIR [3]: Necessary since MLIR has already been chosen as the framework for building the Fortran IR (FIR) for Flang.

b) OpenMP IRBuilder [2]: For sharing of OpenMP code with Clang.

The current sequential code flow in Flang (Slide 5) can be summarised as follows:

[Fortran code] -> Parser -> [AST] -> Lowering -> [FIR MLIR] -> Conversion -> [LLVM MLIR] -> Translation -> [LLVM IR]

Four design plans are presented. These plans add OpenMP support to the sequential code flow. All the plans assume that FIR is augmented with a pass-through operation which represents OpenMP constructs.

Plan 1 (Slide 6)

The first plan extends LLVM MLIR with OpenMP operations. The pass-through OpenMP operation in FIR is converted to the OpenMP operation in LLVM MLIR. The OpenMP operation in LLVM MLIR can be converted to LLVM IR during translation. The translation process can call the OpenMP IRBuilder to generate the LLVM IR code.

Pros: Easy to implement, Can reuse Clang OpenMP code.

Cons: Not acceptable to the MLIR community [1]. By design, LLVM MLIR should be similar to LLVM IR. OpenMP is a form of parallelism and has concepts such as nested regions which do not map directly to LLVM IR and hence to LLVM MLIR.

Plan 2 (Slide 7)

In the second plan, LLVM MLIR is not extended. Here the pass-through OpenMP operation in FIR has to be converted to existing LLVM MLIR. This would involve outlining, privatisation etc during the conversion of FIR to LLVM MLIR.

Pros: Easy to implement.

Cons: Cannot be shared with other users (C/C++) and non-LLVM targets (accelerators). Cannot re-use Clang OpenMP code.

Plan 3 (Slide 8)

The third plan defines a separate MLIR dialect for OpenMP. The pass-through OpenMP operation in FIR can be converted to Operations in the OpenMP MLIR dialect. OpenMP specific optimisations can be performed in this dialect. This dialect will be lowered to the LLVM dialect and outlining etc will happen at this time.

Pros: MLIR dialect can be used by other users. Acceptable to MLIR community [1].

Cons: Cannot reuse code from Clang for OpenMP.

Plan 4: The proposed design (Slide 9)

The proposed design also involves creating an OpenMP dialect in MLIR. The difference lies in the fact that the OpenMP dialect has some ties to the LLVM dialect, like sharing its types and using its translation library. The Fortran IR (FIR) will be lowered to a mix of OpenMP and LLVM dialects. The translation library of the LLVM dialect will be extended to create a new translation library which can then lower the mix to LLVM IR. The new translation library can make use of the OpenMP IRBuilder.

Pros: Other users (C/C++) can use it. Can reuse Clang OpenMP code. Acceptable to MLIR community [1].

Cons: Has some ties to the LLVM dialect.

OpenMP dialect

The proposed plan involves writing an MLIR dialect for OpenMP. An MLIR dialect has types and operations. Types will be based on the LLVM dialect. Operations will be a mix of coarse-grained (e.g. omp.parallel, omp.target) and fine-grained (e.g. omp.flush). The detailed design of the dialect is TBD and will be based on what should be optimized. In the proposed plan the OpenMP dialect is created from FIR, which would require pass-through OpenMP operations in FIR. Another approach would be to lower the F18 AST with OpenMP directly to a mix of OpenMP and FIR dialects. This would require that FIR co-exist with the OpenMP dialect.

Next Steps

-> Address feedback provided after the presentation and reach an agreement on a plan.
-> Design a minimum viable MLIR dialect and get community acceptance.
-> Help with progressing the OpenMP IRBuilder.
-> Implement the accepted plan. Implementation would follow a vertical plan (going construct by construct):
   -> Represent the construct in OpenMP MLIR.
   -> Refactor the code for the construct in the OpenMP IRBuilder.
   -> Set up the translation library for OpenMP in MLIR to call the IRBuilder.
   -> Set up the transformation from the frontend to OpenMP MLIR for this construct.

Feedback (during the call):

-> Johannes agrees with the proposed plan (Plan 4) and does not see the ties to LLVM as being a disadvantage.

-> Hal Finkel notes that the OpenMP IRBuilder proposal does not face any opposition in the Clang community and hence is probably not a major risk.

Feedback (other):

-> Eric Schweitz: "I like the idea of adding OpenMP as its own dialect.  I think the design should allow this dialect to coexist with FIR and other dialects.  The dialect should be informed by the specific optimizations you plan to perform on the OpenMP semantic level, of course." Would prefer to lower F18 AST to a mix of OpenMP and FIR dialects. And would like the FIR to have nothing to do with OpenMP.

-> Steve Scalpone: "I think the design needs to be vetted against some of the scenarios that’ll be encountered when generating executable OpenMP code.  I have no reason to doubt what you’ve proposed will work but I do think it is important to consider how certain scenarios will be implemented.

Some topics to consider:

- How to implement loop collapse?

- How to implement loop distribution?

- How to implement simd?

- Is an outliner needed?  In MLIR, before, after?  Model to reference shared variables?

- How will target offload be implemented?

- Where will the OpenMP runtime API be exposed?"

References

1. OpenMP MLIR dialect discussion.
   https://groups.google.com/a/tensorflow.org/forum/#!topic/mlir/4Aj_eawdHiw

2. Johannes' OpenMP IRBuilder proposal.
   http://lists.flang-compiler.org/pipermail/flang-dev_lists.flang-compiler.org/2019-May/000197.html

3. MLIR
   https://github.com/tensorflow/mlir

Thanks,
Kiran

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

________________________________

This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
