[llvm-dev] [RFC] IR-level Region Annotations

Wed Jan 11 14:02:52 PST 2017

A Proposal for adding an experimental IR-level region-annotation infrastructure
===============================================================================

Hal Finkel (ANL) and Xinmin Tian (Intel)

This is a proposal for adding an experimental infrastructure to support
annotating regions in LLVM IR, making use of intrinsics and metadata, and
a generic analysis to allow transformations to easily make use of these
annotated regions. This infrastructure is flexible enough to support
representation of directives for parallelization, vectorization, and
offloading of both loops and more-general code regions. Under this scheme,
the conceptual distance between source-level directives and the region
annotations need not be significant, making the incremental cost of
supporting new directives and modifiers often small. It is not, however,
specific to those use cases.

Problem Statement
=================
There has been a series of discussions, among both industrial and academic
members of the LLVM community, on LLVM IR extensions for representing region
and loop annotations for parallelism and other user-guided transformations.
Increasing the quality of our OpenMP implementation is an important motivating
use case, but certainly not the only one. For OpenMP in particular, we've
discussed having an IR representation for years. Presently, all OpenMP pragmas
are transformed directly into runtime-library calls in Clang, and outlining
(i.e. extracting parallel regions into their own functions to be invoked by the
runtime library) is done in Clang as well. Our implementation does not further
optimize OpenMP constructs, and a lot of thought has been put into how we might
improve this. For some optimizations, such as redundant barrier removal, we
could use a TargetLibraryInfo-like mechanism to recognize frontend-generated
runtime calls and proceed from there. Cases where we lose pointer-aliasing
information, information on loop bounds, etc. could be improved by improving
our inter-procedural-analysis capabilities. We should do that regardless.
However, there are important cases where the underlying scheme we want to use
to lower the various parallelism constructs, especially when targeting
accelerators, changes depending on what is in the parallel region. In important
cases where we can see everything (i.e. there aren't arbitrary external calls),
code generation should proceed in a way that is very different from the general
case. To have a sensible implementation, this must be done after inlining. When
using LTO, this should be done during the link-time phase. As a result, we must
move away from our purely-front-end based lowering scheme. The question is what
to do instead, and how to do it in a way that is generally useful to the entire
community.

Designs previously discussed can be classified into four categories:

(a) Add a large number of new kinds of LLVM metadata, and use them to annotate
     each necessary instruction for parallelism, data attributes, etc.
(b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
     join, barrier, etc.
(c) Add a large number of LLVM intrinsics for directives and clauses, each
     intrinsic representing a directive or a clause.
(d) Add a small number of LLVM intrinsics for region or loop annotations,
     represent the directive/clause names using metadata and the remaining
     information using arguments.

Here we're proposing (d), and below is a brief pros and cons analysis based on
these discussions and our own experiences of supporting region/loop annotations
in LLVM-based compilers. The table below shows a short summary of our analysis.

Various commercial compilers (e.g. from Intel, IBM, Cray, PGI), and GCC [1,2],
have IR-level representations for parallelism constructs. Based on experience
from these previous developments, we'd like a solution for LLVM that maximizes
optimization enablement while minimizing the maintenance costs and complexity
increase experienced by the community as a whole.

Representing the desired information in the LLVM IR is just the first step. The
challenge is to maintain the desired semantics without blocking useful
optimizations. With options (c) and (d), dependencies can be preserved mainly
based on the use/def chain of the arguments of each intrinsic, and a manageable
set of LLVM analyses and transformations can be made aware of certain kinds of
annotations in order to enable specific optimizations. In this regard,
options (c) and (d) are close with respect to maintenance efforts. However,
based on our experiences, option (d) is preferable because it is easier to
extend to support new directives and clauses in the future without the need to
add new intrinsics as required by option (c).

Table 1. Pros/cons summary of LLVM IR experimental extension options

--------+-----------------------+----------------------------------------------
Options | Pros                  | Cons
--------+-----------------------+----------------------------------------------
(a)     | No need to add new    | LLVM passes do not always maintain metadata.
        | instructions or new   | Need to educate many passes (if not all) to
        | intrinsics.           | understand and handle them.
--------+-----------------------+----------------------------------------------
(b)     | Parallelism becomes a | Huge effort for extending all LLVM passes and
        | first-class citizen.  | code generation to support new instructions.
        |                       | A large set of information still needs to be
        |                       | represented using other means.
--------+-----------------------+----------------------------------------------
(c)     | Less impact on the    | A large number of intrinsics must be added.
        | existing LLVM passes. | Some of the optimizations need to be
        | Fewer requirements    | educated to understand them.
        | for passes to         |
        | maintain metadata.    |
--------+-----------------------+----------------------------------------------
(d)     | Minimal impact on     | Some of the optimizations need to be
        | existing LLVM         | educated to understand them.
        | optimization passes.  | No requirement for all passes to maintain a
        | Directive and clause  | large set of metadata with values.
        | names use metadata    |
        | strings.              |
--------+-----------------------+----------------------------------------------

Regarding (a), LLVM already uses metadata for certain loop information (e.g.
annotations directing loop transformations and assertions about loop-carried
dependencies), but there is no natural or consistent way to extend this scheme
to represent necessary data-movement or region information.

New Intrinsics for Region and Value Annotations
===============================================
The following new (experimental) intrinsics are proposed which allow:

a) Annotating a code region marked with directives / pragmas.
b) Annotating values associated with the region (or loops), that is, those
    values associated with directives / pragmas.
c) Providing information on LLVM IR transformations needed for the annotated
    code regions (or loops).

These can be used both by frontends and also by transformation passes (e.g.
automated parallelization). The names used here are similar to those used by
our internal prototype, but obviously we expect a community bikeshed
discussion.

def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
                                           [IntrArgMemOnly],
                                           "llvm.experimental.directive">;

def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
                                          [IntrArgMemOnly],
                                          "llvm.experimental.dir.qual">;

def int_experimental_dir_qual_opnd : Intrinsic<[],
                                        [llvm_metadata_ty, llvm_any_ty],
                                        [IntrArgMemOnly],
                                        "llvm.experimental.dir.qual.opnd">;

def int_experimental_dir_qual_opndlist : Intrinsic<[],
                                        [llvm_metadata_ty, llvm_vararg_ty],
                                        [IntrArgMemOnly],
                                        "llvm.experimental.dir.qual.opndlist">;

Note that calls to these intrinsics might need to be annotated with the
convergent attribute when they represent fork/join operations, barriers, and
similar.

Usage Examples
==============

This section shows a few examples using these experimental intrinsics.
LLVM developers who use these intrinsics can define their own MDString values.
The details of using these intrinsics to represent OpenMP 4.5 constructs are
described in [1][3].

Example I: An OpenMP combined construct

#pragma omp target teams distribute parallel for simd
   loop

LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.directive(metadata !1)
call void @llvm.experimental.directive(metadata !2)
   loop
call void @llvm.experimental.directive(metadata !6)
call void @llvm.experimental.directive(metadata !5)
call void @llvm.experimental.directive(metadata !4)

!0 = metadata !{metadata !DIR.OMP.TARGET}
!1 = metadata !{metadata !DIR.OMP.TEAMS}
!2 = metadata !{metadata !DIR.OMP.DISTRIBUTE.PARLOOP.SIMD}

!6 = metadata !{metadata !DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD}
!5 = metadata !{metadata !DIR.OMP.END.TEAMS}
!4 = metadata !{metadata !DIR.OMP.END.TARGET}
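
As an illustration only (this is not the proposed implementation), a pass that
consumes these annotations has to pair each begin directive with its END
counterpart and recover the nesting. The C++ sketch below does that for a flat
sequence of directive names using a stack. Region, isEndMarker, and
buildRegions are hypothetical names, and the sketch assumes the
DIR.OMP.<name> / DIR.OMP.END.<name> naming convention seen in Example I. It
works on plain strings rather than on LLVM IR.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical region record: directive name, index of the enclosing region
// (-1 for a top-level region), and the positions of the begin/end markers.
struct Region {
  std::string Name;
  int Parent;
  std::size_t Begin;
  std::size_t End;
};

// Assumed naming convention: "DIR.OMP.END.X" closes "DIR.OMP.X".
static const std::string EndPrefix = "DIR.OMP.END.";

// Returns true if Tag is an end marker and stores the name of the matching
// begin directive in BeginName.
bool isEndMarker(const std::string &Tag, std::string &BeginName) {
  if (Tag.compare(0, EndPrefix.size(), EndPrefix) != 0)
    return false;
  BeginName = "DIR.OMP." + Tag.substr(EndPrefix.size());
  return true;
}

// Pairs begin/end markers with a stack and records the nesting. Throws
// std::runtime_error if the markers are not properly nested.
std::vector<Region> buildRegions(const std::vector<std::string> &Markers) {
  std::vector<Region> Regions;
  std::vector<int> Open; // indices into Regions of regions not yet closed
  for (std::size_t I = 0; I < Markers.size(); ++I) {
    std::string BeginName;
    if (!isEndMarker(Markers[I], BeginName)) {
      int Parent = Open.empty() ? -1 : Open.back();
      Regions.push_back({Markers[I], Parent, I, I});
      Open.push_back(static_cast<int>(Regions.size()) - 1);
      continue;
    }
    if (Open.empty() || Regions[Open.back()].Name != BeginName)
      throw std::runtime_error("unbalanced end marker: " + Markers[I]);
    Regions[Open.back()].End = I;
    Open.pop_back();
  }
  if (!Open.empty())
    throw std::runtime_error("unclosed region: " + Regions[Open.back()].Name);
  // Every marker was either a begin or its matching end.
  assert(Regions.size() * 2 == Markers.size());
  return Regions;
}
```

Given the three begin names and three END names from Example I in order,
buildRegions would return three regions, with TEAMS nested in TARGET and
DISTRIBUTE.PARLOOP.SIMD nested in TEAMS. A mismatched or missing END marker
raises an exception instead of producing a partial result.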

Example II: Assume x,y,z are int variables, and s is a non-POD variable.
             Then, lastprivate(x,y,s,z) is represented as:

LLVM IR
-------
call void @llvm.experimental.dir.qual.opndlist(
                 metadata !1, %x, %y, metadata !2, %s, %ctor, %dtor, %z)

!1 = metadata !{metadata !QUAL.OMP.LASTPRIVATE}
!2 = metadata !{metadata !QUAL.OPND.NONPOD}
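
To make the operand layout above concrete, here is a minimal C++ sketch of how
a consumer might decode the operands that follow the clause-name metadata. It
is an assumption-laden illustration, not part of the proposal: Operand,
ClauseItem, and decodeOperandList are hypothetical names, operands are modeled
as plain strings instead of LLVM values, and the only modifier it understands
is QUAL.OPND.NONPOD, which we assume introduces a (variable, constructor,
destructor) triple as in Example II.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one intrinsic argument: either a metadata string
// (such as "QUAL.OPND.NONPOD") or an SSA value name (such as "%x").
struct Operand {
  bool IsMetadata;
  std::string Text;
};

// One decoded clause item. Ctor and Dtor stay empty for POD variables.
struct ClauseItem {
  std::string Var;
  std::string Ctor;
  std::string Dtor;
  bool NonPOD;
};

// Decodes the operands that follow the clause-name metadata. A plain value is
// a POD item; the marker "QUAL.OPND.NONPOD" announces a (var, ctor, dtor)
// triple. Throws std::invalid_argument on a malformed list.
std::vector<ClauseItem> decodeOperandList(const std::vector<Operand> &Ops) {
  std::vector<ClauseItem> Items;
  for (std::size_t I = 0; I < Ops.size(); ++I) {
    if (!Ops[I].IsMetadata) {
      Items.push_back({Ops[I].Text, "", "", false});
      continue;
    }
    if (Ops[I].Text != "QUAL.OPND.NONPOD")
      throw std::invalid_argument("unknown operand modifier: " + Ops[I].Text);
    // The marker must be followed by three plain values.
    if (I + 3 >= Ops.size())
      throw std::invalid_argument("truncated non-POD operand group");
    for (std::size_t J = I + 1; J <= I + 3; ++J)
      if (Ops[J].IsMetadata)
        throw std::invalid_argument("metadata inside non-POD operand group");
    Items.push_back({Ops[I + 1].Text, Ops[I + 2].Text, Ops[I + 3].Text, true});
    I += 3; // skip the three operands consumed by this group
  }
  // Each decoded item consumed at least one operand.
  assert(Items.size() <= Ops.size());
  return Items;
}
```

Applied to the operands of Example II after the clause name, this would
produce four items: two POD items, one non-POD item carrying its constructor
and destructor, and a final POD item.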

Example III: A prefetch pragma example

// issue vprefetch1 for xp with a distance of 20 vectorized iterations ahead
// issue vprefetch0 for yp with a distance of 10 vectorized iterations ahead
#pragma prefetch xp:1:20 yp:0:10
for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }

LLVM IR
-------
call void @llvm.experimental.directive(metadata !0)
call void @llvm.experimental.dir.qual.opndlist(metadata !1, %xp, 1, 20,
                                               metadata !1, %yp, 0, 10)
   loop
call void @llvm.experimental.directive(metadata !3)

References
==========

[1] LLVM Framework and IR extensions for Parallelization, SIMD Vectorization
     and Offloading Support. SC'2016 LLVM-HPC3 Workshop. (Xinmin Tian et al.)
     Salt Lake City, Utah.

[2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer loop
     auto-vectorization. (Hideki Saito et al.) LLVM Developers' Meeting 2016,
     San Jose.

[3] Intrinsics, Metadata, and Attributes: The Story continues! (Hal Finkel)
     LLVM Developers' Meeting 2016, San Jose.

[4] LLVM Intrinsic Function and Metadata String Interface for Directive (or
     Pragmas) Representation. Specification Draft v0.9, Intel Corporation,
     2016.

Acknowledgements
================
We would like to thank Chandler Carruth (Google), Johannes Doerfert (Saarland
Univ.), Yaoqing Gao (Huawei), Michael Wong (Codeplay), Ettore Tiotto,
Carlo Bertolli, Bardia Mahjour (IBM), and all other LLVM-HPC IR Extensions WG
members for their constructive feedback on the LLVM framework and IR extension
proposal.

Proposed Implementation
=======================

Two sets of patches that support these experimental intrinsics and demonstrate
their usage are ready for community review.

a) Clang patches that support core OpenMP pragmas using this approach.
b) W-Region framework patches: CFG restructuring to form
    single-entry-single-exit work regions (W-Regions) based on annotations,
    demand-driven intrinsic parsing, WRegionInfo collection and analysis
    passes, and dump functions for WRegionInfo.

On top of this functionality, we will provide the transformation patches for
core OpenMP constructs (e.g. starting with lowering and outlining of
"#pragma omp parallel for" loops, and hooking "#pragma omp simd" up with
LoopVectorize.cpp). We have internal implementations for many constructs now.
We will break this functionality up to create a series of patches for
community review.

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory