[llvm-dev] [RFC] IR-level Region Annotations

Tue Jan 17 21:55:31 PST 2017

> On Jan 17, 2017, at 8:09 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> 
> On 01/13/2017 12:13 AM, Adve, Vikram Sadanand via llvm-dev wrote:
>>> (d) Add a small number of LLVM intrinsics for region or loop annotations,
>>>     represent the directive/clause names using metadata and the remaining
>>>     information using arguments.
>>>  Here we're proposing (d),
>> I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level.  Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.
>> 
>> However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP.  Two near-term examples I can think of are C++17 and Cilk (or Cilk++).  I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC.  Many of the *parallel*  optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).
>> 
>> Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible.
> 
> We obviously need to work out the details here, but one motivation is to allow the same facility to both represent concepts common to many programming models as well as programming-model-specific concepts.

Yes, I agree.  There will inevitably be programming-model-specific features that are not supported by any generic abstraction.  The hope is that they will be few, especially for the "smaller" languages.

> Also, I'd like to be able to transition from programming-model-specific representations (where I imagine most things will start) toward abstracted concepts. The goal is to retain programming-model-specific semantics while allowing the creation of transformations and analysis which deal with abstract concepts. One way we might accomplish this is by using both like this:
> 
>   1. A frontend generates region annotations. A frontend like Clang will generate (mostly) programming-model-specific region annotations. Frontends for other languages might directly use the abstract concepts for their region annotations.
> 
>  2. During optimization, a transformation pass analyzes programming-model-specific region annotations and, if legal, transforms them into abstract-concept annotations. For example, it might rewrite:
> 
>  !"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"
> 
> The barrier is then a general concept that transformations can understand (and use, for example, to eliminate redundant barriers). It is tagged with !"openmp" so that in the end, should it survive, the concept will be lowered using OpenMP.

Yes, this is exactly what I have in mind too.  We can discuss the details (what particular front-ends should generate directly, and what back-end components can be shared even when doing programming-model-specific code generation), but this flow has many advantages.
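
To make the translation in step (2) concrete, here is a minimal sketch in Python rather than as an LLVM pass. Only the "omp.barrier" to "llvm.parallel.barrier" / "openmp" mapping comes from the example above; the `Annotation` type, the `ABSTRACT_FORMS` table, and the `is_legal` hook are hypothetical names used purely for illustration.

```python
from typing import Callable, NamedTuple, Optional


class Annotation(NamedTuple):
    name: str                     # metadata string, e.g. "omp.barrier"
    origin: Optional[str] = None  # programming-model tag, e.g. "openmp"


# Hypothetical mapping table. Only this one row appears in the thread.
ABSTRACT_FORMS = {
    "omp.barrier": ("llvm.parallel.barrier", "openmp"),
}


def abstract(ann: Annotation,
             is_legal: Callable[[Annotation], bool] = lambda a: True) -> Annotation:
    """Rewrite a model-specific annotation into its abstract form.

    The annotation is returned unchanged when no abstract form is known
    or when the (assumed) legality check rejects the rewrite.
    """
    target = ABSTRACT_FORMS.get(ann.name)
    if target is None or not is_legal(ann):
        return ann
    name, origin = target
    return Annotation(name, origin)
```

The origin tag travels with the abstract name, so a later lowering step can still recover the original programming model if the annotation survives optimization.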

Some specific goals I’d like to see are:
+ Have as many optimizations and back-end components as possible be driven by the programming-model-agnostic information and shared among multiple languages, i.e., minimize the need for passes that use the programming-model-specific information.
+ Allow concepts in different languages to be mixed and matched to maximize performance, e.g., a work-stealing scheduler used with an OMP parallel loop; a static schedule used with a Cilk_for parallel loop; a SIMD directive and hints used with a Cilk_for loop; etc.
+ In the same vein, allow optimization and code generation passes to leverage available features of run-time systems and target hardware, to maximize performance in similar ways.
+ (This is not a separate goal, but rather a strategy to enable the previous two goals.)  Use the annotations to decouple front-ends and upstream auto-parallelization passes from optimizations and code generation, so that the optimizations and code generation phases "don’t care" what source language(s) or other mechanisms were used to parallelize code.
+ Allow a flexible parallel run-time system that can span multiple hardware targets, e.g., a pipeline that runs some pipeline stages on a shared memory multicore host and some on one or more GPUs.

I didn’t explicitly spell out other goals like the ones in your original email, especially ensuring that standard optimization passes (const. prop.; redundancy elim; strength reduction; etc.) continue to be as effective as possible, while minimizing the need to rewrite them to respect parallel semantics.  E.g., avoiding outlining in the front-end is likely to be an important requirement.

> 
> 3. During optimization, transformations optimize abstract-concept annotations (e.g. eliminate redundant barriers, fuse parallel regions, etc.)
> 
> 4. Later in the pipeline, programming-model-specific code lowers annotations for each programming model into concrete IR (e.g. runtime function calls). For abstract concepts without a specific programming-model tag, some default programming model is selected.

For code with programming-model specific tags, it may still be possible to map into a more general run-time.  See examples above.
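
A rough sketch of how the lowering choice in step (4) might look, including the case where tagged code is still mapped onto a more general run-time. Everything here is an assumption made for illustration: the function name `select_runtime`, the `overrides` table, and the run-time names are hypothetical and do not correspond to any existing LLVM interface.

```python
from typing import Dict, Optional, Set


def select_runtime(origin: Optional[str],
                   runtimes: Set[str],
                   default: str,
                   overrides: Optional[Dict[str, str]] = None) -> str:
    """Pick the run-time used to lower one abstract annotation.

    origin    -- programming-model tag on the annotation, or None if untagged
    runtimes  -- run-times available for this target
    default   -- run-time used for untagged or unsupported annotations
    overrides -- optional map from a tag to a more general run-time
    """
    # A tagged annotation may be redirected to a more general run-time.
    if overrides and origin in overrides and overrides[origin] in runtimes:
        return overrides[origin]
    # Otherwise honor the tag when its own run-time is available.
    if origin in runtimes:
        return origin
    # Untagged annotations (or unavailable run-times) use the default.
    return default
```

For instance, `select_runtime("openmp", {"openmp", "worksteal"}, "openmp", {"openmp": "worksteal"})` would lower an OpenMP-tagged loop through the (hypothetical) work-stealing run-time, while an untagged annotation falls back to the default.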

> The programming-model-specific to abstract-concept translation in (2) can sometimes be done on a syntactic basis alone (we already do this, in fact, for atomics), but sometimes will require analysis that can be done only after inlining/IPA (to make sure, for example, that the parallel region does not contain certain classes of runtime-library calls). Plus, this allows the translation logic to be shared easily by different frontends.
> 
> Thoughts?

I generally agree.  My main additional point (perhaps also what you had in mind) is that we should aim to maximize flexibility in the opts. and code-gen passes, while minimizing the dependence on programming-model-specific semantics.
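
As an illustration of a pass that depends only on programming-model-agnostic information, here is a deliberately simplified sketch of redundant-barrier elimination. It assumes a straight-line sequence of operations modeled as (name, origin) pairs and treats only back-to-back barriers as redundant; a real pass would need control-flow and memory-effect analysis. The function name and data layout are hypothetical.

```python
from typing import List, Optional, Tuple

BARRIER = "llvm.parallel.barrier"  # abstract name from the example earlier in this thread

Op = Tuple[str, Optional[str]]     # (annotation or operation name, origin tag)


def drop_redundant_barriers(ops: List[Op]) -> List[Op]:
    """Remove a barrier that immediately follows another barrier.

    The origin tag is carried along but never inspected, which is the
    point: the decision uses only the abstract concept.
    """
    out: List[Op] = []
    for name, origin in ops:
        if name == BARRIER and out and out[-1][0] == BARRIER:
            continue  # back-to-back barrier: keep only the first one
        out.append((name, origin))
    return out
```

Because the pass never looks at the tag, the same code would apply unchanged to barriers that originated from any front-end.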

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve at illinois.edu
// http://llvm.org

> 
> -Hal
> 
>> There are a number of parallel constructs that are common to many, perhaps most, of these parallel languages: parallel loops, barriers, associative and non-associative reductions, loop scheduling specifications, affinity specifications, etc. It seems to me that we could define a core set of parallel constructs that capture *most* of the common requirements of these languages.  How this is encoded is really a separate issue, e.g., perhaps(*) we could use approach (d).  We’d essentially be doing what you’re proposing for OpenMP, but with more language-neutral structure and semantics for the parallel constructs.  The corresponding language features would be directly lowered to these constructs.  The missed features could use the per-language encodings you’re describing, but with a good design there should be relatively few such features.  The overall design complexity and the impact on the existing passes should be no greater, and potentially less, than what you’ve proposed.
>> 
>> 
>> (*) Or possibly some combination of (b) and (d).  In particular, after thinking about this quite a bit recently, I’m not convinced that flexible parallel control flow, e.g., task creation and joins, is best encoded as intrinsics.  Adding a very small number of first-class instructions for it may achieve this goal more effectively.  But I’d be happy to be convinced otherwise.  In any case, that’s a detail we can discuss separately; it doesn’t change my main point that we should keep most of the parallel IR as language-neutral as possible.
>> 
>> 
>> --Vikram
>> 
>> 
>>> On Jan 11, 2017, at 4:49 PM, via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> 
>>> Date: Wed, 11 Jan 2017 16:02:52 -0600
>>> From: Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org>
>>> To: llvm-dev <llvm-dev at lists.llvm.org>
>>> Subject: [llvm-dev] [RFC] IR-level Region Annotations
>>> 
>>> A Proposal for adding an experimental IR-level region-annotation
>>> infrastructure
>>> =============================================================================
>>> 
>>> Hal Finkel (ANL) and Xinmin Tian (Intel)
>>> 
>>> This is a proposal for adding an experimental infrastructure to support
>>> annotating regions in LLVM IR, making use of intrinsics and metadata, and
>>> a generic analysis to allow transformations to easily make use of these
>>> annotated regions. This infrastructure is flexible enough to support
>>> representation of directives for parallelization, vectorization, and
>>> offloading of both loops and more-general code regions. Under this scheme,
>>> the conceptual distance between source-level directives and the region
>>> annotations need not be significant, making the incremental cost of
>>> supporting new directives and modifiers often small. It is not, however,
>>> specific to those use cases.
>>> 
>>> Problem Statement
>>> =================
>>> There has been a series of discussions on LLVM IR extensions for
>>> representing region and loop annotations for parallelism, and other
>>> user-guided transformations, among both industrial and academic members
>>> of the LLVM community. Increasing the quality of our OpenMP
>>> implementation is an important motivating use case, but certainly not
>>> the only one. For OpenMP in particular, we've discussed having an IR
>>> representation for years. Presently, all OpenMP pragmas are transformed
>>> directly into runtime-library calls in Clang, and outlining (i.e.
>>> extracting parallel regions into their own functions to be invoked by
>>> the runtime library) is done in Clang as well. Our implementation does
>>> not further optimize OpenMP constructs, and a lot of thought has been
>>> put into how we might improve this. For some optimizations, such as
>>> redundant barrier removal, we could use a TargetLibraryInfo-like
>>> mechanism to recognize frontend-generated runtime calls and proceed
>>> from there. Cases where we lose pointer-aliasing information,
>>> information on loop bounds, etc. we could address by improving our
>>> inter-procedural-analysis capabilities. We should do that regardless.
>>> However, there are important cases where the underlying scheme we want
>>> to use to lower the various parallelism constructs, especially when
>>> targeting accelerators, changes depending on what is in the parallel
>>> region. In important cases where we can see everything (i.e. there
>>> aren't arbitrary external calls), code generation should proceed in a
>>> way that is very different from the general case. To have a sensible
>>> implementation, this must be done after inlining. When using LTO, this
>>> should be done during the link-time phase. As a result, we must move
>>> away from our purely-front-end based lowering scheme. The question is
>>> what to do instead, and how to do it in a way that is generally useful
>>> to the entire community.
>>> 
>>> Designs previously discussed can be classified into four categories:
>>> 
>>> (a) Add a large number of new kinds of LLVM metadata, and use them to
>>>     annotate each necessary instruction for parallelism, data attributes,
>>>     etc.
>>> (b) Add several new LLVM instructions such as, for parallelism, fork,
>>>     spawn, join, barrier, etc.
>>> (c) Add a large number of LLVM intrinsics for directives and clauses, each
>>>     intrinsic representing a directive or a clause.
>>> (d) Add a small number of LLVM intrinsics for region or loop annotations,
>>>     represent the directive/clause names using metadata and the remaining
>>>     information using arguments.
>>> 
>>> Here we're proposing (d), and below is a brief pros and cons analysis
>>> based on these discussions and our own experiences of supporting
>>> region/loop annotations in LLVM-based compilers. The table below shows a
>>> short summary of our analysis.
>>> 
>>> Various commercial compilers (e.g. from Intel, IBM, Cray, PGI), and GCC
>>> [1,2], have IR-level representations for parallelism constructs. Based on
>>> experience from these previous developments, we'd like a solution for
>>> LLVM that maximizes optimization enablement while minimizing the
>>> maintenance costs and complexity increase experienced by the community
>>> as a whole.
>>> 
>>> Representing the desired information in the LLVM IR is just the first
>>> step. The challenge is to maintain the desired semantics without blocking
>>> useful optimizations. With options (c) and (d), dependencies can be
>>> preserved mainly based on the use/def chain of the arguments of each
>>> intrinsic, and a manageable set of LLVM analyses and transformations can
>>> be made aware of certain kinds of annotations in order to enable specific
>>> optimizations. In this regard, options (c) and (d) are close with respect
>>> to maintenance efforts. However, based on our experiences, option (d) is
>>> preferable because it is easier to extend to support new directives and
>>> clauses in the future without the need to add new intrinsics as required
>>> by option (c).
>>> 
>>> Table 1. Pros/cons summary of LLVM IR experimental extension options
>>> 
>>> --------+----------------------+------------------------------------------
>>> Options | Pros                 | Cons
>>> --------+----------------------+------------------------------------------
>>> (a)     | No need to add new   | LLVM passes do not always maintain
>>>         | instructions or      | metadata. Need to educate many passes
>>>         | new intrinsics.      | (if not all) to understand and handle
>>>         |                      | them.
>>> --------+----------------------+------------------------------------------
>>> (b)     | Parallelism becomes a| Huge effort for extending all LLVM
>>>         | first-class citizen. | passes and code generation to support
>>>         |                      | new instructions. A large set of
>>>         |                      | information still needs to be
>>>         |                      | represented using other means.
>>> --------+----------------------+------------------------------------------
>>> (c)     | Less impact on the   | A large number of intrinsics must be
>>>         | existing LLVM passes.| added. Some of the optimizations need
>>>         | Fewer requirements   | to be educated to understand them.
>>>         | for passes to        |
>>>         | maintain metadata.   |
>>> --------+----------------------+------------------------------------------
>>> (d)     | Minimal impact on    | Some of the optimizations need to be
>>>         | existing LLVM        | educated to understand them.
>>>         | optimization passes. | No requirement for all passes to
>>>         | Directive and clause | maintain a large set of metadata with
>>>         | names use metadata   | values.
>>>         | strings.             |
>>> --------+----------------------+------------------------------------------
>>> 
>>> 
>>> Regarding (a), LLVM already uses metadata for certain loop information
>>> (e.g. annotations directing loop transformations and assertions about
>>> loop-carried dependencies), but there is no natural or consistent way to
>>> extend this scheme to represent necessary data-movement or region
>>> information.
>>> 
>>> 
>>> New Intrinsics for Region and Value Annotations
>>> ==============================================
>>> The following new (experimental) intrinsics are proposed which allow:
>>> 
>>> a) Annotating a code region marked with directives / pragmas,
>>> b) Annotating values associated with the region (or loops), that is, those
>>>    values associated with directives / pragmas.
>>> c) Providing information on LLVM IR transformations needed for the
>>>    annotated code regions (or loops).
>>> 
>>> These can be used both by frontends and also by transformation passes (e.g.
>>> automated parallelization). The names used here are similar to those used
>>> by our internal prototype, but obviously we expect a community bikeshed
>>> discussion.
>>> 
>>> def int_experimental_directive : Intrinsic<[], [llvm_metadata_ty],
>>>                                    [IntrArgMemOnly],
>>>                                    "llvm.experimental.directive">;
>>> 
>>> def int_experimental_dir_qual : Intrinsic<[], [llvm_metadata_ty],
>>>                                    [IntrArgMemOnly],
>>>                                    "llvm.experimental.dir.qual">;
>>> 
>>> def int_experimental_dir_qual_opnd : Intrinsic<[],
>>>                                    [llvm_metadata_ty, llvm_any_ty],
>>>                                    [IntrArgMemOnly],
>>>                                    "llvm.experimental.dir.qual.opnd">;
>>> 
>>> def int_experimental_dir_qual_opndlist : Intrinsic<[],
>>>                                    [llvm_metadata_ty, llvm_vararg_ty],
>>>                                    [IntrArgMemOnly],
>>>                                    "llvm.experimental.dir.qual.opndlist">;
>>> 
>>> Note that calls to these intrinsics might need to be annotated with the
>>> convergent attribute when they represent fork/join operations, barriers,
>>> and similar.
>>> 
>>> Usage Examples
>>> ==============
>>> 
>>> This section shows a few examples using these experimental intrinsics.
>>> LLVM developers who use these intrinsics can define their own MDString
>>> values. Full details on using these intrinsics to represent OpenMP 4.5
>>> constructs are described in [1][3].
>>> 
>>> 
>>> Example I: An OpenMP combined construct
>>> 
>>> #pragma omp target teams distribute parallel for simd
>>>   loop
>>> 
>>> LLVM IR
>>> -------
>>> call void @llvm.experimental.directive(metadata !0)
>>> call void @llvm.experimental.directive(metadata !1)
>>> call void @llvm.experimental.directive(metadata !2)
>>>   loop
>>> call void @llvm.experimental.directive(metadata !6)
>>> call void @llvm.experimental.directive(metadata !5)
>>> call void @llvm.experimental.directive(metadata !4)
>>> 
>>> !0 = metadata !{metadata !DIR.OMP.TARGET}
>>> !1 = metadata !{metadata !DIR.OMP.TEAMS}
>>> !2 = metadata !{metadata !DIR.OMP.DISTRIBUTE.PARLOOP.SIMD}
>>> 
>>> !6 = metadata !{metadata !DIR.OMP.END.DISTRIBUTE.PARLOOP.SIMD}
>>> !5 = metadata !{metadata !DIR.OMP.END.TEAMS}
>>> !4 = metadata !{metadata !DIR.OMP.END.TARGET}
>>> 
>>> Example II: Assume x,y,z are int variables, and s is a non-POD variable.
>>>             Then, private(x,y,s,z) is represented as:
>>> 
>>> LLVM IR
>>> -------
>>> call void @llvm.experimental.dir.qual.opndlist(
>>>                 metadata !1, %x, %y, metadata !2, %s, %ctor, %dtor, %z)
>>> 
>>> !1 = metadata !{metadata !QUAL.OMP.PRIVATE}
>>> !2 = metadata !{metadata !QUAL.OPND.NONPOD}
>>> 
>>> Example III: A prefetch pragma example
>>> 
>>> // issue vprefetch1 for xp with a distance of 20 vectorized iterations ahead
>>> // issue vprefetch0 for yp with a distance of 10 vectorized iterations ahead
>>> #pragma prefetch xp:1:20 yp:0:10
>>> for (i=0; i<2*N; i++) { xp[i*m + j] = -1; yp[i*n +j] = -2; }
>>> 
>>> LLVM IR
>>> -------
>>> call void @llvm.experimental.directive(metadata !0)
>>> call void @llvm.experimental.dir.qual.opndlist(metadata !1, %xp, 1, 20,
>>>                                                metadata !1, %yp, 0, 10)
>>>   loop
>>> call void @llvm.experimental.directive(metadata !3)
>>> 
>>> References
>>> ==========
>>> 
>>> [1] LLVM Framework and IR extensions for Parallelization, SIMD
>>>     Vectorization and Offloading Support. SC'2016 LLVM-HPC3 Workshop.
>>>     (Xinmin Tian et al.) Salt Lake City, Utah.
>>> 
>>> [2] Extending LoopVectorizer towards supporting OpenMP4.5 SIMD and outer
>>>     loop auto-vectorization. (Hideki Saito et al.) LLVM Developers'
>>>     Meeting 2016, San Jose.
>>> 
>>> [3] Intrinsics, Metadata, and Attributes: The Story continues! (Hal Finkel)
>>>     LLVM Developers' Meeting 2016, San Jose.
>>> 
>>> [4] LLVM Intrinsic Function and Metadata String Interface for Directive (or
>>>     Pragmas) Representation. Specification Draft v0.9, Intel Corporation,
>>>     2016.
>>> 
>>> 
>>> Acknowledgements
>>> ================
>>> We would like to thank Chandler Carruth (Google), Johannes Doerfert
>>> (Saarland Univ.), Yaoqing Gao (Huawei), Michael Wong (Codeplay), Ettore
>>> Tiotto, Carlo Bertolli, Bardia Mahjour (IBM), and all other LLVM-HPC IR
>>> Extensions WG members for their constructive feedback on the LLVM
>>> framework and IR extension proposal.
>>> 
>>> Proposed Implementation
>>> =======================
>>> 
>>> Two sets of patches that support these experimental intrinsics and
>>> demonstrate their usage are ready for community review.
>>> 
>>> a) Clang patches that support core OpenMP pragmas using this approach.
>>> b) W-Region framework patches: CFG restructuring to form
>>>    single-entry-single-exit work regions (W-Regions) based on annotations,
>>>    demand-driven intrinsic parsing, WRegionInfo collection and analysis
>>>    passes, and dump functions for WRegionInfo.
>>> 
>>> On top of this functionality, we will provide the transformation patches
>>> for core OpenMP constructs (e.g. start with "#pragma omp parallel for" loop
>>> for lowering and outlining, and "#pragma omp simd" to hook it up with
>>> LoopVectorize.cpp). We have internal implementations for many constructs
>>> now. We will break this functionality up to create a series of patches for
>>> community review.
>>> 
>>> -- 
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory