[llvm-dev] FW: [RFC] Abstract Parallel IR Optimizations
Adve, Vikram Sadanand via llvm-dev
llvm-dev at lists.llvm.org
Wed Jun 6 22:52:57 PDT 2018
I definitely agree that LLVM needs better support for optimizing parallel programs. Comments inline:
From: Johannes Doerfert <jdoerfert at anl.gov>
Date: Wednesday, June 6, 2018 at 10:52 AM
To: LLVM-Dev <llvm-dev at lists.llvm.org>
Cc: Hal Finkel <hfinkel at anl.gov>, TB Schardl <neboat at mit.edu>, "Adve, Vikram Sadanand" <vadve at illinois.edu>, "Tian, Xinmin" <xinmin.tian at intel.com>, "llvmpar at lists.cs.illinois.edu" <llvmpar at lists.cs.illinois.edu>
Subject: [RFC] Abstract Parallel IR Optimizations
This is an RFC to add analyses and transformation passes into LLVM to
optimize programs based on an abstract notion of a parallel region.
== this is _not_ a proposal to add a new encoding of parallelism ==
Actually, I suspect that your three “abstract” interfaces below (ParallelRegionInfo, ParallelCommunicationInfo, ParallelIR/Builder) *are* a parallel IR definition, and what you call the “implementations” essentially encapsulate *target-specific* variations. IOW, this looks like a perhaps-very-preliminary parallel IR, not some generic abstract interface. I can only suspect this because I don’t have any information about what info or operations these interfaces provide.
In fact, I think it *is* a good idea to have a proper parallel IR to support the kinds of optimizations you’re describing. I just don’t think you can define a universal parallel IR for all the parallel languages and targets that LLVM may want to support. At a minimum, I think we need to experiment more before committing to one (*if* a single one even proves sufficient).
Instead of defining a specific parallel IR or interface, as you’re trying to do, it would be much better to provide IR-independent hooks in LLVM so that different parallel IRs can sit “above” the LLVM representation and support these and other optimizations. Your IR (or IR implementations, whatever you call them) could be layered in this way.
The earlier RFC that Hal Finkel and Xinmin Tian circulated was for such a set of hooks. In fact, those hooks also work on regions, but they are much more limited and are *not* meant to support parallel passes themselves: those are left to any IR built above them.
As you know, we (Hal, Xinmin, TB Schardl, George Stelle, Hashim Sharif, I, and a few others) are trying to refine that proposal to provide examples of multiple IRs the hooks can support, and to make a clearer argument for the soundness and correctness impact of the hooks on LLVM passes. You’ve been invited to our conference calls and to edit the Google Doc. It would be valuable if you could add examples of how your IR information could be connected to LLVM via these hooks.
More specific questions below.
We currently perform poorly when it comes to optimizations for parallel
codes. In fact, parallelizing your loops might actually prevent various
optimizations that would have been applied otherwise. One solution to
this problem is to teach the compiler about the semantics of the
parallel representation used. While this sounds tedious at first, it turns
out that we can perform key optimizations with reasonable implementation
effort (and thereby also reasonable maintenance costs). However, we have
various parallel representations that are already in use (KMPC,
GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).
Our proposal seeks to introduce parallelism-specific optimizations for
multiple representations while minimizing the implementation overhead.
This is done through an abstract notion of a parallel region which hides
the actual representation from the analysis and optimization passes.
In general, for analysis and optimization passes to work with multiple representations, you’d need a pretty rich abstract notion of a parallel region. I suspect that your design is pretty OpenMP-specific. Even within that context, it would have to capture quite a rich set of parallel constructs. Can you describe what exactly your abstract notion of a parallel region is, and how it differs from a concrete representation?
In the schema below, our current five optimizations (described in detail
here) are shown on the left, the abstract parallel IR interface is in
the middle, and the representation-specific implementations are on the right:
Optimization (A)nalysis/(T)ransformation Impl.
CodePlacementOpt \ /---> ParallelRegionInfo (A) ---------|-> KMPCImpl (A)
RegionExpander -\ | | GOMPImpl (A)
AttributeAnnotator -|-|---> ParallelCommunicationInfo (A) --/ ...
BarrierElimination -/ |
VariablePrivatization / \---> ParallelIR/Builder (T) -----------> KMPCImpl (T)
I wasn’t able to understand parts of this figure. What info do ParallelRegionInfo() and ParallelCommunicationInfo() provide? What operations does ParallelIR/Builder provide?
In our setting, a parallel region can be an outlined function called
through a runtime library, but also a fork-join/attach-reattach region
embedded in otherwise sequential code. The new passes will provide
parallelism-specific optimizations to all of them (where applicable).
There are various reasons why we believe this is a worthwhile effort
that belongs in the LLVM codebase, including:
Before adding something this broad into the LLVM code base, I think we need to understand a whole slew of things about it, starting with the questions above.
1) We improve the performance of parallel programs, today.
There are a number of important parallel languages and both language-specific and target-specific parallel optimizations. Just because these five optimizations improve OpenMP program performance isn’t enough to justify adding a new parallel IR (or interface) to LLVM, without knowing whether other languages, targets and optimizations could be correctly and effectively implemented using this approach.
2) It serves as a meaningful baseline for future discussions on
(optimized) parallel representations.
We can’t just add a new parallel IR design (abstract or concrete) to mainline just to serve as a baseline for future discussions.
3) It allows us to determine the pros and cons of the different schemes
when it comes to actual optimizations and inputs.
4) It helps to identify problems that might arise once we start to
transform parallel programs, but _before_ we commit to a specific
representation.
I suspect you *are* committing to a specific representation, although we can’t be sure until you provide more details.
Our prototypes for the OpenMP KMPC library (used by clang) already show
significant speedups for various benchmarks. They also exposed a problem,
previously unknown to me, between restrict/noalias pointers and
(potential) barriers (see Section 3 in ).
We are currently in the process of cleaning up the code, extending the
support for OpenMP constructs, and adding a second implementation for
embedded parallel regions. However, a first horizontal prototype
implementation is already available for review.
Inputs of any kind are welcome and reviewers are needed!
PhD Student / Researcher
Compiler Design Lab (Professor Hack) / Argonne National Laboratory
Saarland Informatics Campus, Germany / Lemont, IL 60439, USA
Building E1.3, Room 4.31
Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de<mailto:doerfert at cs.uni-saarland.de> / jdoerfert at anl.gov<mailto:jdoerfert at anl.gov>
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
// Interim Head, Department of Computer Science
// Donald B. Gillies Professor of Computer Science
// University of Illinois at Urbana-Champaign
// Admin Assistant: Amanda Foley - ajfoley2 at illinois.edu<mailto:ajfoley2 at illinois.edu>
// Google Hangouts: vikram.s.adve at gmail.com<mailto:vikram.s.adve at gmail.com> || Skype: vikramsadve
// Research page: http://vikram.cs.illinois.edu<http://vikram.cs.illinois.edu/>