[llvm-dev] Publication LLVM Related Publications Submission

Tue Nov 28 09:05:03 PST 2017

Hello,

I would like to submit two papers that use LLVM to the
Related Publications section.

Both papers focus on code isolation applied to perform piecewise compiler
optimizations. The code isolation process is performed by CERE, an open
source tool based on LLVM.

The second paper is an extended version of the first one.

1) Piecewise Holistic Autotuning of Compiler and Runtime Parameters

@inproceedings{popov2016piecewise,
  title={Piecewise Holistic Autotuning of Compiler and Runtime Parameters},
  author={Popov, Mihail and Akel, Chadi and Jalby, William and de Oliveira Castro, Pablo},
  booktitle={European Conference on Parallel Processing},
  pages={238--250},
  year={2016},
  organization={Springer}
}

2) Piecewise holistic autotuning of parallel programs with CERE

@article{popov2017piecewise,
  title={Piecewise holistic autotuning of parallel programs with CERE},
  author={Popov, Mihail and Akel, Chadi and Chatelain, Yohan and Jalby, William and de Oliveira Castro, Pablo},
  journal={Concurrency and Computation: Practice and Experience},
  volume={29},
  number={15},
  year={2017},
  publisher={Wiley Online Library}
}

Do not hesitate to contact me if you have any questions or need any
additional documents.

Thank you,
Mihail Popov

-----------------------------------------------------------------------------------

PAPERS SUMMARY:

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Abstract. Current architecture complexity requires fine tuning of compiler
and runtime parameters to achieve full potential performance. Autotuning
substantially improves default parameters in many scenarios, but it is a
costly process requiring a long iterative evaluation.
We propose an automatic piecewise autotuner based on CERE (Codelet
Extractor and REplayer). CERE decomposes applications into small pieces
called codelets: each codelet maps to a loop or to an OpenMP parallel
region and can be replayed as a standalone program.
Codelet autotuning achieves better speedups at a lower tuning cost. By
grouping codelet invocations with the same performance behavior, CERE
reduces the number of loops or OpenMP regions to be evaluated. Moreover,
unlike whole-program tuning, CERE customizes the set of best parameters
for each specific OpenMP region or loop.
We demonstrate CERE tuning of compiler optimizations, number of threads,
and thread affinity on a NUMA architecture. On average over the NAS 3.0
benchmarks, we achieve a speedup of 1.08× after tuning. Tuning a single
codelet is 13× cheaper than whole-program evaluation and estimates the
tuning impact on the original region with a 94.7% accuracy. On a Reverse
Time Migration (RTM) proto-application, we achieve a 1.11× speedup with a
200× cheaper exploration.
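
To make the difference between piecewise and whole-program tuning
concrete, here is a minimal sketch. It is not CERE's implementation: the
codelet names, the configurations, and the timings are all hypothetical
and chosen only for illustration. It assumes each codelet has already
been replayed under every candidate configuration.

```python
from typing import Dict, Tuple

# A configuration is (compiler flags, thread count). Hypothetical values.
Config = Tuple[str, int]
Timings = Dict[str, Dict[Config, float]]


def best_per_codelet(timings: Timings) -> Dict[str, Config]:
    """Piecewise choice: pick the fastest configuration for each codelet."""
    return {name: min(runs, key=runs.get) for name, runs in timings.items()}


def best_whole_program(timings: Timings) -> Config:
    """Whole-program choice: one configuration shared by every codelet."""
    shared = set.intersection(*(set(runs) for runs in timings.values()))
    return min(shared, key=lambda c: sum(runs[c] for runs in timings.values()))


# Made-up replay times in seconds, not results from the papers.
timings: Timings = {
    "loop_a": {("-O2", 4): 2.0, ("-O3", 4): 1.5, ("-O3", 8): 1.8},
    "omp_region_b": {("-O2", 4): 3.0, ("-O3", 4): 3.4, ("-O3", 8): 2.5},
}

piecewise = best_per_codelet(timings)
whole = best_whole_program(timings)

# Total time when each codelet uses its own best configuration.
piecewise_time = sum(timings[name][cfg] for name, cfg in piecewise.items())
# Total time when every codelet uses the single shared configuration.
whole_time = sum(runs[whole] for runs in timings.values())

print(piecewise, piecewise_time)
print(whole, whole_time)
```

In this toy data no single configuration is best for both regions, so
the per-codelet choice can only match or beat the shared one.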

Piecewise Holistic Autotuning of Parallel Programs with CERE

Current architecture complexity requires fine tuning of compiler and
runtime parameters to achieve best performance. Autotuning substantially
improves default parameters in many scenarios, but it is a costly process
requiring long iterative evaluations.
We propose an automatic piecewise autotuner based on CERE (Codelet
Extractor and REplayer). CERE decomposes applications into small pieces
called codelets: each codelet maps to a loop or to an OpenMP parallel
region and can be replayed as a standalone program.
Codelet autotuning achieves better speedups at a lower tuning cost. By
grouping codelet invocations with the same performance behavior, CERE
reduces the number of loops or OpenMP regions to be evaluated. Moreover,
unlike whole-program tuning, CERE customizes the set of best parameters
for each specific OpenMP region or loop.
We demonstrate the CERE tuning of compiler optimizations, number of
threads, thread affinity, and scheduling policy on both NUMA and
heterogeneous architectures. Over the NAS benchmarks, we achieve an
average speedup of 1.08× after tuning. Tuning a codelet is 13× cheaper
than whole-program evaluation and predicts the tuning impact with a 94.7%
accuracy. Similarly, exploring thread configurations and scheduling
policies for a Black-Scholes solver on a heterogeneous big.LITTLE
architecture is over 40× faster using CERE.
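
The idea of grouping codelet invocations with the same performance
behavior can be sketched as follows. The abstract does not say how CERE
forms these groups, so the greedy relative-tolerance rule below is only
a simple stand-in, and the cycle counts are hypothetical. The point is
that only one representative invocation per group has to be replayed
during tuning.

```python
from typing import List


def group_invocations(cycles: List[float], rel_tol: float = 0.1) -> List[List[int]]:
    """Group invocation indices whose cost is within rel_tol of the
    cheapest member of their group (greedy pass over sorted costs)."""
    order = sorted(range(len(cycles)), key=lambda i: cycles[i])
    groups: List[List[int]] = []
    for i in order:
        # groups[-1][0] is the cheapest invocation of the current group.
        if groups and cycles[i] <= cycles[groups[-1][0]] * (1 + rel_tol):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def representatives(groups: List[List[int]]) -> List[int]:
    """Pick one invocation per group to replay (here: the cheapest)."""
    return [group[0] for group in groups]


# Made-up cycle counts for six invocations of one codelet.
cycles = [100.0, 102.0, 500.0, 98.0, 510.0, 101.0]
groups = group_invocations(cycles)
reps = representatives(groups)

print(groups)
print(reps)
```

With these numbers, six invocations collapse into two groups, so two
replays stand in for six.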

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2016_codelet_tuning_Euro-Par.pdf
Type: application/pdf
Size: 467678 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20171128/d31e9c54/attachment-0002.pdf>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2017_CERE_tuning_Concurrency_and_Computation__Practice_and_Experience.pdf
Type: application/pdf
Size: 868319 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20171128/d31e9c54/attachment-0003.pdf>