[llvm-dev] About Clang llvm PGO

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Fri May 6 17:06:07 PDT 2016


Thanks for testing out LLVM PGO and evaluating its performance.

We are currently still focused mostly on infrastructure improvements, which
are the foundation for performance improvements. We are making great progress
in this direction, but some key pieces are still missing, such as the use of
profile data in the inliner. We are working on that. Once those are done,
more of the focus will shift to making more passes profile aware and to
making existing profile-aware passes better (e.g., code layout).
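
As a rough illustration of the kind of information a profile hands to code
layout, the sketch below states the branch bias by hand with
__builtin_expect, a builtin available in both GCC and Clang. The function
classify_sum and its constants are hypothetical and not part of the test
case; a PGO build would aim to derive comparable hints from the profile run
instead of from the source.

```cpp
// Hypothetical helper: dispatches on a[i] the way the test case's main loop
// does, but with the expected branch bias written out manually.
long long classify_sum(const int* a, int n) {
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        if (__builtin_expect(a[i] == 1, 0)) {        // hinted as rarely taken
            sum += 100;
        } else if (__builtin_expect(a[i] > 1, 0)) {  // hinted as rarely taken
            sum += 10;
        } else {                                     // expected common case
            sum += 1;
        }
    }
    return sum;
}
```

The hints only affect how the compiler may arrange the branches; the result
is the same with or without them.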

I looked at this particular example. GCC PGO can reduce the runtime by
half, while LLVM's PGO makes no performance difference as you noticed.

In the GCC case, PGO itself contributes about a 15% performance boost. The
majority of the performance improvement comes from loop vectorization. Note
that trunk GCC does not turn on vectorization at -O2, only at -O3 or at -O2
with PGO.

LLVM also vectorizes the key loops. However, compared with GCC's vectorizer,
LLVM's auto-vectorizer produces worse code (e.g., a long sequence of
instructions to do sign extension): ~6.5 instr/iter for GCC vs ~9 instr/iter
for LLVM. GCC also unrolls the loop after vectorization, which helps a little
more. LLVM's vectorization actually hurts performance a little.
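
To make the sign extension concrete, here is a minimal sketch of the hot loop
with the 32-bit to 64-bit widening written out explicitly. The function names
are hypothetical. The second variant, which keeps the induction variable
64-bit so that no per-element widening is needed in the source, is only an
assumption about what one might try; whether it changes the vectorized code
depends on the compiler and target and has not been measured here.

```cpp
#include <cstdint>

// The hot loop from the test case, with the widening made explicit.
// Note that `i ^ 2` is a bitwise XOR, not a square.
std::int64_t hot_widening_explicit() {
    std::int64_t x = 0;
    for (int i = 0; i < 1000; i++) {
        int narrow = i ^ 2;                      // 32-bit XOR
        x += static_cast<std::int64_t>(narrow);  // sign extension to 64 bits
    }
    return x;
}

// Hypothetical variant: a 64-bit induction variable, so the XOR is already
// computed in 64 bits and no int -> 64-bit conversion appears in the loop.
std::int64_t hot_wide_induction() {
    std::int64_t x = 0;
    for (std::int64_t i = 0; i < 1000; i++) {
        x += i ^ 2;
    }
    return x;
}
```

Both functions compute the same sum as hot() in the test case, so either can
be swapped in when comparing the generated code.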

We will look into this issue.

thanks,

David

On Fri, May 6, 2016 at 2:04 PM, Jie Chen <Jie.Chen at mathworks.com> wrote:

> Hi David,
>
> I am a performance engineer from MathWorks. I am currently exploring
> building our products with PGO on the Mac platform. While searching for
> llvm PGO solutions, I came across your name many times. So I thought you
> were probably the guy behind llvm’s PGO implementation! :-) Here is what
> confused me regarding the llvm PGO capability. I started with a small piece
> of code (see my code at the end of this email) for which I saw more than a
> 10% performance improvement with PGO on Linux GCC (g++ -O2,
> -fprofile-generate, -fprofile-use). I wrote this code based on the
> assumption that llvm would rearrange the hot/cold branches based on the
> profile run. But when I tried it with Apple Clang and Clang on Ubuntu, I did
> not see any performance improvement.
> Since I do not know the implementation detail of llvm PGO, I am confused by
> not seeing performance improvement as I saw it with GCC (probably with
> Visual Studio PGO as well). Could you please offer me some insights into
> the issue? Or on a further question, what kind of code would benefit from
> llvm PGO optimization?
>
> Best,
>
> Jie Chen
> MathWorks
>
>
> #include <iostream>
> #include <stdlib.h>
>
> using namespace std;
>
> long long hot() {
>     long long x = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         x += i^2;
>     }
>
>     return x;
> }
>
> long long cold() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y += i^2;
>     }
>
>     return y;
> }
>
> long long foo() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y *= i^2;
>     }
>
>     return y*2;
> }
>
> long long bar() {
>     long long y = 0;
>
>     for (int i = 0; i < 1000; i++) {
>         y *= i^2;
>     }
>
>     return y*3;
> }
>
> #define SIZE 10000000
>
> int main() {
>     int* a = (int *)calloc(SIZE, sizeof(int));
>
>     a[100] = 1;
>
>     long long sum = 0;
>
>     for (int i = 0; i < SIZE; i++) {
>         if (a[i] == 1) {
>             sum += cold();
>         } else if (a[i] > 1) {
>             sum += bar();
>             sum += foo();
>         } else if (a[i] < 1) {
>             sum += hot();
>         }
>     }
>
>     cout << sum << endl;
>
>     return 0;
> }
>
>
> Makefile to compile the above code on Mac:
>
>
> .PHONY: clean
>
> regular: main.cpp
>     clang++ -O2  main.cpp -o main.regular
>
> hand: main2.cpp
>     clang++ -O2  main2.cpp -o main.regular2
>
> instr: main.cpp
>     clang++ -O2 -fprofile-instr-generate main.cpp -o main.instr
>
> profile: main.instr
>     ./main.instr
>
> merge: default.profraw
>     xcrun llvm-profdata merge -output default.profdata default.profraw
>
> optimize: default.profdata
>     clang++ -O2 -fprofile-instr-use=default.profdata main.cpp -o main.optimized
>
> clean:
>     $(RM) default.* main.instr main.optimized main.regular
>