[llvm-dev] About Clang llvm PGO

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Sun May 8 16:59:24 PDT 2016


On Sun, May 8, 2016 at 2:14 PM, Jie Chen <Jie.Chen at mathworks.com> wrote:

> Hi David,
>
>
> Thanks for your great explanations not only covering llvm but also gcc! To
> understand the code layout optimization better, I slightly changed my code,
> basically, calling the hot() function in the first if-branch instead of at
> the last else branch (see my modified code below). This essentially reduces
> branch instructions being executed, and possibly improves the branch
> predictor performance. On my Mac, I got ~6% performance improvement
> (clang++ -O2) with this code change. Looking at the default.profraw data, I
> can see it has the information that the optimizer could use to make a
> similar optimization as my manual approach. I was hoping llvm PGO could
> do the same thing.
>

Yes. This is a missing profile-guided control-flow optimization: reducing
the hot path's control-dependence height by re-ordering branches, which is
possible when the branch conditions are mutually exclusive.



> I am excited to hear from you that more infrastructure changes are
> under way which will improve the PGO support. So as of now, what is the
> list of PGO optimizations for which I can write some code and see
> immediate improvement from llvm? It would be great to know such details. :-)
>
>

What I can tell you is that there are many missing ones (that can benefit
from profile data), such as profile-aware LICM (patch pending), speculative
PRE, loop unrolling, loop peeling, auto-vectorization, inlining, function
splitting, function layout, function outlining, profile-driven size
optimization, induction variable optimization/strength reduction, stringOp
specialization/optimization/inlining, switch peeling/lowering, etc. The
biggest profile users today include register allocation, BB layout, ifcvt,
shrink-wrapping, etc., but there should be room for improvement there too.

thanks,

David

> Best,
>
>
> Jie
>
>
> //main2.cpp: manual reordering of branches
> #include <iostream>
> #include <stdlib.h>
>
> using namespace std;
>
> long long hot() {
> long long x = 0;
>
> for (int i = 0; i < 1000; i++) {
> x += i^2;
> }
>
> return x;
> }
>
> long long cold() {
> long long y = 0;
>
> for (int i = 0; i < 1000; i++) {
> y += i^2;
> }
>
> return y;
>
> }
>
> long long foo() {
> long long y = 0;
>
> for (int i = 0; i < 1000; i++) {
> y *= i^2;
> }
>
> return y*2;
>
> }
>
> long long bar() {
> long long y = 0;
>
> for (int i = 0; i < 1000; i++) {
> y *= i^2;
> }
>
> return y*3;
>
> }
>
> #define SIZE 10000000
>
> int main() {
>
> int* a = (int *)calloc(SIZE, sizeof(int));
>
> a[100] = 1;
>
> long long sum = 0;
>
> for (int i = 0; i < SIZE; i++) {
> if (a[i] < 1) {
> sum += hot();
> } else if (a[i] == 1) {
> sum += cold();
> } else if (a[i] > 1) {
> sum += bar();
> sum += foo();
> }
> }
> cout << sum << endl;
> return 0;
> }
>
> ------------------------------
> *From:* Xinliang David Li <davidxl at google.com>
> *Sent:* Friday, May 6, 2016 8:06 PM
> *To:* Jie Chen
> *Cc:* llvm-dev
> *Subject:* Re: About Clang llvm PGO
>
> Thanks for testing out LLVM PGO and evaluating its performance.
>
> We are currently still more focused on infrastructure improvement, which is
> the foundation for performance improvement. We are making great progress
> in this direction, but there are still some key missing pieces, such as
> profile data in the inliner. We are working on that. Once those are done,
> more focus will be on making more passes profile-aware and making existing
> profile-aware passes better (e.g., code layout).
>
> I looked at this particular example. GCC PGO can reduce the runtime by
> half, while LLVM's PGO makes no performance difference as you noticed.
>
> For the GCC case, PGO itself contributes about a 15% performance boost. The
> majority of the performance improvement comes from loop vectorization. Note
> that trunk GCC does not turn on vectorization at O2, but does at O3 or at
> O2 with PGO.
>
> LLVM also vectorizes the key loops. However, compared with GCC's
> vectorizer, LLVM's auto-vectorizer produces worse code (e.g., a long sequence
> of instructions to do sign extension): ~6.5 instr/iter vs ~9 instr/iter.
> GCC also unrolls the loop after vectorization, which helps a little
> more. LLVM's vectorization actually hurts performance a little.
>
> We will look into this issue.
>
> thanks,
>
> David
>
> On Fri, May 6, 2016 at 2:04 PM, Jie Chen <Jie.Chen at mathworks.com> wrote:
>
>> Hi David,
>>
>> I am a performance engineer from MathWorks. I am currently exploring
>> building our products with PGO on the Mac platform. While searching for
>> llvm PGO solutions, I came across your name many times. So I thought you
>> were probably the guy behind llvm’s PGO implementation! :-) Here is what
>> confused me regarding the llvm PGO capability. I started with a small program
>> (see my code at the end of this email) for which I saw more than 10%
>> performance improvement with PGO on Linux GCC (g++ -O2, -fprofile-generate,
>> -fprofile-use). I wrote this code based on the assumption that llvm would
>> rearrange the hot/cold branches based on a profile run. But when I tried with
>> Apple Clang and Clang on Ubuntu, I did not see any performance improvement.
>> Since I do not know the implementation detail of llvm PGO, I am confused by
>> not seeing performance improvement as I saw it with GCC (probably with
>> Visual Studio PGO as well). Could you please offer me some insights into
>> the issue? Or on a further question, what kind of code would benefit from
>> llvm PGO optimization?
>>
>> Best,
>>
>> Jie Chen
>> MathWorks
>>
>>
>> #include <iostream>
>> #include <stdlib.h>
>>
>> using namespace std;
>>
>> long long hot() {
>>     long long x = 0;
>>
>>     for (int i = 0; i < 1000; i++) {
>>         x += i^2;
>>     }
>>
>>     return x;
>> }
>>
>> long long cold() {
>>     long long y = 0;
>>
>>     for (int i = 0; i < 1000; i++) {
>>         y += i^2;
>>     }
>>
>>     return y;
>> }
>>
>> long long foo() {
>>     long long y = 0;
>>
>>     for (int i = 0; i < 1000; i++) {
>>         y *= i^2;
>>     }
>>
>>     return y*2;
>> }
>>
>> long long bar() {
>>     long long y = 0;
>>
>>     for (int i = 0; i < 1000; i++) {
>>         y *= i^2;
>>     }
>>
>>     return y*3;
>> }
>>
>> #define SIZE 10000000
>>
>> int main() {
>>     int* a = (int *)calloc(SIZE, sizeof(int));
>>
>>     a[100] = 1;
>>
>>     long long sum = 0;
>>
>>     for (int i = 0; i < SIZE; i++) {
>>         if (a[i] == 1) {
>>             sum += cold();
>>         } else if (a[i] > 1) {
>>             sum += bar();
>>             sum += foo();
>>         } else if (a[i] < 1) {
>>             sum += hot();
>>         }
>>     }
>>
>>     cout << sum << endl;
>>
>>     return 0;
>> }
>>
>>
>> Makefile to compile the above code on Mac:
>>
>> .PHONY: clean
>>
>> regular: main.cpp
>>     clang++ -O2 main.cpp -o main.regular
>>
>> hand: main2.cpp
>>     clang++ -O2 main2.cpp -o main.regular2
>>
>> instr: main.cpp
>>     clang++ -O2 -fprofile-instr-generate main.cpp -o main.instr
>>
>> profile: main.instr
>>     ./main.instr
>>
>> merge: default.profraw
>>     xcrun llvm-profdata merge -output default.profdata default.profraw
>>
>> optimize: default.profdata
>>     clang++ -O2 -fprofile-instr-use=default.profdata main.cpp -o main.optimized
>>
>> clean:
>>     $(RM) default.* main.instr main.optimized main.regular
>>
>
>