[LLVMdev] [cfe-dev] costing optimisations

john skaller skaller at users.sourceforge.net
Fri Nov 23 06:12:32 PST 2012


On 23/11/2012, at 5:46 PM, Sean Silva wrote:

> Adding LLVMdev, since this is intimately related to the optimization passes.
> 
>> I think this is roughly because some function level optimisations are
>> worse than O(N) in the number of instructions.
> 
> Please profile this and mail llvmdev regarding passes with
> significantly superlinear behavior (e.g. O(n^2)). My understanding is
> that these kinds of algorithmic problems are generally considered bugs
> in the optimizer and will be fixed.


I cannot do that yet; however, here's some more info:

Felix compile time: 5 seconds.
C++ compile and run times:

With -O3:

/usr/local/bin/clang++ -c -fno-common -fno-strict-aliasing -std=c++11 -fPIC    ....
Elapsed:  106.058, Result code 0

/usr/local/bin/clang++ -fno-common -fno-strict-aliasing -std=c++11 -dynamiclib     -O3  ....
Elapsed:    0.049, Result code 0

env DYLD_LIBRARY_PATH=build/release/lib/rtl:$DYLD_LIBRARY_PATH build/release/bin/flx_run '/Users/johnskaller/felix/test/regress/rt/tuple-02.dylib' 
Elapsed:    0.007, Result code 0


With -O2:
Compile: Elapsed:  106.918, Result code 0
Link: Elapsed:    0.048, Result code 0
Run: Elapsed:    0.010, Result code 0

With -O1:
Compile: Elapsed:   13.664, Result code 0
Link: Elapsed:    0.042, Result code 0
Run: Elapsed:    0.007, Result code 0

So on this particular test case, -O3 and -O2 take the same time, but -O1 is 8 times faster.
Link and run times are roughly the same in all cases.

I made a modified program; this is Felix code, but it should be comprehensible.
The Felix compiler generates C++. This particular test checks printing and
comparison of tuples and arrays. What is relevant here is the "inline" keyword,
which tells Felix to inline the procedure, so the result is a single big C++ function.

/////////////////
var z = 1, (2.2, "xx");
println$ str z;
var z2 = 1, 2.2, "xx", 3, 4.3;
println$ z2;

var z3 = 1,23.3, z, z2, 99, "Jah";
println$ z3;
a :=  1,2,3,4,5,6,7,8,9,10;

inline proc A () {
println$ 1,2;
println$ 1,2,3;
println$ 1,2,3,4;
println$ 1,2,3,4,5;
println$ (a, a) :>> (int ^ 20);
}
A;


inline proc B () {
println$ (1,"x") == (1,"x");
println$ (1,"x",4.2) == (1,"x",4.2);
println$ (1,"x",4.2,"x") == (1,"x",4.2,"y");
x1 := (1,"x",4.2,"x");
x2 := (1,"x",4.2,"y");
println$ x1 == x2;
}
B;

inline proc C() {
println$ "-" * 20;
y1 := (1,"x",4.2,"x",55,"hello",4.222);
y2 := (1,"x",4.2,"y",55,"hello",4.222);
println$ y1 == y2; // false
println$ y1 != y2; // true
println$ y1 < y2;  // true
println$ y1 <= y2; // true
println$ y1 > y2;  // false 
println$ y1 >= y2; // false
}
C;

println$ a == a;

b:= 1,2;
println$ b == b;

inline proc D() {
println$ (1,2) == (1,2);
println$ (1,2,3) == (1,2,3);
println$ (1,2,3,4) == (1,2,3,4);
println$ (1,2,3,4,5) == (1,2,3,4,5);
println$ (1,2,3,4,5,6) == (1,2,3,4,5,6);

println$ ("1",2,3,4,5,6) == ("1",2,3,4,5,6);
}
D;
/////////////////

The elapsed time for the compile on this variant with -O2 is 120 seconds.

Now, watch what happens if I change "inline" to "noinline" which makes
each procedure A,B,C,D a separate C++ function:

Elapsed time: 47 seconds.

With -O1:

inline version: 14 seconds.
noinline version:  11 seconds

So the -O1 compilation takes a bit longer with a single big function.
The -O2 compilation is almost a decimal order of magnitude slower
with the inline version, but more than twice as fast when the function
is split up into 4 pieces (plus some other stuff).
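The arithmetic behind that observation is worth spelling out: if a pass costs roughly O(n^2) in the instruction count of a single function, then splitting one function of N instructions into k functions of N/k instructions each cuts the total cost from N^2 to N^2/k. A toy model (the quadratic exponent is purely an illustrative assumption; real passes vary):

```python
# Toy model of a superlinear (here quadratic) per-function pass cost.
# The exponent 2 is an illustrative assumption, not a measured value.

def pass_cost(n, exponent=2):
    """Cost of optimising one function containing n instructions."""
    return n ** exponent

def total_cost(sizes, exponent=2):
    """Total cost of optimising a set of functions of the given sizes."""
    return sum(pass_cost(n, exponent) for n in sizes)

N = 4000                       # instructions in the single inlined function
k = 4                          # split into 4 noinline procedures instead
one_big = total_cost([N])
split = total_cost([N // k] * k)
print(one_big / split)         # -> 4.0: splitting k ways saves a factor of k
```

Under this model the observed speedup from inlining to noinline (roughly 2.5x for a 4-way split) is consistent with some passes being superlinear and others not.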

The actual C++ here:

    1729    5498   93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
     753    2246   17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

plus library files. Felix optimises its use of headers, so only the required
library headers are actually included.

100 seconds for 2000 lines is only 20 lines a second.
At that speed a 2 million line program would take 100K seconds to compile
(about one day).

(For comparison, building Clang + LLVM itself took over 10 hours, using Clang as the compiler.)

OK .. so I split it up again like this:

/////////////////////////////
noinline proc A1 () {
println$ 1,2;
}
noinline proc A2() {
println$ 1,2,3;
}

noinline proc A3() {
println$ 1,2,3,4;
}
noinline proc A4() {
println$ 1,2,3,4,5;
}
noinline proc A5() {
println$ (a, a) :>> (int ^ 20);
}
A1; A2; A3; A4; A5;
......
////////////////////////

Now with -O1 the compilation time is the same: 11 seconds.
But with -O2 we're down to 30 seconds (-O3 is the same).

The thing to note here is that at a coarse grain, the program
contains no branches. It's straight line code. Of course there
may be local branches at a lower level.

Roughly: the optimiser isn't smart enough to recognise sequential
computations. It should be able to partition the control flow graph
of a function into a sequence of "super blocks" if you like, which
can be optimised independently and the results merged.
This would reduce the cost of a non-linear algorithm significantly.
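The partitioning idea can be sketched concretely. This is a toy model, not how LLVM actually represents CFGs: blocks are laid out in order, branch edges are pairs of block indices, and a cut is legal wherever no edge spans it, so each resulting run could be optimised independently:

```python
def superblocks(n_blocks, edges):
    """Partition blocks 0..n_blocks-1, laid out in order, into maximal
    runs ('super blocks') such that no branch edge crosses a cut point.
    Each run could then be optimised independently and the results merged."""
    cut_ok = [True] * n_blocks          # cut_ok[p]: may we cut after block p?
    for i, j in edges:
        lo, hi = min(i, j), max(i, j)
        for p in range(lo, hi):         # the edge spans every cut in [lo, hi)
            cut_ok[p] = False
    runs, start = [], 0
    for p in range(n_blocks):
        if cut_ok[p]:                   # after the last block, cut is always ok
            runs.append(list(range(start, p + 1)))
            start = p + 1
    return runs

# Coarsely straight-line code with only local branches:
# blocks 0 and 1 form a loop, block 2 branches into block 3.
print(superblocks(4, [(0, 1), (1, 0), (2, 3)]))
# -> [[0, 1], [2, 3]]
```

For code that is straight-line at a coarse grain, the runs stay small, and a superlinear pass applied per run rather than per function stays cheap.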

I have no idea how to do this at the moment. However in the bug finding
code I worked on, there was a natural way to "clump" the control flow graph.
The graph was built directly on the C AST, so in general each assignment

	x = expr

would be a clump. Inside the expr there is control flow, but it is usually
entirely linear. So when doing path analysis, you could basically look
at the C level control flow, and disregard the flow in the expression.
Most of the time, data flow followed because sub-expressions only
assigned to temporaries (so these variables are created and destroyed
within the statement and can usually be ignored).
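The clumping idea can be illustrated with a toy dataflow pass. The AST shape here is a made-up minimal stand-in, not a real C AST: each assignment statement is one node recording only the C-level variable it writes and the C-level variables its expression reads, with intra-expression temporaries deliberately not modelled:

```python
# Toy illustration of statement-level 'clumping': analysis walks one
# node per assignment and ignores control and data flow inside expressions.
from dataclasses import dataclass

@dataclass
class Assign:
    target: str              # the C-level variable written
    reads: set               # C-level variables the expression reads
    # intra-expression temporaries are deliberately not modelled

def live_variables(stmts):
    """Backward liveness computed over whole statements (clumps)."""
    live = set()
    for s in reversed(stmts):
        live.discard(s.target)   # killed by the assignment
        live |= s.reads          # generated by the expression
    return live

prog = [
    Assign("x", {"a", "b"}),     # x = a + b       (temps inside ignored)
    Assign("y", {"x", "c"}),     # y = f(x) * c
    Assign("z", {"y"}),          # z = y
]
print(live_variables(prog))      # live on entry: {'a', 'b', 'c'}
```

The point is the granularity: the analysis never sees the flow inside each expression, which is what makes path analysis over the statement-level graph cheap.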

Obviously LLVM cannot do that (the higher level stuff has been compiled
away) but it could use heuristics to search for the patterns that would
be left behind. I do that in my compiler with some success.


--
john skaller
skaller at users.sourceforge.net
http://felix-lang.org






