[cfe-dev] [LLVMdev] costing optimisations

Benjamin Kramer benny.kra at gmail.com
Fri Nov 23 06:44:57 PST 2012


On 23.11.2012, at 15:12, john skaller <skaller at users.sourceforge.net> wrote:

> 
> On 23/11/2012, at 5:46 PM, Sean Silva wrote:
> 
>> Adding LLVMdev, since this is intimately related to the optimization passes.
>> 
>>> I think this is roughly because some function-level optimisations are
>>> worse than O(N) in the number of instructions.
>> 
>> Please profile this and mail llvmdev regarding passes with
>> significantly superlinear behavior (e.g. O(n^2)). My understanding is
>> that these kinds of algorithmic problems are generally considered bugs
>> in the optimizer and will be fixed.
> 
> 
> I cannot do that yet; however, here's some more info:
> 
> Felix compile time: 5 seconds.
> C++ compile and run times:
> 
> With -O3:
> 
> /usr/local/bin/clang++ -c -fno-common -fno-strict-aliasing -std=c++11 -fPIC    ....
> Elapsed:  106.058, Result code 0
> 
> /usr/local/bin/clang++ -fno-common -fno-strict-aliasing -std=c++11 -dynamiclib     -O3  ....
> Elapsed:    0.049, Result code 0
> 
> env DYLD_LIBRARY_PATH=build/release/lib/rtl:$DYLD_LIBRARY_PATH build/release/bin/flx_run '/Users/johnskaller/felix/test/regress/rt/tuple-02.dylib' 
> Elapsed:    0.007, Result code 0
> 
> 
> With -O2:
> Compile: Elapsed:  106.918, Result code 0
> Link: Elapsed:    0.048, Result code 0
> Run: Elapsed:    0.010, Result code 0
> 
> With -O1:
> Compile: Elapsed:   13.664, Result code 0
> Link: Elapsed:    0.042, Result code 0
> Run: Elapsed:    0.007, Result code 0
> 
> So on this particular test case, -O3 and -O2 take the same time, but -O1 is 8 times faster.
> Link and run times are roughly the same in all cases.
> 
> I made a modified program; this is Felix code, but it should be comprehensible.
> The Felix compiler generates C++. This particular test checks printing and
> comparison of tuples and arrays. What is relevant here is the "inline" keyword.
> This tells Felix to inline the procedure, so the result is a single big C++ function.
> 
> /////////////////
> var z = 1, (2.2, "xx");
> println$ str z;
> var z2 = 1, 2.2, "xx", 3, 4.3;
> println$ z2;
> 
> var z3 = 1,23.3, z, z2, 99, "Jah";
> println$ z3;
> a :=  1,2,3,4,5,6,7,8,9,10;
> 
> inline proc A () {
> println$ 1,2;
> println$ 1,2,3;
> println$ 1,2,3,4;
> println$ 1,2,3,4,5;
> println$ (a, a) :>> (int ^ 20);
> }
> A;
> 
> 
> inline proc B () {
> println$ (1,"x") == (1,"x");
> println$ (1,"x",4.2) == (1,"x",4.2);
> println$ (1,"x",4.2,"x") == (1,"x",4.2,"y");
> x1 := (1,"x",4.2,"x");
> x2 := (1,"x",4.2,"y");
> println$ x1 == x2;
> }
> B;
> 
> inline proc C() {
> println$ "-" * 20;
> y1 := (1,"x",4.2,"x",55,"hello",4.222);
> y2 := (1,"x",4.2,"y",55,"hello",4.222);
> println$ y1 == y2; // false
> println$ y1 != y2; // true
> println$ y1 < y2;  // true
> println$ y1 <= y2; // true
> println$ y1 > y2;  // false 
> println$ y1 >= y2; // false
> }
> C;
> 
> println$ a == a;
> 
> b:= 1,2;
> println$ b == b;
> 
> inline proc D() {
> println$ (1,2) == (1,2);
> println$ (1,2,3) == (1,2,3);
> println$ (1,2,3,4) == (1,2,3,4);
> println$ (1,2,3,4,5) == (1,2,3,4,5);
> println$ (1,2,3,4,5,6) == (1,2,3,4,5,6);
> 
> println$ ("1",2,3,4,5,6) == ("1",2,3,4,5,6);
> }
> D;
> /////////////////
> 
> The elapsed time for the compile on this variant with -O2 is 120 seconds.
> 
> Now, watch what happens if I change "inline" to "noinline", which makes
> each procedure A, B, C, D a separate C++ function:
> 
> Elapsed time: 47 seconds.
> 
> With -O1:
> 
> inline version: 14 seconds.
> noinline version:  11 seconds
> 
> So the -O1 compilation takes a bit longer with a single big function.
> The -O2 compilation is almost a decimal order of magnitude slower
> with the inline version, but roughly twice as fast when the function is split up
> into 4 pieces (plus some other stuff).
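> 
> For people who don't read Felix, the two variants correspond roughly to the
> following C++ shapes (purely illustrative, not the actual generated code):
> 
>     #include <iostream>
> 
>     // "noinline" variant: each Felix procedure becomes its own small
>     // C++ function, so the optimiser works on several small bodies.
>     static void A() { std::cout << 1 << "," << 2 << std::endl; }
>     static void B() { std::cout << (1 == 1) << std::endl; }
> 
>     // "inline" variant: the bodies above are expanded into one big flat
>     // function, so the optimiser sees a single large body instead.
>     int main() { A(); B(); return 0; }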
> 
> The actual C++ here (wc output: lines, words, bytes):
> 
>    1729    5498   93095 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.cpp
>     753    2246   17518 /Users/johnskaller/.felix/cache/text/Users/johnskaller/felix/tup1.hpp

How about providing a preprocessed version of those files as a test case? I don't think there are many people on this list that speak Felix fluently and/or are willing to install your compiler. You may have found a pathological case in the optimizer that can be fixed by putting in a clever heuristic, but we'll never know without a test case.

You can also pass -ftime-report to clang to get a breakdown of where the compile time goes.
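
For example (paths purely illustrative, adjust to your setup):

    /usr/local/bin/clang++ -E -fno-common -fno-strict-aliasing -std=c++11 tup1.cpp > tup1.ii
    /usr/local/bin/clang++ -c -O2 -std=c++11 -ftime-report tup1.ii

The first command emits a self-contained preprocessed file you can attach; the second prints per-pass timing so you can see which pass dominates at -O2/-O3.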

- Ben
> 
> plus library files. Felix optimises the use of headers so only required library
> headers are actually used.
> 
> 100 seconds for 2000 lines is only 20 lines a second.
> At that speed a 2 million line program would take 100K seconds to compile
> (about one day).
> 
> Actually, building Clang + LLVM took over 10 hours (using Clang as the compiler).
> 
> OK .. so I split it up again like this:
> 
> /////////////////////////////
> noinline proc A1 () {
> println$ 1,2;
> }
> noinline proc A2() {
> println$ 1,2,3;
> }
> 
> noinline proc A3() {
> println$ 1,2,3,4;
> }
> noinline proc A4() {
> println$ 1,2,3,4,5;
> }
> noinline proc A5() {
> println$ (a, a) :>> (int ^ 20);
> }
> A1; A2; A3; A4; A5;
> ......
> ////////////////////////
> 
> Now with -O1 the compilation time is the same: 11 seconds.
> But with -O2 we're down to 30 seconds (-O3 is the same).
> 
> The thing to note here is that at a coarse grain, the program
> contains no branches. It's straight-line code. Of course there
> may be local branches at a lower level.
> 
> Roughly: the optimiser isn't smart enough to recognise sequential
> computations. It should be able to partition the control flow graph
> of a function into a sequence of "super blocks" if you like, which
> can be optimised independently and the results merged.
> This would reduce the cost of a non-linear algorithm significantly.
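> 
> Here's a toy sketch of the cost argument (not LLVM code; the numbers are
> made up): if a pass does pairwise work over all n instructions in a
> function, its cost is about n^2, but splitting the same n instructions
> into k independent chunks costs about k*(n/k)^2 = n^2/k.
> 
>     #include <cstdio>
> 
>     // Toy model: a "pass" whose cost grows quadratically with the
>     // number of instructions it is run over.
>     static long pairwise_cost(long n) { return n * n; }
> 
>     int main() {
>         const long n = 4000;  // instructions in one big function (made up)
>         const long k = 4;     // split into 4 pieces, as in the noinline test
>         std::printf("one big function : %ld work units\n", pairwise_cost(n));
>         std::printf("four small pieces: %ld work units\n", k * pairwise_cost(n / k));
>         return 0;
>     }
> 
> With those made-up numbers the split version does a quarter of the work,
> which is roughly the shape of the -O2 timings above.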
> 
> I have no idea how to do this at the moment. However, in the bug-finding
> code I worked on, there was a natural way to "clump" the control flow graph.
> The graph was built directly on the C AST, so in general each assignment
> 
> 	x = expr
> 
> would be a clump. Inside the expr there is control flow, but it is usually
> entirely linear. So when doing path analysis, you could basically look
> at the C-level control flow and disregard the flow in the expression.
> Most of the time, data flow followed, because sub-expressions only
> assigned to temporaries (so these variables are created and destroyed
> within the statement and can usually be ignored).
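> 
> A minimal C++ illustration of the "clump" idea (hypothetical, not the
> actual bug-finding code):
> 
>     // Each assignment statement is one node in the coarse CFG.  The
>     // right-hand side below has internal control flow (?:), but the
>     // intermediate values live only inside this one statement, so a
>     // statement-level path analysis can treat it as a single linear step.
>     int clamp(int a, int lo, int hi) {
>         int x;
>         x = (a < lo) ? lo : (a > hi ? hi : a);  // one clump, despite branches
>         return x;
>     }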
> 
> Obviously LLVM cannot do that (the higher-level structure has been compiled
> away), but it could use heuristics to search for the patterns that would
> be left behind. I do that in my compiler with some success.
> 
> 
> --
> john skaller
> skaller at users.sourceforge.net
> http://felix-lang.org
> 
> 
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev




