[LLVMdev] Speculative phi elimination at the top of a loop?

Fri Jun 4 05:18:02 PDT 2010

I am working on heavily optimising unusually static C++ code, and have encountered a situation where I basically want an optimiser that would speculatively unroll a loop to see if the first round of the loop could be optimised further.  (I happen to know that it is possible.)  The previous optimisations that produce the loop in the first place already do a magical job (relying heavily on constant propagation), transforming cross-class tail recursion into the loop that I am now addressing. Hence, there is probably little than can be done in the previous optimisations.

So, has anyone worked on an optimisation where the optimiser unrolls a loop so a way that allows it to speculatively try if the first round can be further simplified?  

In my case, the loop actually calls a C++ virtual method, but since the instance object is a *constant* on the first round of the loop, it is possible to resolve the method call into a static function, and then inline the static function, which in this case essentially eliminates the first round of the loop.

Here is an example.  Let me have a simple loop bb.i, calling a C++ inner class virtual method in SSA:

---------
define void @test() {
entry:
  %0 = alloca %struct.Foo, align 8
  br label %bb.i

bb.i:
  %1 = phi %struct.Foo* [ %11, %bb.i ], [ %0, %entry ]
  %t = phi %struct.Baseclass* [ %5, %bb.i ], [ getelementptr inbounds (%struct.Bar* @_Const, i64 0, i32 0), %entry ]
  %2 = getelementptr inbounds %struct.Baseclass* %t, i64 0, i32 1
  %3 = load %struct.Innerclass** %2, align 8
  %4 = getelementptr inbounds %struct.Innerclass* %3, i64 0, i32 1
  %5 = load %struct.Baseclass** %4, align 8
  %6 = getelementptr inbounds %struct.Baseclass* %5, i64 0, i32 0 
  %7 = load i32 (...)*** %6, align 8
  %8 = getelementptr inbounds i32 (...)** %7, i64 2
  %9 = load i32 (...)** %8, align 8
  %10 = bitcast i32 (...)* %9 to %struct.Foo* (%struct.Baseclass*, %struct.Foo*)*
  %11 = call %struct.Foo *%10(%struct.Baseclass *%5, %struct.Foo* %1)
  %12 = icmp eq %struct.Foo* %11, null
  br i1 %12, label %exit, label %bb.i

exit:
  ret void
}
--------

What I want essentially to do is to unroll this in a way that allows speculative optimisation of the first round:

--------
define void @test() {
entry:
  %0 = alloca %struct.Foo, align 8
  br label %unrolled

unrolled:
  %ut = %struct.Baseclass* getelementptr inbounds %struct.Bar* @_Const, i64 0, i32 0
  %u2 = getelementptr inbounds %struct.Baseclass* %ut, i64 0, i32 1
  %u3 = load %struct.Innerclass** %u2, align 8
  %u4 = getelementptr inbounds %struct.Innerclass* %u3, i64 0, i32 1
  %u5 = load %struct.Baseclass** %u4, align 8
  %u6 = getelementptr inbounds %struct.Baseclass* %u5, i64 0, i32 0 
  %u7 = load i32 (...)*** %u6, align 8
  %u8 = getelementptr inbounds i32 (...)** %u7, i64 2
  %u9 = load i32 (...)** %u8, align 8
  %u10 = bitcast i32 (...)* %u9 to %struct.Foo* (%struct.Baseclass*, %struct.Foo*)*
  %u11 = call %struct.Foo *%u10(%struct.Baseclass *%u5, %struct.Foo* %0)
  %u12 = icmp eq %struct.Foo* %u11, null
  br i1 %u12, label %exit, label %bb.i

bb.i:
  %1 = phi %struct.Foo* [ %11, %bb.i ], [ %u11, %unrolled ]
  %t = phi %struct.Baseclass* [ %5, %bb.i ], [ %u5, %unrolled ]
  %2 = getelementptr inbounds %struct.Baseclass* %t, i64 0, i32 1
  %3 = load %struct.Innerclass** %2, align 8
  %4 = getelementptr inbounds %struct.Innerclass* %3, i64 0, i32 1
  %5 = load %struct.Baseclass** %4, align 8
  %6 = getelementptr inbounds %struct.Baseclass* %5, i64 0, i32 0 
  %7 = load i32 (...)*** %6, align 8
  %8 = getelementptr inbounds i32 (...)** %7, i64 2
  %9 = load i32 (...)** %8, align 8
  %10 = bitcast i32 (...)* %9 to %struct.Foo* (%struct.Baseclass*, %struct.Foo*)*
  %11 = call %struct.Foo *%10(%struct.Baseclass *%5, %struct.Foo* %1)
  %12 = icmp eq %struct.Foo* %11, null
  br i1 %12, label %exit, label %bb.i

exit:
  ret void
}
--------

Now, since %ut is above a constant, it allows constant propagation to resolve %u2, %u3, %u4 etc up to %u10, which also turns out to be a constant.  As a result, the dynamic call to %u10 can be resolved to a static function call, which then can be inlined.

Being a newbie with LLVM (but with quite some past experience with GCC from 1999-2000), I'd be interested in any ideas of how to approach this.  Would the best way be to add an option to -loop-unroll, and hack away at lib/Transforms/Utils/LoopUnroll.cpp?

--Pekka Nikander