[LLVMdev] What would LLVM need to do this optimization?

Mon Nov 10 15:39:15 PST 2008

I have a simple test .ll file that does something like this:

int foo(int);

int test(int argc) {
  int i = 0;
  if (argc < 1) i++; else i = foo(i);
  if (argc < 2) i++; else i = foo(i);
  if (argc < 3) i++; else i = foo(i);
  if (argc < 4) i++; else i = foo(i);
  return i;
}

It gets optimized to just this:

return 4;

opt -atomic-region-clone -mem2reg -predsimplify -simplifycfg -instcombine

Here -atomic-region-clone is a pass I've written, and -mem2reg cleans
up the PHIs I demoted with reg2mem. My pass inserts a custom
atomic_begin instruction and a couple of intrinsics to start and roll
back a region of code.

The optimized .ll code is roughly:

entry:
  ; start an atomic region at %atomic, rollback to %original_code
  atomic_begin label %atomic, %original_code
atomic:
  %c = icmp slt i32 %argc, 1
  br i1 %c, label %done, label %abort
done:
  call void @llvm.atomic.end( )
  ret i32 4
abort:
  ; undo all work since atomic_begin and continue at %original_code
  call void @llvm.atomic.abort( )
  unreachable
original_code:
  ; the original compiled method
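To make the control flow of that IR concrete, here is a rough software analogy in C. It is only a sketch: foo() is a hypothetical stub (the post never shows its body), setjmp() stands in for the checkpoint taken by atomic_begin, and longjmp() stands in for @llvm.atomic.abort. Unlike a hardware rollback, longjmp restores control flow but not memory, so the analogy holds only because this region writes no memory.

```c
#include <setjmp.h>

/* Hypothetical stub: the post never shows foo()'s body. */
static int foo(int x) { return x + 100; }

/* The unspecialized method, i.e. what %original_code computes. */
static int original_code(int argc) {
  int i = 0;
  if (argc < 1) i++; else i = foo(i);
  if (argc < 2) i++; else i = foo(i);
  if (argc < 3) i++; else i = foo(i);
  if (argc < 4) i++; else i = foo(i);
  return i;
}

/* Software analogy of the optimized IR. setjmp() plays atomic_begin's
   checkpoint and longjmp() plays @llvm.atomic.abort. longjmp does not
   undo memory writes, which is fine here only because the region has none. */
int atomic_model(int argc) {
  jmp_buf checkpoint;
  if (setjmp(checkpoint) != 0)
    return original_code(argc);  /* rolled back: continue at %original_code */
  /* %atomic */
  if (argc < 1)
    return 4;                    /* %done: end the region, return the constant */
  longjmp(checkpoint, 1);        /* %abort: never returns */
}
```

In this model the only work done before a possible abort is the compare itself, so nothing actually has to be undone.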

Here's what each pass contributes:
1) -atomic-region-clone duplicates the method and, in the copy, removes
the untaken branches (the foo() cases)
2) -predsimplify propagates the "argc must be less than 1" information
and gets rid of the comparisons it implies
3) -simplifycfg eliminates the resulting "br i1 true" branches
4) -instcombine sees a sequence of 0 + 1 + 1 + 1 + 1 and folds it to 4
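The reasoning behind steps 2 to 4 can be mimicked with a toy model. This is a sketch under stated assumptions, not how -predsimplify, -simplifycfg or -instcombine are implemented: each statement "if (argc < k) i++; else i = foo(i);" is represented only by its bound k, the cloned region is assumed to have already established argc < known_bound, and the function names are hypothetical.

```c
/* Given argc < known_bound, "argc < k" must be true whenever
   k >= known_bound, because argc < known_bound <= k.
   Otherwise this toy model cannot fold the comparison. */
static int implied_true(int known_bound, int k) {
  return k >= known_bound;
}

/* bounds[j] is the k of the j-th "if (argc < k) i++; else i = foo(i);".
   The foo() arms are assumed to be aborts already (step 1). Folding a
   compare to true (step 2) leaves a "br i1 true" that step 3 removes, so
   only the increment survives; step 4 sums the increments.
   Returns 1 and stores the constant in *result on success, or 0 if some
   compare is not implied, i.e. the region could still abort there. */
int fold_fast_path(const int *bounds, int n, int known_bound, int *result) {
  int i = 0;  /* the initial "int i = 0;" */
  for (int j = 0; j < n; j++) {
    if (!implied_true(known_bound, bounds[j]))
      return 0;
    i += 1;   /* only the "i++" arm is left */
  }
  *result = i;
  return 1;
}
```

With bounds {1, 2, 3, 4} and known_bound 1 this folds to 4; with known_bound 2 it fails, since knowing argc < 2 says nothing about argc < 1.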

The passes run over both original_code and the atomic code, but
original_code is still a huge mess. (I've attached the original and
optimized output for anyone curious.) Just for fun, I also tried
-std-compile-opts, and the result wasn't much better.

But back to the original question: what would LLVM need in order to do
this without the -atomic-region-clone pass, which relies on hardware
support for rollback? The resulting optimized code doesn't have much in
the atomic region, just a compare and a branch. (Arguably the compare
could be shared between the atomic code and original_code, perhaps by
PRE.) So in this case the hardware isn't really being used to undo
much, if any, work.
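For comparison, here is the same effect with no hardware at all, using plain code versioning: a guarded, specialized copy with the original code as the fallback. This is a minimal sketch with a hypothetical foo() stub, not output from any existing LLVM pass; it needs no rollback only because nothing with a side effect runs before the guard picks a version.

```c
/* Hypothetical stub: the post never shows foo()'s body. */
static int foo(int x) { return x + 100; }

int versioned(int argc) {
  /* Guard: the single compare-and-branch left in the atomic region. */
  if (argc < 1)
    return 4;  /* specialized copy: argc < 2, < 3 and < 4 all follow */

  /* Fallback: the original, unspecialized code. */
  int i = 0;
  if (argc < 1) i++; else i = foo(i);
  if (argc < 2) i++; else i = foo(i);
  if (argc < 3) i++; else i = foo(i);
  if (argc < 4) i++; else i = foo(i);
  return i;
}
```

With this stub, versioned(0) returns 4 and versioned(2) returns 202, matching what the unguarded code computes for those inputs.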

Is there any current work toward something that can do this kind of
optimization? It seems related to the profiling interface for
feedback-directed optimization, but last I heard there hadn't been much
activity there.

Ed