[LLVMdev] Hoisting elements of array argument into registers

Sun Nov 7 05:52:19 PST 2010

David Peixotto <dmp <at> rice.edu> writes:
> I am seeing the wf loop get optimized just fine with llvm 2.8 (and almost
as good with head).

I rechecked this and am I actually seeing the same results as you. I think I
must have made a stupid mistake in my tests before - sorry for the noise.

However, I found that we have a phase ordering problem which is preventing us
getting as much optimisation from LLVM as we might expect. For example, I get
this code for g(int):

"""
define i32 @g(i32 %a) nounwind readnone ssp {
entry:
  %sp = alloca [4 x i32], align 4
  %0 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 0
  store i32 %a, i32* %0, align 4
  %1 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 1
  store i32 %a, i32* %1, align 4
  %2 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 2
  store i32 0, i32* %2, align 4
  %3 = getelementptr inbounds [4 x i32]* %sp, i64 0, i64 3
  store i32 0, i32* %3, align 4
  %4 = icmp eq i32 %a, 0
  br i1 %4, label %wf.exit, label %bb.nph.i

bb.nph.i:                                         ; preds = %entry
  %.promoted1.i = load i32* %1, align 4
  %tmp12.i = add i32 %a, -1
  %tmp13.i = zext i32 %tmp12.i to i33
  %tmp14.i = add i32 %a, -2
  %tmp15.i = zext i32 %tmp14.i to i33
  %tmp16.i = mul i33 %tmp13.i, %tmp15.i
  %tmp17.i = lshr i33 %tmp16.i, 1
  %tmp18.i = trunc i33 %tmp17.i to i32
  %tmp20.i = mul i32 %.promoted1.i, 5
  %tmp21.i = add i32 %tmp20.i, -5
  %tmp22.i = mul i32 %tmp21.i, %tmp12.i
  %tmp9.i = mul i32 %a, %a
  %.promoted2.i = load i32* %2, align 4
  %tmp25.i = mul i32 %tmp18.i, 5
  %tmp.i = sub i32 %.promoted1.i, %a
  %tmp10.i = add i32 %tmp9.i, 1
  %tmp11.i = sub i32 %tmp10.i, %tmp18.i
  %tmp19.i = add i32 %tmp11.i, %.promoted2.i
  %tmp23.i = sub i32 %tmp20.i, %tmp25.i
  %tmp26.i = add i32 %tmp23.i, %tmp22.i
  store i32 0, i32* %0, align 4
  store i32 %tmp19.i, i32* %2, align 4
  store i32 %tmp.i, i32* %1, align 4
  store i32 %tmp26.i, i32* %3, align 4
  br label %wf.exit

wf.exit:                                          ; preds = %entry, %bb.nph.i
  %5 = phi i32 [ %tmp26.i, %bb.nph.i ], [ 0, %entry ]
  %6 = load i32* %2, align 4
  %7 = add nsw i32 %6, %5
  ret i32 %7
}
"""

As you can see, we have "load i32* %1, align 4" which is dominated by
"store i32 %a, i32* %1, align 4",  and similarly for the load/store to %2.
What is more, it is clear than no intervening store aliases with that 
store because all the other stores are to different pointers - and indeed
LLVM can even see that the memory to which they store is allocaed.

In fact there are at least 3 missed opportunities:
* No forwarding of the stores to loads: if we did this, we could spot that
%tmp.i is just 0, and %tmp19.iis just the same as %tmp11.i
* The stores to %2 could be forwarded to the final "load i32* %2, align 4"
by introducing a new phi node for the wf.exit block
* All of the stores to %sp+N can then be eliminated since the memory is
allocaed, and there will be no loads from any of that memory before the
function returns

However, it turns out that if you just run "opt -O3" twice then all of this
stuff gets eliminated and you get truly optimal code - pretty awesome!

Cheers,
Max