[LLVMdev] Unrolling an arithmetic expression inside a loop

Mon Nov 22 23:21:47 PST 2010

Hello,

I've been redirected from cfe-dev, as code optimizations in clang are 
done in llvm layer.
I'm investigating how optimized code clang generates, and have come
across such an example:
I have two procedures:

void exec0(const int *X, const int *Y, int *res, const int N) {
      int t1[N],t2[N],t3[N],t4[N],t5[N],t6[N];
      for(int i = 0; i < N; i++) {
          t1[i] = X[i]+Y[i];        // X+Y
          t2[i] = t1[i]*Y[i];       // XY + Y2
          t3[i] = X[i]*Y[i];        // XY
          t4[i] = Y[i]*Y[i];        // Y2
          t5[i] = t3[i]+t4[i];      // XY + Y2
          t6[i] = t2[i]-t5[i];      // 0
          res[i] = t6[i] + t3[i];   // XY
      }
}
vs
void exec1(const int *X, const int *Y, int *res, const int N) {
      for(int i = 0; i < N; i++) {
          int t1,t2,t3,t4,t5,t6;
          t1 = X[i]+Y[i];        // X+Y
          t2 = t1*Y[i];          // XY + Y2
          t3 = X[i]*Y[i];        // XY
          t4 = Y[i]*Y[i];        // Y2
          t5 = t3+t4;            // XY + Y2
          t6 = t2-t5;            // 0
          res[i] = t6 + t3;      // XY
      }
}
where I perform some arithmetic operations on a vector. The expression
in written in an obfuscated way, but it actually boils down to just res
= X*Y. In the first case I create tables to store temporaries, in the
second case I have just temporary variables to store intermediate results.

I compile with
clang -S -emit-llvm -std=gnu99 -O3
clang version 2.9 (trunk 118238)
Target: x86_64-unknown-linux-gnu
Thread model: posix

What I get from clang is:
1) In exec0 it recognizes that it doesn't need the tables and doesn't
create them and also, it simplifies the expression, so that the
generated code inside the loop is just a multiplication:
    %scevgep = getelementptr i32* %X, i64 %indvar
    %scevgep2 = getelementptr i32* %Y, i64 %indvar
    %scevgep9 = getelementptr i32* %res, i64 %indvar
    %3 = load i32* %scevgep, align 4, !tbaa !0
    %4 = load i32* %scevgep2, align 4, !tbaa !0
    %5 = mul nsw i32 %4, %3
    store i32 %5, i32* %scevgep9, align 4, !tbaa !0
2) In exec1 however it fails to recognize that the temporary variables
are not reused anywhere and fails to simplify the arithmetic expression,
producing:
    %scevgep = getelementptr i32* %X, i64 %indvar
    %scevgep4 = getelementptr i32* %Y, i64 %indvar
    %scevgep5 = getelementptr i32* %res, i64 %indvar
    %3 = load i32* %scevgep, align 4, !tbaa !0
    %4 = load i32* %scevgep4, align 4, !tbaa !0
    %5 = add nsw i32 %4, %3
    %6 = mul nsw i32 %5, %4
    %7 = mul nsw i32 %4, %4
    %8 = sub i32 %6, %7
    store i32 %8, i32* %scevgep5, align 4, !tbaa !0

It is interesting, that I get better compilation output when inputting
clang with worse input source (creating redundant temporary tables on
stack).

Cheers,
Juliusz Sompolski