[cfe-dev] Unrolling an arithmetic expression inside a loop

Mon Nov 22 06:46:05 PST 2010

Hello,

I'm investigating how optimized code clang generates, and have come 
across such an example:
I have two procedures:

void exec0(const int *X, const int *Y, int *res, const int N) {
     int t1[N],t2[N],t3[N],t4[N],t5[N],t6[N];
     for(int i = 0; i < N; i++) {
         t1[i] = X[i]+Y[i];        // X+Y
         t2[i] = t1[i]*Y[i];       // XY + Y^2
         t3[i] = X[i]*Y[i];        // XY
         t4[i] = Y[i]*Y[i];        // Y^2
         t5[i] = t3[i]+t4[i];      // XY + Y^2
         t6[i] = t2[i]-t5[i];      // 0
         res[i] = t6[i] + t3[i];   // XY
     }
}
vs
void exec1(const int *X, const int *Y, int *res, const int N) {
     for(int i = 0; i < N; i++) {
         int t1,t2,t3,t4,t5,t6;
         t1 = X[i]+Y[i];        // X+Y
         t2 = t1*Y[i];          // XY + Y^2
         t3 = X[i]*Y[i];        // XY
         t4 = Y[i]*Y[i];        // Y^2
         t5 = t3+t4;            // XY + Y^2
         t6 = t2-t5;            // 0
         res[i] = t6 + t3;      // XY
     }
}
where I perform some arithmetic operations on a vector. The expression 
in written in an obfuscated way, but it actually boils down to just res 
= X*Y. In the first case I create tables to store temporaries, in the 
second case I have just temporary variables to store intermediate results.

I compile with
clang -S -emit-llvm -std=gnu99 -O3
clang version 2.9 (trunk 118238)
Target: x86_64-unknown-linux-gnu
Thread model: posix

What I get from clang is:
1) In exec0 it recognizes that it doesn't need the tables and doesn't 
create them and also, it simplifies the expression, so that the 
generated code inside the loop is just a multiplication:
   %scevgep = getelementptr i32* %X, i64 %indvar
   %scevgep2 = getelementptr i32* %Y, i64 %indvar
   %scevgep9 = getelementptr i32* %res, i64 %indvar
   %3 = load i32* %scevgep, align 4, !tbaa !0
   %4 = load i32* %scevgep2, align 4, !tbaa !0
   %5 = mul nsw i32 %4, %3
   store i32 %5, i32* %scevgep9, align 4, !tbaa !0
2) In exec1 however it fails to recognize that the temporary variables 
are not reused anywhere and fails to simplify the arithmetic expression, 
producing:
   %scevgep = getelementptr i32* %X, i64 %indvar
   %scevgep4 = getelementptr i32* %Y, i64 %indvar
   %scevgep5 = getelementptr i32* %res, i64 %indvar
   %3 = load i32* %scevgep, align 4, !tbaa !0
   %4 = load i32* %scevgep4, align 4, !tbaa !0
   %5 = add nsw i32 %4, %3
   %6 = mul nsw i32 %5, %4
   %7 = mul nsw i32 %4, %4
   %8 = sub i32 %6, %7
   store i32 %8, i32* %scevgep5, align 4, !tbaa !0

It is interesting, that I get better compilation output when inputting 
clang with worse input source (creating redundant temporary tables on 
stack).

Cheers,
Juliusz Sompolski