<div dir="ltr"><div><div><div>I was looking at the code generated from the following C code and noticed extra loads in the inner loop of these nested for-loops:<br><br>#define DIM 8<br>#define UNROLL_DIM DIM<br>typedef double InArray[DIM][DIM];<br><br>void f1( InArray c, InArray a, InArray b ) {<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>    for( int i=0;i<DIM;i++)<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>        for( int j=0;j<DIM;j++)<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>            for( int k=0;k<DIM;k++) {<br>                c[i][k] = c[i][k] + a[i][j]*b[j][k];<br>            }<br>}<br><br></div>In the innermost loop, the generated code (and the .ll as well) reloads a[i][j] on every iteration. In this case I've unrolled the loops, but the situation is the same if they're not unrolled.<br><br></div>Compiling with -O3, this is the .ll that results (showing just 2 iterations of the unrolled inner loop):<br><br>define void @f1([8 x double]* nocapture %c, [8 x double]* nocapture readonly %a, [8 x double]* nocapture readonly %b) #0 {<br>entry:<br>  %arrayidx8 = getelementptr inbounds [8 x double]* %c, i64 0, i64 0<br>  %arrayidx12 = getelementptr inbounds [8 x double]* %a, i64 0, i64 0<br>  %0 = load double* %arrayidx8, align 8, !tbaa !1<br>  %1 = load double* %arrayidx12, align 8, !tbaa !1<br>  %arrayidx16 = getelementptr inbounds [8 x double]* %b, i64 0, i64 0<br>  %2 = load double* %arrayidx16, align 8, !tbaa !1<br>  %mul = fmul double %1, %2<br>  %add = fadd double %0, %mul<br>  store double %add, double* %arrayidx8, align 8, !tbaa !1<br>  %arrayidx8.1 = getelementptr inbounds [8 x double]* %c, i64 0, i64 1<br>  %3 = load double* %arrayidx8.1, align 8, !tbaa !1<br>  %4 = load double* %arrayidx12, align 8, !tbaa !1        #EXTRA LOAD, could reuse %1!<br>  %arrayidx16.1 = getelementptr inbounds [8 x double]* %b, i64 0, i64 1<br>  %5 = load double* %arrayidx16.1, align 8, !tbaa !1<br>  %mul.1 = fmul double %4, %5<br>  %add.1 = fadd 
double %3, %mul.1<br>  store double %add.1, double* %arrayidx8.1, align 8, !tbaa !1<br>...<br><br></div>Note the line with the comment at the end: #EXTRA LOAD, could reuse %1<br><div>This load of a[i][j] is repeated on every iteration, which seems quite inefficient.<br><br></div><div>I changed the C code to explicitly load a[i][j] outside of the innermost loop, and that (as would be expected) eliminates the extra load:<br><br></div><div>void f1( InArray c, InArray a, InArray b ) {<br>    double a_i_j;  /* same element type as InArray, so the value is not truncated */<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>    for(int i=0;i<DIM;i++){<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>        for(int j=0;j<DIM;j++) {<br>            a_i_j = a[i][j];<br>#pragma clang loop unroll_count(UNROLL_DIM)<br>            for(int k=0;k<DIM;k++) {<br>                c[i][k] = c[i][k] + a_i_j*b[j][k];<br>            }<br>        }<br>    }<br>}<br><br><br></div><div>I'm a bit surprised that -O3 doesn't automatically do what I've done in the second version of the C code when generating code from the first version. Is there a specific optimization that can be enabled to do this?<br><br></div><div>(We're using LLVM 3.6; maybe this is something that's done in later versions?)<br></div><div><br></div><div>Phil<br></div></div>