[LLVMbugs] [Bug 5501] New: Useless memory accesses removal from loops

Sun Nov 15 10:49:42 PST 2009

http://llvm.org/bugs/show_bug.cgi?id=5501

           Summary: Useless memory accesses removal from loops
           Product: new-bugs
           Version: 2.6
          Platform: PC
        OS/Version: Windows XP
            Status: NEW
          Keywords: code-quality
          Severity: normal
          Priority: P2
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: bearophile at mailas.com
                CC: llvmbugs at cs.uiuc.edu

While testing the LDC compiler, I saw that the SciMark2 benchmark
(http://math.nist.gov/scimark2/) shows two performance problems compared to
Java, which is about 20-30% faster. I have reduced one of them to this nearly
minimal C code:

#include <stdlib.h>

// Reduced Scimark2 SOR benchmark
void test(int N, double omega, double** G) {
    double omega_over_four = omega * 0.25;
    double one_minus_omega = 1.0 - omega;

    int j, i = 1;
    double* Gi = G[i];
    double* Gim1 = G[i - 1];
    double* Gip1 = G[i + 1];
    for (j = 1; j < N - 1; j++)
        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1]) +
                one_minus_omega * Gi[j];
}

int main() {
    int N = 100;
    double** mat = (double**)malloc(sizeof(double*) * N);

    int i;
    for (i = 0; i < N; i++)
        mat[i] = (double*)malloc(sizeof(double) * N);

    test(N, 1.25, mat);
    return 0;
}

Inner loop of test(), using llvm-gcc 2.6 (32 bit, on Windows):

llvm-gcc -Wall -O3 -fomit-frame-pointer -msse3 -march=native -ffast-math

LBB1_2:
    movapd  %xmm1, %xmm2
    mulsd   8(%ecx,%edi,8), %xmm2
    movsd   8(%esi,%edi,8), %xmm3
    addsd   8(%edx,%edi,8), %xmm3
    addsd   (%ecx,%edi,8), %xmm3
    addsd   16(%ecx,%edi,8), %xmm3
    mulsd   %xmm0, %xmm3
    addsd   %xmm2, %xmm3
    movsd   %xmm3, 8(%ecx,%edi,8)
    incl    %edi
    cmpl    %eax, %edi
    jne     LBB1_2

I think the performance difference arises because the 32-bit Java server JIT
reduces the number of memory accesses in the inner loop from 6 to 4 by keeping
the reused Gi values in registers, as in code like this (code not tested):

void test(int N, double omega, double** G) { // Scimark2 SOR reduced
    double omega_over_four = omega * 0.25;
    double one_minus_omega = 1.0 - omega;

    int j, i = 1;
    double* Gi = G[i];
    double* Gim1 = G[i - 1];
    double* Gip1 = G[i + 1];

    double pred = Gi[0];
    double curr = Gi[1];
    for (j = 1; j < N - 1; j++) {
        double succ = Gi[j + 1]; // the only load from Gi in the loop
        pred = omega_over_four * (Gim1[j] + Gip1[j] + pred + succ) +
               one_minus_omega * curr;
        Gi[j] = pred;
        curr = succ;
    }
}
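
A minimal sketch, not part of the original benchmark, for checking that a
hand-optimized row update of this kind computes the same values as the
straightforward one. The names sor_row_naive, sor_row_scalar and
sor_max_abs_diff, the fill pattern, and omega = 1.25 are assumptions made for
illustration. The three rows live in separate buffers here, although the
transformation itself only relies on Gi[j - 1] and Gi[j + 1] being reused
through the same pointer.

```c
#include <assert.h>
#include <stdlib.h>

/* Row update as in the report: 5 loads and 1 store per iteration. */
static void sor_row_naive(int N, double omega, double *Gi,
                          const double *Gim1, const double *Gip1) {
    double omega_over_four = omega * 0.25;
    double one_minus_omega = 1.0 - omega;
    int j;
    for (j = 1; j < N - 1; j++)
        Gi[j] = omega_over_four * (Gim1[j] + Gip1[j] + Gi[j - 1] + Gi[j + 1]) +
                one_minus_omega * Gi[j];
}

/* Same update with Gi[j - 1] and Gi[j] carried in locals:
   3 loads and 1 store per iteration. */
static void sor_row_scalar(int N, double omega, double *Gi,
                           const double *Gim1, const double *Gip1) {
    double omega_over_four = omega * 0.25;
    double one_minus_omega = 1.0 - omega;
    double pred, curr;
    int j;
    if (N < 3)
        return; /* the loop body would never run */
    pred = Gi[0];
    curr = Gi[1];
    for (j = 1; j < N - 1; j++) {
        double succ = Gi[j + 1]; /* only load from Gi in the loop */
        pred = omega_over_four * (Gim1[j] + Gip1[j] + pred + succ) +
               one_minus_omega * curr;
        Gi[j] = pred;
        curr = succ;
    }
}

/* Hypothetical helper: runs both variants on identical input and returns
   the largest absolute difference, or -1.0 if allocation fails. */
double sor_max_abs_diff(int N) {
    double *buf, *up, *down, *a, *b;
    double worst = 0.0;
    int j;

    assert(N >= 3);
    buf = (double *)malloc(sizeof(double) * 4 * (size_t)N);
    if (buf == NULL)
        return -1.0;
    up = buf;
    down = buf + N;
    a = buf + 2 * N;
    b = buf + 3 * N;

    /* Deterministic fill in [0, 1); the pattern is arbitrary. */
    for (j = 0; j < N; j++) {
        up[j] = (double)((j * 37) % 101) / 101.0;
        down[j] = (double)((j * 53) % 103) / 103.0;
        a[j] = b[j] = (double)((j * 71) % 107) / 107.0;
    }

    sor_row_naive(N, 1.25, a, up, down);
    sor_row_scalar(N, 1.25, b, up, down);

    for (j = 0; j < N; j++) {
        double d = a[j] - b[j];
        if (d < 0.0)
            d = -d;
        if (d > worst)
            worst = d;
    }
    free(buf);
    return worst;
}
```

Calling sor_max_abs_diff(100) from a main() should give a value at or very
close to zero. A small tolerance is safer than an exact comparison, since
-ffast-math or x87 excess precision may round the two variants differently.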

On IRC, <nicholas> said:
hm! there's a number of loads that are trivially provably consecutive in
memory, but i don't think we have a pass that even tries to fold them
