[LLVMdev] Testing LLVM on OS X

Tue May 4 21:30:02 PDT 2004

On Tue, 4 May 2004, Chris Lattner wrote:
> I suspect that a large reason that LLVM does worse than a native C
> compiler with the CBE+GCC is that LLVM generates very low-level C code,
> and I'm not convinced that GCC is doing a very good job (ie, without
> syntactic loops).

Yup, this is EXACTLY what is going on.

I took this very simple C function:

int Array[1000];
void test(int X) {
  int i;
  for (i = 0; i < 1000; ++i)
    Array[i] += X;
}

Compiling it with -O3 on OS X gave me this:

_test:
        mflr r5
        bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
        mflr r2
        mtlr r5
        addis r4,r2,ha16(L_Array$non_lazy_ptr-"L00000000001$pb")
        li r2,0
        lwz r9,lo16(L_Array$non_lazy_ptr-"L00000000001$pb")(r4)
        li r4,1000
        mtctr r4
L9:
        lwzx r7,r2,r9          ; load
        add r6,r7,r3           ; add
        stwx r6,r2,r9          ; store
        addi r2,r2,4           ; Increment pointer
        bdnz L9                ; Decrement count register, branch while not zero
        blr

This is nice code, good GCC.  :)

Okay, LLVM currently generates this code from the CBE:

void test(int l7_X) {
  unsigned l8_indvar;
  unsigned l8_indvar__PHI_TEMPORARY;
  int *l14_tmp_2E_5;
  int l7_tmp_2E_9;
  unsigned l8_indvar_2E_next;

  l8_indvar__PHI_TEMPORARY = 0u;   /* for PHI node */

l13_no_exit:
  l8_indvar = l8_indvar__PHI_TEMPORARY;
  l14_tmp_2E_5 = &Array[l8_indvar];
  l7_tmp_2E_9 = *l14_tmp_2E_5;
  *l14_tmp_2E_5 = (l7_tmp_2E_9 + l7_X);
  l8_indvar_2E_next = l8_indvar + 1u;
  if (!(l8_indvar_2E_next == 1000u)) {
    l8_indvar__PHI_TEMPORARY = l8_indvar_2E_next;   /* for PHI node */
    goto l13_no_exit;
  }
  return;
}

This has exactly the same operations in the loop, so GCC should produce
the same code, right?  Wrong:

_test:
        mflr r4
        bcl 20,31,"L00000000001$pb"
"L00000000001$pb":
        mflr r2
        mtlr r4
        li r11,0
        addis r10,r2,ha16(_Array-"L00000000001$pb")
L2:
        slwi r2,r11,2              ; Shift left "i" by 2
        la r5,lo16(_Array-"L00000000001$pb")(r10)
        cmpwi cr0,r11,999          ; compare i to the trip count
        lwzx r7,r2,r5              ; Load from array
        addi r11,r11,1             ; increment "i"
        add r6,r7,r3               ; Add value to array value
        stwx r6,r2,r5              ; store into array
        bne+ cr0,L2                ; Loop until done
        blr

Hrm, basically GCC is not doing ANY loop optimization (e.g. strength
reduction or "do-loop" optimization) whatsoever on this code.  I'm sure
that the X86 GCC suffers from the same problem; it's just that X86 doesn't
depend on strength reduction and do-loop optimization as much, so it's not
as pronounced.
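For readers unfamiliar with the two optimizations named above, here is a
minimal source-level sketch of what they amount to.  It is only an
illustration, not what GCC does internally: the function names
(add_indexed, add_reduced, forms_agree) and the constant N are made up for
this example.  The first function recomputes the element address from the
index on every iteration and compares the index against the trip count,
like the L2 loop above.  The second steps a pointer (strength reduction)
and counts a separate counter down to zero ("do-loop" form, which maps
onto mtctr/bdnz on PowerPC), like the L9 loop.

```c
#define N 1000

/* Index form: every iteration computes array + i*sizeof(int) and
   compares i against the trip count. */
void add_indexed(int *array, int x) {
  unsigned i;
  for (i = 0; i != N; ++i)
    array[i] += x;
}

/* Hand-transformed form (illustrative only). */
void add_reduced(int *array, int x) {
  int *p = array;       /* strength reduction: step a pointer instead of
                           scaling the index each time */
  unsigned count = N;   /* do-loop: count down to zero */
  do {
    *p += x;
    ++p;
  } while (--count != 0);
}

/* Returns 1 if both forms leave identical array contents, 0 otherwise. */
int forms_agree(int x) {
  static int a[N], b[N];
  unsigned i;
  for (i = 0; i != N; ++i)
    a[i] = b[i] = (int)i;
  add_indexed(a, x);
  add_reduced(b, x);
  for (i = 0; i != N; ++i)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

Calling forms_agree with any small X should return 1, since both loops
perform the same 1000 additions; only the bookkeeping differs.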

Interestingly, if I tweak the .cbe code to be this:

  do {
    l8_indvar = l8_indvar__PHI_TEMPORARY;
    l14_tmp_2E_5 = &Array[l8_indvar];
    l7_tmp_2E_9 = *l14_tmp_2E_5;
    *l14_tmp_2E_5 = (l7_tmp_2E_9 + l7_X);
    l8_indvar_2E_next = l8_indvar + 1u;
    l8_indvar__PHI_TEMPORARY = l8_indvar_2E_next;   /* for PHI node */
  } while (!(l8_indvar_2E_next == 1000u));

GCC generates the nice code again, virtually identical to the code from
the original source.  AAAH!  :)
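To make it concrete that the two spellings compute the same thing, here is
a small self-contained sketch that puts a goto-based loop and a do/while
loop side by side and compares the results.  The variable names are
shortened versions of the CBE temporaries, and ArrayGoto, ArrayLoop,
test_goto, test_loop and same_result are hypothetical names invented for
this example (two arrays are used only so the outputs can be compared).

```c
#define N 1000u

int ArrayGoto[N];   /* hypothetical: written by the goto version */
int ArrayLoop[N];   /* hypothetical: written by the do/while version */

/* Shape of the current CBE output: a label plus a conditional goto. */
void test_goto(int X) {
  unsigned indvar;
  unsigned indvar_phi;
  unsigned indvar_next;

  indvar_phi = 0u;                 /* for PHI node */
no_exit:
  indvar = indvar_phi;
  ArrayGoto[indvar] += X;
  indvar_next = indvar + 1u;
  if (!(indvar_next == N)) {
    indvar_phi = indvar_next;      /* for PHI node */
    goto no_exit;
  }
}

/* Same operations, written as a syntactic do/while loop. */
void test_loop(int X) {
  unsigned indvar;
  unsigned indvar_phi;
  unsigned indvar_next;

  indvar_phi = 0u;                 /* for PHI node */
  do {
    indvar = indvar_phi;
    ArrayLoop[indvar] += X;
    indvar_next = indvar + 1u;
    indvar_phi = indvar_next;      /* for PHI node */
  } while (!(indvar_next == N));
}

/* Fills both arrays identically, runs both versions, and returns 1 if
   the arrays still match afterwards, 0 otherwise. */
int same_result(int X) {
  unsigned i;
  for (i = 0; i < N; ++i)
    ArrayGoto[i] = ArrayLoop[i] = (int)i;
  test_goto(X);
  test_loop(X);
  for (i = 0; i < N; ++i)
    if (ArrayGoto[i] != ArrayLoop[i])
      return 0;
  return 1;
}
```

The only difference between the two functions is whether the back edge is
spelled as a goto or as a loop construct, which is exactly the difference
that changes what GCC generates above.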

Maybe this is a good argument for making the CBE generate syntactic loops
in simple cases.  I may have some time to try implementing this on the
weekend.  That is, if no one beats me to it.  :)

-Chris

-- 
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/