[LLVMdev] Area for improvement

Jeff Cohen jeffc at jolt-lang.org
Mon Feb 21 22:11:26 PST 2005


Sorry, I thought I was running the selection dag isel, but I screwed up 
when trying out the really big array.  You're right, it does clean it up 
except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually 
get used?

I might consider whipping it into shape.  Does it still have to handle 
getelementptr in its full generality?

Chris Lattner wrote:

> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>
>> I noticed that fourinarow is one of the programs in which LLVM is 
>> much slower than GCC, so I decided to take a look and see why that is 
>> so.  The program has many loops that look like this:
>>
>>   #define ROWS 6
>>   #define COLS 7
>>
>>   void init_board(char b[COLS][ROWS+1])
>>   {
>>     int i,j;
>>
>>     for (i=0;i<COLS;i++)
>>           for (j=0;j<ROWS;j++)
>>             b[i][j]='.';
>>                         for (i=0;i<COLS;i++)
>>           b[i][ROWS]=0;
>>   }
>>
>> This generates the following X86 code:
>>       imull $7, %ecx, %edx
>>       movl %eax, %esi
>>       addl %edx, %esi
>>       movb $46, (%esi)
>>       imull $7, %ecx, %edx
>>       movl %eax, %esi
>>       addl %edx, %esi
>>       leal 1(%esi), %edx
>
> ... (many many copies of this, see the end of the email for full 
> output) ...
>
>> The code generated by GCC is much faster.  LLVM does unroll the loop, 
>> which GCC does not do, but the addressing code is a constant 
>> repetition of these three instructions:
>>
>>       imull $7, %ecx, %edx
>>       movl %eax, %esi
>>       addl %edx, %esi
>
>
> Yup, this is an issue of simple "peephole expansion" of the 
> getelementptr instruction.
>
>> All but the first occurrence do nothing but put the same address into 
>> %esi that was already there to begin with, and they do it with an 
>> expensive multiply at that.  GCC correctly creates a suitable 
>> induction variable and strength-reduces the multiplication down to 
>> addition.
>
>
> This is called loop strength reduction.
>
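For concreteness, here is a rough C-level sketch of what that strength
reduction amounts to (illustrative only, not GCC's actual output; the
function name is made up): the i*7 scaling hidden inside b[i][j] is
replaced by a row pointer that is bumped by 7 on each iteration, so the
multiply disappears from the loop:

  #define ROWS 6
  #define COLS 7

  /* hand-written sketch, not compiler output */
  void init_board_sr(char b[COLS][ROWS+1])
  {
    int i,j;
    char (*row)[ROWS+1] = b;   /* induction variable replacing the i*7 scaling */

    for (i=0;i<COLS;i++, row++)
      for (j=0;j<ROWS;j++)
        (*row)[j]='.';

    row = b;
    for (i=0;i<COLS;i++, row++)
      (*row)[ROWS]=0;
  }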
>> It turns out that using the new selection dag code generator, the 
>> outer loop is unrolled also and nothing is left but a bunch of movb 
>> instructions.  Much, much faster.
>
>
> For the record, instead of the ugly code above, the new isel generates:
>
> init_board:
>         mov %EAX, DWORD PTR [%ESP + 4]
>         mov %ECX, 0
> .LBBinit_board_1:       # loopexit.1
>         imul %EDX, %ECX, 7
>         mov BYTE PTR [%EAX + %EDX], 46
>         mov BYTE PTR [%EAX + %EDX + 1], 46
>         mov BYTE PTR [%EAX + %EDX + 2], 46
>         mov BYTE PTR [%EAX + %EDX + 3], 46
>         mov BYTE PTR [%EAX + %EDX + 4], 46
>         mov BYTE PTR [%EAX + %EDX + 5], 46
>         inc %ECX
>         cmp %ECX, 7
>         jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:       # return
>         mov BYTE PTR [%EAX + 6], 0
>         mov BYTE PTR [%EAX + 13], 0
>         mov BYTE PTR [%EAX + 20], 0
>         mov BYTE PTR [%EAX + 27], 0
>         mov BYTE PTR [%EAX + 34], 0
>         mov BYTE PTR [%EAX + 41], 0
>         mov BYTE PTR [%EAX + 48], 0
>         ret
>
> This code is almost "perfect" with the exception of the multiply by 7 
> inside the loop.  Loop strength reduction would transform this into 
> an add, as GCC does.
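A rough sketch of that transformation, written in C for readability
(hypothetical, mirroring the loop above rather than any pass's actual
output; the function name is made up): the per-iteration imul by 7
becomes a single add of 7 to a running offset:

  /* hand-written sketch, not compiler output */
  void init_board_lsr(char *b)        /* b points at board[0][0]       */
  {
    int i;
    int off = 0;                      /* replaces the i*7 computation  */
    for (i = 0; i < 7; i++) {
      b[off+0] = b[off+1] = b[off+2] = '.';
      b[off+3] = b[off+4] = b[off+5] = '.';
      off += 7;                       /* the strength-reduced step     */
    }
    b[6] = b[13] = b[20] = b[27] = b[34] = b[41] = b[48] = 0;
  }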
>
>
>> But still, the optimization passes shouldn't be leaving so much 
>> garbage around for instruction selection to clean up.  At first, I 
>> thought all that was needed was to run another round of basic block 
>> optimizations after the loop was unrolled, until I saw the actual 
>> LLVM bytecodes:
>
>
> There are two issues here.  First, LLVM code is NOT designed for 
> trivial peephole code generation.  This is *possible*, but not going 
> to give you good quality code (see above ;-) ).  The new isel uses an 
> extremely light-weight DAG-based optimizer that both simplifies 
> instruction selectors and optimizes the code before selection.  LLVM 
> code is meant to be a host to mid-level and interprocedural 
> optimizations, and I think it does that job well.
>
>> Now the problem is obvious.  A two-dimensional array access is being 
>> performed by a single instruction.  The arithmetic needed to address 
>> the element is implicit, and therefore inaccessible to 
>> optimizations.  The redundant calculations cannot be eliminated, nor 
>> can strength reduction be performed.  getelementptr needs to be 
>> broken down into its constituent operations at some point.
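For reference, the arithmetic that a single getelementptr for b[i][j]
folds together looks like this when written out explicitly (a
hypothetical lowering sketch; the helper name is made up):

  /* illustrative only: the address computation implied by b[i][j] */
  char *addr_of_b_i_j(char (*b)[7], int i, int j)
  {
    char *base = (char *)b;
    return base + i*7 + j;   /* the mul and add that would become
                                visible to CSE and strength reduction
                                if the getelementptr were broken apart */
  }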
>
>
> Actually, strength reduction can be done trivially with getelementptr 
> instructions; that was part of their design.  Part of the problem is 
> that we have an induction variable canonicalization pass which UNDOES 
> strength reduction to make loop analysis simpler, but we don't have a 
> strength reduction pass to redo it.  Thus, if you compile a loop like 
> this (complex double chosen as a size that X86 cannot scale "for free"):
>
> #include <complex.h>
> void test(unsigned N, complex double *P) {
>   for (; N != 0; --N)
>     *P++ = 0;
> }
>
> We compile it to this LLVM code:
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>         %tmp.15 = seteq uint %N, 0              ; <bool> [#uses=1]
>         br bool %tmp.15, label %return, label %no_exit
>
> no_exit:                ; preds = %no_exit, %entry
>         %indvar = phi uint [ %indvar.next, %no_exit ], [ 0, %entry ]            ; <uint> [#uses=3]
>         %tmp.4 = getelementptr "complex long double"* %P, uint %indvar, uint 0          ; <double*> [#uses=1]
>         store double 0.000000e+00, double* %tmp.4
>         %tmp.5 = getelementptr "complex long double"* %P, uint %indvar, uint 1          ; <double*> [#uses=1]
>         store double 0.000000e+00, double* %tmp.5
>         %indvar.next = add uint %indvar, 1              ; <uint> [#uses=2]
>         %exitcond = seteq uint %indvar.next, %N         ; <bool> [#uses=1]
>         br bool %exitcond, label %return, label %no_exit
>
> return:         ; preds = %no_exit, %entry
>         ret void
> }
>
> Note the accesses to P are using P[indvar].{real|imag}, not 
> incrementing P.  Compile this to X86, and you get this loop (using the 
> simple instruction selector):
>
> .LBBtest_1:     # no_exit
>         mov %ESI, %EDX
>         shl %ESI, 4                       --> scale by sizeof(complex double)
>         mov %EDI, %ECX
>         add %EDI, %ESI
>         mov DWORD PTR [%EDI], 0
>         mov DWORD PTR [%EDI + 4], 0
>         mov %ESI, %EDX
>         shl %ESI, 4                       --> scale by sizeof(complex double)
>         mov %EDI, %ECX
>         add %EDI, %ESI
>         lea %ESI, DWORD PTR [%EDI + 8]
>         mov DWORD PTR [%ESI], 0
>         mov DWORD PTR [%ESI + 4], 0
>         inc %EDX
>         cmp %EDX, %EAX
>         jne .LBBtest_1  # no_exit
>
>
> Now, if you disable the -indvars pass, you'll get this loop instead 
> (note the pointer increment of P is retained):
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>         %tmp.15 = seteq uint %N, 0              ; <bool> [#uses=1]
>         br bool %tmp.15, label %return, label %no_exit
> no_exit:                ; preds = %no_exit, %entry
>         %P_addr.0.0 = phi "complex long double"* [ %inc, %no_exit ], [ %P, %entry ]             ; <"complex long double"*> [#uses=3]
>         %N_addr.0.0 = phi uint [ %dec, %no_exit ], [ %N, %entry ]               ; <uint> [#uses=1]
>         %inc = getelementptr "complex long double"* %P_addr.0.0, int 1          ; <"complex long double"*> [#uses=1]
>         %tmp.4 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 0                ; <double*> [#uses=1]
>         store double 0.000000e+00, double* %tmp.4
>         %tmp.5 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 1                ; <double*> [#uses=1]
>         store double 0.000000e+00, double* %tmp.5
>         %dec = add uint %N_addr.0.0, 4294967295         ; <uint> [#uses=2]
>         %tmp.1 = seteq uint %dec, 0             ; <bool> [#uses=1]
>         br bool %tmp.1, label %return, label %no_exit
> return:         ; preds = %no_exit, %entry
>         ret void
> }
>
> which codegens to this loop (simple isel again):
>
> .LBBtest_1:     # no_exit
>         lea %EDX, DWORD PTR [%ECX + 16]
>         mov DWORD PTR [%ECX], 0
>         mov DWORD PTR [%ECX + 4], 0
>         mov DWORD PTR [%ECX + 8], 0
>         mov DWORD PTR [%ECX + 12], 0
>         dec %EAX
>         test %EAX, %EAX
>         mov %ECX, %EDX
>         jne .LBBtest_1  # no_exit
>
> I think you'll agree that this loop is much nicer, though there is 
> some weird register shuffling going on between ECX and EDX.  This is 
> the code we would generate if we did loop strength reduction at the 
> LLVM level.  Note that on X86 this isn't a HUGE problem in all cases: 
> it can scale by 2, 4, and 8 for "free".  The problem is much worse on 
> RISC targets.
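To make the contrast concrete, the two loop shapes being discussed can be
sketched in C (hypothetical renderings of the IR above, not compiler
output; the function names are made up):

  #include <complex.h>

  /* sketch of the -indvars canonicalized form: one counted induction
     variable, every access rescaled by sizeof(complex double) == 16   */
  void test_canonical(unsigned N, complex double *P)
  {
    unsigned indvar;
    for (indvar = 0; indvar != N; ++indvar) {
      ((double *)&P[indvar])[0] = 0.0;   /* real part: P + indvar*16   */
      ((double *)&P[indvar])[1] = 0.0;   /* imag part: P + indvar*16+8 */
    }
  }

  /* sketch of the pointer-increment form (what disabling -indvars
     keeps, and what strength reduction would recreate): the scaling
     is gone, P is simply bumped by 16 each iteration                  */
  void test_ptr(unsigned N, complex double *P)
  {
    for (; N != 0; --N) {
      ((double *)P)[0] = 0.0;
      ((double *)P)[1] = 0.0;
      ++P;
    }
  }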
>
>> Or am I missing something?  Is this breaking down and reapplying 
>> optimizations part of the selection dag process?  I am somewhat 
>> surprised that using it affected what loops got unrolled.
>
>
> I'm not sure what you mean.  The LLVM code input to the selection dag 
> isel is exactly the same as that input to the simple isel, and it does 
> no loop unrolling.  The full code produced by the simple isel (for 
> Jeff's example) is attached below for reference.  Contrast this to the 
> pattern isel output at the top.
>
> init_board:
>         sub %ESP, 4
>         mov DWORD PTR [%ESP], %ESI
>         mov %EAX, DWORD PTR [%ESP + 8]
>         mov %ECX, 0
> .LBBinit_board_1:       # loopexit.1
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         mov BYTE PTR [%ESI], 46
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         lea %EDX, DWORD PTR [%ESI + 1]
>         mov BYTE PTR [%EDX], 46
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         lea %EDX, DWORD PTR [%ESI + 2]
>         mov BYTE PTR [%EDX], 46
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         lea %EDX, DWORD PTR [%ESI + 3]
>         mov BYTE PTR [%EDX], 46
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         lea %EDX, DWORD PTR [%ESI + 4]
>         mov BYTE PTR [%EDX], 46
>         imul %EDX, %ECX, 7
>         mov %ESI, %EAX
>         add %ESI, %EDX
>         lea %EDX, DWORD PTR [%ESI + 5]
>         mov BYTE PTR [%EDX], 46
>         inc %ECX
>         cmp %ECX, 7
>         jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:       # return
>         mov BYTE PTR [%EAX + 6], 0
>         mov BYTE PTR [%EAX + 13], 0
>         mov BYTE PTR [%EAX + 20], 0
>         mov BYTE PTR [%EAX + 27], 0
>         mov BYTE PTR [%EAX + 34], 0
>         mov BYTE PTR [%EAX + 41], 0
>         mov BYTE PTR [%EAX + 48], 0
>         mov %ESI, DWORD PTR [%ESP]
>         add %ESP, 4
>     ret
>
> -Chris
>



