[LLVMdev] Area for improvement

Chris Lattner sabre at nondot.org
Mon Feb 21 22:20:10 PST 2005


On Mon, 21 Feb 2005, Jeff Cohen wrote:

> Sorry, I thought I was running selection dag isel but I screwed up when 
> trying out the really big array.  You're right, it does clean it up except 
> for the multiplication.
>
> So LoopStrengthReduce is not ready for prime time and doesn't actually get 
> used?

I don't know what the status of it is.  You could try it out, and see what 
it does. :)  I would suggest adding it to gccas immediately after the SCCP 
pass.
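
If you want to experiment before rebuilding gccas, the pass can also be 
run by hand with opt (a sketch; this assumes the pass is registered under 
the name "loop-reduce"):

    llvm-as < test.ll | opt -sccp -loop-reduce | llvm-dis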

> I might consider whipping it into shape.  Does it still have to handle 
> getelementptr in its full generality?

What do you mean?

-Chris

> Chris Lattner wrote:
>
>> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>> 
>>> I noticed that fourinarow is one of the programs in which LLVM is much 
>>> slower than GCC, so I decided to take a look and see why that is so.  The 
>>> program has many loops that look like this:
>>> 
>>>   #define ROWS 6
>>>   #define COLS 7
>>> 
>>>   void init_board(char b[COLS][ROWS+1])
>>>   {
>>>     int i,j;
>>> 
>>>     for (i=0;i<COLS;i++)
>>>       for (j=0;j<ROWS;j++)
>>>         b[i][j]='.';
>>>     for (i=0;i<COLS;i++)
>>>       b[i][ROWS]=0;
>>>   }
>>> 
>>> This generates the following X86 code:
>>>       imull $7, %ecx, %edx
>>>       movl %eax, %esi
>>>       addl %edx, %esi
>>>       movb $46, (%esi)
>>>       imull $7, %ecx, %edx
>>>       movl %eax, %esi
>>>       addl %edx, %esi
>>>       leal 1(%esi), %edx
>> 
>> ... (many many copies of this, see the end of the email for full output) 
>> ...
>> 
>>> The code generated by GCC is much faster.  LLVM does unroll the loop, 
>>> which GCC does not do, but the addressing code is a constant repetition of 
>>> these three instructions:
>>> 
>>>       imull $7, %ecx, %edx
>>>       movl %eax, %esi
>>>       addl %edx, %esi
>> 
>> 
>> Yup, this is an issue of simple "peephole expansion" of the getelementptr 
>> instruction.
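>> 
>> To make the arithmetic concrete (a hand-written C sketch, not compiler 
>> output; the helper name is made up): with b declared char[7][7], the 
>> getelementptr for b[i][j] implies exactly the multiply/add pair that 
>> the listing above repeats for every store:
>> 
>>   /* Row stride is 7 bytes, element stride 1.  A naive peephole 
>>      expansion re-emits this arithmetic for each access instead of 
>>      reusing the previous result. */
>>   char *element_addr(char (*b)[7], int i, int j) {
>>     return (char *)b + i * 7 + j;   /* the imull $7 / addl pair */
>>   }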
>> 
>>> All but the first occurrence do nothing but put the same address into %esi 
>>> that was already there to begin with, and they do it with an expensive 
>>> multiply at that.  GCC correctly creates a suitable induction variable and 
>>> strength-reduces the multiplication down to addition.
>> 
>> 
>> This is called loop strength reduction.
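>> 
>> In a nutshell (an illustrative C rewrite, not the output of any LLVM 
>> pass; the function names are made up), it turns a multiply that grows 
>> with the induction variable into a running add:
>> 
>>   /* before: one multiply per iteration */
>>   void before(char *a, int n) {
>>     int i;
>>     for (i = 0; i < n; i++)
>>       a[i * 7] = 0;
>>   }
>> 
>>   /* after: the multiply is strength-reduced to t += 7 */
>>   void after(char *a, int n) {
>>     int i, t;
>>     for (i = 0, t = 0; i < n; i++, t += 7)
>>       a[t] = 0;
>>   }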
>> 
>>> It turns out that using the new selection dag code generator, the outer 
>>> loop is unrolled also and nothing is left but a bunch of movb 
>>> instructions.  Much, much faster.
>> 
>> 
>> For the record, instead of the ugly code above, the new isel generates:
>> 
>> init_board:
>>         mov %EAX, DWORD PTR [%ESP + 4]
>>         mov %ECX, 0
>> .LBBinit_board_1:       # loopexit.1
>>         imul %EDX, %ECX, 7
>>         mov BYTE PTR [%EAX + %EDX], 46
>>         mov BYTE PTR [%EAX + %EDX + 1], 46
>>         mov BYTE PTR [%EAX + %EDX + 2], 46
>>         mov BYTE PTR [%EAX + %EDX + 3], 46
>>         mov BYTE PTR [%EAX + %EDX + 4], 46
>>         mov BYTE PTR [%EAX + %EDX + 5], 46
>>         inc %ECX
>>         cmp %ECX, 7
>>         jne .LBBinit_board_1    # loopexit.1
>> .LBBinit_board_2:       # return
>>         mov BYTE PTR [%EAX + 6], 0
>>         mov BYTE PTR [%EAX + 13], 0
>>         mov BYTE PTR [%EAX + 20], 0
>>         mov BYTE PTR [%EAX + 27], 0
>>         mov BYTE PTR [%EAX + 34], 0
>>         mov BYTE PTR [%EAX + 41], 0
>>         mov BYTE PTR [%EAX + 48], 0
>>         ret
>> 
>> This code is almost "perfect" with the exception of the multiply by 7 
>> inside of the loop.  Loop strength reduction would transform this into an 
>> add like GCC produces.
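>> 
>> Concretely (a hand-written sketch of the transformation, not actual 
>> pass output), the imul by 7 would become a row pointer that steps by 7 
>> on each trip around the loop:
>> 
>>   void init_board_reduced(char b[7][7]) {
>>     char *row = &b[0][0];
>>     int i, j;
>>     for (i = 0; i < 7; i++, row += 7)   /* row += 7 replaces the imul */
>>       for (j = 0; j < 6; j++)
>>         row[j] = '.';
>>     /* the trailing zero column is handled as in the listing above */
>>   }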
>> 
>> 
>>> But still, the optimization passes shouldn't be leaving so much garbage 
>>> around for instruction selection to clean up.  At first, I thought all 
>>> that was needed was to run another round of basic block optimizations 
>>> after the loop was unrolled, until I saw the actual LLVM bytecodes:
>> 
>> 
>> There are two issues here.  First, LLVM code is NOT designed for trivial 
>> peephole code generation.  This is *possible*, but not going to give you 
>> good quality code (see above ;-) ).  The new isel uses an extremely 
>> light-weight DAG-based optimizer that both simplifies instruction selectors 
>> and optimizes the code before selection.  LLVM code is meant to be a host 
>> to mid-level and interprocedural optimizations, and I think it does that 
>> job well.
>> 
>>> Now the problem is obvious.  A two-dimensional array access is being 
>>> performed by a single instruction.  The arithmetic needed to address the 
>>> element is implicit, and therefore inaccessible to optimizations.  The 
>>> redundant calculations cannot be eliminated, nor can strength reduction 
>>> be performed.  getelementptr needs to be broken down into its constituent 
>>> operations at some point.
>> 
>> 
>> Actually strength reduction can be trivially done with getelementptr 
>> instructions, that was part of their design.  Part of the problem is that 
>> we have an induction variable canonicalization pass which UNDOES strength 
>> reduction to make loop analysis simpler, but we don't have strength 
>> reduction to redo it.  Thus if you compile a loop like this (complex 
>> double chosen to show a size that X86 can't scale "for free"):
>> 
>> #include <complex.h>
>> void test(unsigned N, complex double *P) {
>>   for (; N != 0; --N)
>>     *P++ = 0;
>> }
>> 
>> We compile it to this LLVM code:
>> 
>> void %test(uint %N, "complex long double"* %P) {
>> entry:
>>         %tmp.15 = seteq uint %N, 0              ; <bool> [#uses=1]
>>         br bool %tmp.15, label %return, label %no_exit
>> 
>> no_exit:                ; preds = %no_exit, %entry
>>         %indvar = phi uint [ %indvar.next, %no_exit ], [ 0, %entry ]  ; <uint> [#uses=3]
>>         %tmp.4 = getelementptr "complex long double"* %P, uint %indvar, uint 0  ; <double*> [#uses=1]
>>         store double 0.000000e+00, double* %tmp.4
>>         %tmp.5 = getelementptr "complex long double"* %P, uint %indvar, uint 1  ; <double*> [#uses=1]
>>         store double 0.000000e+00, double* %tmp.5
>>         %indvar.next = add uint %indvar, 1              ; <uint> [#uses=2]
>>         %exitcond = seteq uint %indvar.next, %N         ; <bool> [#uses=1]
>>         br bool %exitcond, label %return, label %no_exit
>> 
>> return:         ; preds = %no_exit, %entry
>>         ret void
>> }
>> 
>> Note the accesses to P are using P[indvar].{real|imag}, not incrementing P. 
>> Compile this to X86, and you get this loop (using the simple instruction 
>> selector):
>> 
>> .LBBtest_1:     # no_exit
>>         mov %ESI, %EDX
>>         shl %ESI, 4                       --> scale by sizeof(complex double)
>>         mov %EDI, %ECX
>>         add %EDI, %ESI
>>         mov DWORD PTR [%EDI], 0
>>         mov DWORD PTR [%EDI + 4], 0
>>         mov %ESI, %EDX
>>         shl %ESI, 4                       --> scale by sizeof(complex double)
>>         mov %EDI, %ECX
>>         add %EDI, %ESI
>>         lea %ESI, DWORD PTR [%EDI + 8]
>>         mov DWORD PTR [%ESI], 0
>>         mov DWORD PTR [%ESI + 4], 0
>>         inc %EDX
>>         cmp %EDX, %EAX
>>         jne .LBBtest_1  # no_exit
>> 
>> 
>> Now, if you disable the -indvars pass, you'll get this loop instead (note 
>> the pointer increment of P is retained):
>> 
>> void %test(uint %N, "complex long double"* %P) {
>> entry:
>>         %tmp.15 = seteq uint %N, 0              ; <bool> [#uses=1]
>>         br bool %tmp.15, label %return, label %no_exit
>> no_exit:                ; preds = %no_exit, %entry
>>         %P_addr.0.0 = phi "complex long double"* [ %inc, %no_exit ], [ %P, %entry ]  ; <"complex long double"*> [#uses=3]
>>         %N_addr.0.0 = phi uint [ %dec, %no_exit ], [ %N, %entry ]  ; <uint> [#uses=1]
>>         %inc = getelementptr "complex long double"* %P_addr.0.0, int 1  ; <"complex long double"*> [#uses=1]
>>         %tmp.4 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 0  ; <double*> [#uses=1]
>>         store double 0.000000e+00, double* %tmp.4
>>         %tmp.5 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 1  ; <double*> [#uses=1]
>>         store double 0.000000e+00, double* %tmp.5
>>         %dec = add uint %N_addr.0.0, 4294967295         ; <uint> [#uses=2]
>>         %tmp.1 = seteq uint %dec, 0             ; <bool> [#uses=1]
>>         br bool %tmp.1, label %return, label %no_exit
>> return:         ; preds = %no_exit, %entry
>>         ret void
>> }
>> 
>> which codegens to this loop (simple isel again):
>> 
>> .LBBtest_1:     # no_exit
>>         lea %EDX, DWORD PTR [%ECX + 16]
>>         mov DWORD PTR [%ECX], 0
>>         mov DWORD PTR [%ECX + 4], 0
>>         mov DWORD PTR [%ECX + 8], 0
>>         mov DWORD PTR [%ECX + 12], 0
>>         dec %EAX
>>         test %EAX, %EAX
>>         mov %ECX, %EDX
>>         jne .LBBtest_1  # no_exit
>> 
>> I think you'll agree that this loop is much nicer, though there is some 
>> weird register shuffling going on between ECX and EDX.  This is the code 
>> we would generate if we did loop strength reduction at the LLVM level. 
>> Note that on X86, this isn't a HUGE problem in all cases: it can scale by 
>> 2, 4, and 8 for "free".  The problem is much worse on RISC targets.
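>> 
>> To see the two shapes side by side in source form (illustrative only; 
>> the first is what -indvars canonicalizes toward, the second is what 
>> strength reduction would restore, and is just the original test()):
>> 
>>   #include <complex.h>
>> 
>>   /* indexed form: address = P + i*16, one scale per iteration */
>>   void test_indexed(unsigned N, complex double *P) {
>>     unsigned i;
>>     for (i = 0; i != N; ++i)
>>       P[i] = 0;
>>   }
>> 
>>   /* reduced form: P advances by sizeof(complex double) == 16, one add */
>>   void test_reduced(unsigned N, complex double *P) {
>>     for (; N != 0; --N)
>>       *P++ = 0;
>>   }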
>> 
>>> Or am I missing something?  Is this breaking down and reapplying 
>>> optimizations part of the selection dag process?  I am somewhat surprised 
>>> that using it affected what loops got unrolled.
>> 
>> 
>> I'm not sure what you mean.  The LLVM code input to the selection dag isel 
>> is exactly the same as that input to the simple isel, and it does no loop 
>> unrolling.  The full code produced by the simple isel (for Jeff's example) 
>> is attached below for reference.  Contrast this to the pattern isel output 
>> at the top.
>> 
>> init_board:
>>         sub %ESP, 4
>>         mov DWORD PTR [%ESP], %ESI
>>         mov %EAX, DWORD PTR [%ESP + 8]
>>         mov %ECX, 0
>> .LBBinit_board_1:       # loopexit.1
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         mov BYTE PTR [%ESI], 46
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         lea %EDX, DWORD PTR [%ESI + 1]
>>         mov BYTE PTR [%EDX], 46
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         lea %EDX, DWORD PTR [%ESI + 2]
>>         mov BYTE PTR [%EDX], 46
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         lea %EDX, DWORD PTR [%ESI + 3]
>>         mov BYTE PTR [%EDX], 46
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         lea %EDX, DWORD PTR [%ESI + 4]
>>         mov BYTE PTR [%EDX], 46
>>         imul %EDX, %ECX, 7
>>         mov %ESI, %EAX
>>         add %ESI, %EDX
>>         lea %EDX, DWORD PTR [%ESI + 5]
>>         mov BYTE PTR [%EDX], 46
>>         inc %ECX
>>         cmp %ECX, 7
>>         jne .LBBinit_board_1    # loopexit.1
>> .LBBinit_board_2:       # return
>>         mov BYTE PTR [%EAX + 6], 0
>>         mov BYTE PTR [%EAX + 13], 0
>>         mov BYTE PTR [%EAX + 20], 0
>>         mov BYTE PTR [%EAX + 27], 0
>>         mov BYTE PTR [%EAX + 34], 0
>>         mov BYTE PTR [%EAX + 41], 0
>>         mov BYTE PTR [%EAX + 48], 0
>>         mov %ESI, DWORD PTR [%ESP]
>>         add %ESP, 4
>>         ret
>> 
>> -Chris
>> 
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://mail.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-Chris

-- 
http://nondot.org/sabre/
http://llvm.cs.uiuc.edu/



