[LLVMdev] Area for improvement

Mon Feb 21 19:51:27 PST 2005

Jeff,

This is an important find. I started looking at fourinarow and other
programs that don't do well compared to GCC. I think there are probably
only a couple things remaining for LLVM to smoke GCC on all tests
(floating point is another one). I know that Chris has been looking for
good analyses like the one you've provided.

Could you put this into a bugzilla for us? That way it definitely gets
on the "to do" list.

Thanks,

Reid.

On Mon, 2005-02-21 at 19:17, Jeff Cohen wrote:
> When I increased COLS to the point where the loop could no longer be
> unrolled, the selection dag code generator generated effectively the
> same code as the default X86 code generator.  Lots of redundant
> imul/movl/addl sequences.  It can't clean it up either.  Only
> unrolling all nested loops permits it to be optimized away, regardless
> of code generator.
> 
> Jeff Cohen wrote: 
> > I noticed that fourinarow is one of the programs in which LLVM is
> > much slower than GCC, so I decided to take a look and see why that
> > is so.  The program has many loops that look like this:
> >         #define ROWS 6
> >         #define COLS 7
> >         
> >         void init_board(char b[COLS][ROWS+1])
> >         {
> >           int i,j;
> >         
> >           for (i=0;i<COLS;i++)
> >                 for (j=0;j<ROWS;j++)
> >                   b[i][j]='.';
> >                           
> >           for (i=0;i<COLS;i++)
> >                 b[i][ROWS]=0;
> >         }
> > This generates the following X86 code:
> >             .text
> >             .align    16
> >             .globl    init_board
> >             .type    init_board, @function
> >         init_board:
> >             subl $4, %esp
> >             movl %esi, (%esp)
> >             movl 8(%esp), %eax
> >             movl $0, %ecx
> >         .LBBinit_board_1:    # loopexit.1
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             movb $46, (%esi)
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             leal 1(%esi), %edx
> >             movb $46, (%edx)
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             leal 2(%esi), %edx
> >             movb $46, (%edx)
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             leal 3(%esi), %edx
> >             movb $46, (%edx)
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             leal 4(%esi), %edx
> >             movb $46, (%edx)
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> >             leal 5(%esi), %edx
> >             movb $46, (%edx)
> >             incl %ecx
> >             cmpl $7, %ecx
> >             jne .LBBinit_board_1    # loopexit.1
> >         .LBBinit_board_2:    # return
> >             movb $0, 6(%eax)
> >             movb $0, 13(%eax)
> >             movb $0, 20(%eax)
> >             movb $0, 27(%eax)
> >             movb $0, 34(%eax)
> >             movb $0, 41(%eax)
> >             movb $0, 48(%eax)
> >             movl (%esp), %esi
> >             addl $4, %esp
> >             ret
> > The code generated by GCC is much faster.  LLVM does unroll the
> > loop, which GCC does not do, but the addressing code is a constant
> > repetition of these three instructions:
> >             imull $7, %ecx, %edx
> >             movl %eax, %esi
> >             addl %edx, %esi
> > All but the first occurrance do nothing but put the same address
> > into %esi that was already there to being with, and does it with an
> > expensive multiply at that.  GCC correctly creates a suitable
> > induction variable and strength reduces the multiplication down to
> > addition.
> > 
> > It turns out that using the new selection dag code generator, the
> > outer loop is unrolled also and nothing is left but a bunch of movb
> > instructions.  Much, much faster.  But still, the optimization
> > passes shouldn't be leaving so much garbage around for instruction
> > selection to clean up.  At first, I thought all that was needed was
> > to another round of basic block optimizations after the loop was
> > unrolled, until I saw the actual LLVM bytecodes:
> >         void %init_board([7 x sbyte]* %b) {
> >         entry:
> >             br label %loopexit.1
> >         
> >         loopexit.1:        ; preds = %loopexit.1, %entry
> >             %indvar14 = phi uint [ 0, %entry ], [ %indvar.next15,
> >         %loopexit.1 ]        ; <uint> [#uses=7]
> >             %tmp.10 = getelementptr [7 x sbyte]* %b, uint %indvar14,
> >         int 0        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10
> >             %tmp.10.1 = getelementptr [7 x sbyte]* %b, uint
> >         %indvar14, int 1        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10.1
> >             %tmp.10.2 = getelementptr [7 x sbyte]* %b, uint
> >         %indvar14, int 2        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10.2
> >             %tmp.10.3 = getelementptr [7 x sbyte]* %b, uint
> >         %indvar14, int 3        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10.3
> >             %tmp.10.4 = getelementptr [7 x sbyte]* %b, uint
> >         %indvar14, int 4        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10.4
> >             %tmp.10.5 = getelementptr [7 x sbyte]* %b, uint
> >         %indvar14, int 5        ; <sbyte*> [#uses=1]
> >             store sbyte 46, sbyte* %tmp.10.5
> >             %indvar.next15 = add uint %indvar14, 1        ; <uint>
> >         [#uses=2]
> >             %exitcond16 = seteq uint %indvar.next15, 7        ;
> >         <bool> [#uses=1]
> >             br bool %exitcond16, label %return, label %loopexit.1
> >         
> >         return:        ; preds = %loopexit.1
> >             %tmp.19 = getelementptr [7 x sbyte]* %b, int 0, int 6   
> >         ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19
> >             %tmp.19.1 = getelementptr [7 x sbyte]* %b, int 1, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.1
> >             %tmp.19.2 = getelementptr [7 x sbyte]* %b, int 2, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.2
> >             %tmp.19.3 = getelementptr [7 x sbyte]* %b, int 3, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.3
> >             %tmp.19.4 = getelementptr [7 x sbyte]* %b, int 4, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.4
> >             %tmp.19.5 = getelementptr [7 x sbyte]* %b, int 5, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.5
> >             %tmp.19.6 = getelementptr [7 x sbyte]* %b, int 6, int
> >         6        ; <sbyte*> [#uses=1]
> >             store sbyte 0, sbyte* %tmp.19.6
> >             ret void
> >         }
> > Now the problem is obvious.  A two dimensional array access is being
> > performed by a single instruction.  The arithmetic needed to address
> > the element is implicit, and therefore inaccessible to
> > optimizations.  The redundant calculations can not be eliminated,
> > nor can strength reduction be performed.  getelementptr needs to be
> > broken down into its constituent operations at some point.
> > 
> > Or am I missing something?  Is this breaking down and reapplying
> > optimizations part of the selection dag process?  I am somewhat
> > surprised that using it affected what loops got unrolled.
> > 
> > 
> > ____________________________________________________________________
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://mail.cs.uiuc.edu/mailman/listinfo/llvmdev
> >   
> 
> 
> ______________________________________________________________________
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://mail.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20050221/17ee9e35/attachment.sig>