Hi,<br><br><div class="gmail_quote">On Tue, Jul 26, 2011 at 11:32 AM, Matt Johnson <span dir="ltr"><<a href="mailto:johnso87@crhc.illinois.edu">johnso87@crhc.illinois.edu</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hi Daniel,<br>

<br>

> Hi folks,<br>

><br>

> I couldn't find a specific XOR (OR and AND) optimization on llvm, and<br>

> therefore I am about to implement it.<br>

> But first I would like to check with you guys that it really does not exist.<br>

><br>

> For a simple loop like this:<br>

><br>

> nbits = 128;<br>

> bit_addr = 0;<br>

>   while(nbits--)<br>

> {<br>

>   bindex=bit_addr>>5;     /* Index is number /32 */<br>

> bitnumb=bit_addr % 32;  /* Bit number in longword */<br>

>   bitmap[bindex]^=(1L<<bitnumb);<br>

> bit_addr++;<br>

> }<br>

><br>

><br>

> The -O3 set of optimizations generates a code like this:<br>

><br>

><br>

> entry:<br>

>    br label %while.body<br>

><br>

> while.body:                                       ; preds = %while.body,<br>

> %entry<br>

>    %0 = phi i32 [ 0, %entry ], [ %inc.3, %while.body ]<br>

>    %shr = lshr i32 %0, 5<br>

>    %rem = and i32 %0, 28<br>

>    %shl = shl i32 1, %rem<br>

>    %arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr<br>

>    %tmp6 = load i32* %arrayidx, align 4<br>

>    %xor = xor i32 %tmp6, %shl<br>

>    %rem.1 = or i32 %rem, 1<br>

>    %shl.1 = shl i32 1, %rem.1<br>

>    %xor.1 = xor i32 %xor, %shl.1<br>

>    %rem.2 = or i32 %rem, 2<br>

>    %shl.2 = shl i32 1, %rem.2<br>

>    %xor.2 = xor i32 %xor.1, %shl.2<br>

>    %rem.3 = or i32 %rem, 3<br>

>    %shl.3 = shl i32 1, %rem.3<br>

>    %xor.3 = xor i32 %xor.2, %shl.3<br>

>    store i32 %xor.3, i32* %arrayidx, align 4<br>

>    %inc.3 = add i32 %0, 4<br>

>    %exitcond.3 = icmp eq i32 %inc.3, 128<br>

>    br i1 %exitcond.3, label %while.end, label %while.body<br>

><br>

> while.end:                                        ; preds = %while.body<br>

>    ret void<br>

><br>

><br>

><br>

> It is clear that we are able to fold all XORs into a single XOR, and the<br>

> same happens to all SHLs and ORs.<br>

> I am using -O3, but the code is not optimized, so I am assuming there is no<br>

> optimization for this case. Am I correct?<br>

<br>

The loop is being unrolled by a factor of 4.  This breaks the artificial<br>

dependence between loop iterations, and should yield a substantial improvement<br>

on machines with enough registers and functional units to take advantage of it.<br>

The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are not<br>

folded into 1.<br></blockquote><div><br></div><div>I think he is trying to say this expression generated by unrolling by a factor of 4 can indeed be folded into a single XOR, SHL and OR. </div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<br>

> If yes, I have a few other questions:<br>

><br>

>   - Do you know of any other similar optimization that could do something<br>

> here but is not being triggered for some reason??<br>

<br>

Try compiling with -Os instead of -O3.  I'm guessing this will turn off loop<br>

unrolling.<br>

<br>

>   - Do you know why a OR instruction is used for increments? instead of using<br>

> a INC or ADD?<br>

<br>

OR is a simpler operation than ADD at the hardware level, and<br>

usually faster and more energy-efficient, so it should be preferred when the compiler<br>

can statically infer that its usage is correct.<br></blockquote><div><br></div><div>I haven't seen a machine in which OR is faster than ADD nor more energy-efficient. They're all done by the same ALU circuitry which delays the pipeline by its worstcase path timing. So, for modern processor hardware purposes, OR is exactly equal ADD. Transforming ADD to OR isn't strenght reduction at all. Maybe this is benefical only if you have a backend generating circuitry (programming FPGAs).</div>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

>   - Is there a straight forward way to know if an instruction belongs to a<br>

> loop? (just curiosity)<br>

<br>

I'll defer to others on this one.<br>

<br>

><br>

> Thanks very much<br>

><br>

> Daniel Nicacio<br>

Best,<br>

Matt<br>

<br>

<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</blockquote></div><br>