Hello, I've noticed another possible missed optimization while testing more trivial code.<br>This time it's not with a xor but with a multiplication instruction, and the example is a little bit more involved.<br><br>C code:<br>
<br>typedef short t;<br>t foo(t a, t b)<br>{<br> t a4 = a*b;<br> return a4;<br>}<br><br>Argument "a" is passed in R15:R14 and argument "b" in R13:R12; the return value is stored in R15:R14.<br>The mul instruction takes two 8-bit registers and returns a 16-bit result in R1:R0; this is handled in the SelectionDAG the same way as on x86 (btw, mul is marked as commutable).<br>
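<br>For reference, here is a rough C sketch of how the 16-bit multiply breaks down into 8-bit multiplies, which is the expansion both asm versions implement. The names mul8 and mul16_via_mul8 are made up for illustration, and unsigned types are used so that the truncation to 16 bits is well defined; this only models the arithmetic, not the backend.<br>

```c
#include <stdint.h>

/* Hypothetical model of the target's mul: 8 bit x 8 bit -> 16 bit (R1:R0). */
static uint16_t mul8(uint8_t x, uint8_t y)
{
    return (uint16_t)((uint16_t)x * (uint16_t)y);
}

/* 16-bit multiply truncated to 16 bits, built from three 8x8 multiplies. */
uint16_t mul16_via_mul8(uint16_t a, uint16_t b)
{
    uint8_t a_lo = (uint8_t)(a & 0xFF), a_hi = (uint8_t)(a >> 8);
    uint8_t b_lo = (uint8_t)(b & 0xFF), b_hi = (uint8_t)(b >> 8);

    /* a.low * b.low contributes its full 16-bit product. */
    uint16_t low = mul8(a_lo, b_lo);
    uint8_t hi = (uint8_t)(low >> 8);

    /* The two cross products only contribute their low byte to the high byte. */
    hi = (uint8_t)(hi + (uint8_t)(mul8(a_hi, b_lo) & 0xFF));
    hi = (uint8_t)(hi + (uint8_t)(mul8(a_lo, b_hi) & 0xFF));

    /* a.high * b.high only affects bits 16 and up, so it is never computed. */
    return (uint16_t)(((uint16_t)hi << 8) | (low & 0xFF));
}
```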
<br>Asm code:<br><br> mul r12, r15<br> mov r8, r0<br> mul r12, r14<br> mov r9, r0<br> mov r10, r1<br> add r10, r8<br> mul r13, r14<br> mov r15, r0<br> add r15, r10<br>
mov r14, r9<br><br>This can be tuned further to the following:<br><br> mov r8, r14<br> mov r9, r15<br> mul r12, r8<br> mov r14, r0<br> mov r15, r1<br> mul r12, r9<br>
add r15, r0<br> mul r13, r8<br> add r15, r0<br><br>The difference between the two versions is that the second has one instruction less and saves a scratch register.<br>If we start by multiplying the lower parts of both arguments, instead of mixing upper and lower parts from the start, we can save r8 in the first example and a later move. Notice that the second version stores the result of a.low*b.low directly into R15:R14. I'm unsure if this is related to <a href="http://llvm.org/bugs/show_bug.cgi?id=8112">http://llvm.org/bugs/show_bug.cgi?id=8112</a><br>
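<br>As a sanity check that the two sequences above really compute the same value, here is a small C simulation of both. It is only a sketch under assumptions: the register file type regs_t and the helpers mul, mov and add are hypothetical, the operand order is taken to be "op dst, src" for mov and add, mul is taken to write only R1:R0, and flags are not modelled.<br>

```c
#include <stdint.h>

/* Hypothetical register file r0..r15, 8 bits each. */
typedef struct { uint8_t r[16]; } regs_t;

/* mul x, y: 16-bit product of two 8-bit registers goes to R1:R0. */
static void mul(regs_t *s, int x, int y)
{
    uint16_t p = (uint16_t)((uint16_t)s->r[x] * (uint16_t)s->r[y]);
    s->r[0] = (uint8_t)(p & 0xFF);
    s->r[1] = (uint8_t)(p >> 8);
}

/* Assumed operand order: destination first, source second. */
static void mov(regs_t *s, int d, int src) { s->r[d] = s->r[src]; }
static void add(regs_t *s, int d, int src) { s->r[d] = (uint8_t)(s->r[d] + s->r[src]); }

/* "a" in R15:R14, "b" in R13:R12, as described in the email. */
static regs_t load_args(uint16_t a, uint16_t b)
{
    regs_t s = {{0}};
    s.r[14] = (uint8_t)(a & 0xFF); s.r[15] = (uint8_t)(a >> 8);
    s.r[12] = (uint8_t)(b & 0xFF); s.r[13] = (uint8_t)(b >> 8);
    return s;
}

/* Return value is read back from R15:R14. */
static uint16_t result(const regs_t *s)
{
    return (uint16_t)(((uint16_t)s->r[15] << 8) | s->r[14]);
}

/* First sequence (compiler output): 10 instructions, scratch r8, r9, r10. */
uint16_t run_generated(uint16_t a, uint16_t b)
{
    regs_t s = load_args(a, b);
    mul(&s, 12, 15); mov(&s, 8, 0);
    mul(&s, 12, 14); mov(&s, 9, 0); mov(&s, 10, 1);
    add(&s, 10, 8);
    mul(&s, 13, 14); mov(&s, 15, 0);
    add(&s, 15, 10);
    mov(&s, 14, 9);
    return result(&s);
}

/* Second sequence (hand tuned): 9 instructions, scratch r8, r9. */
uint16_t run_tuned(uint16_t a, uint16_t b)
{
    regs_t s = load_args(a, b);
    mov(&s, 8, 14); mov(&s, 9, 15);
    mul(&s, 12, 8); mov(&s, 14, 0); mov(&s, 15, 1);
    mul(&s, 12, 9); add(&s, 15, 0);
    mul(&s, 13, 8); add(&s, 15, 0);
    return result(&s);
}

/* Returns 1 if both sequences match a plain truncated multiply on a sample of inputs. */
int sequences_agree(void)
{
    for (uint32_t a = 0; a <= 0xFFFF; a += 251) {
        for (uint32_t b = 0; b <= 0xFFFF; b += 241) {
            uint16_t want = (uint16_t)(a * b);
            if (run_generated((uint16_t)a, (uint16_t)b) != want) return 0;
            if (run_tuned((uint16_t)a, (uint16_t)b) != want) return 0;
        }
    }
    return 1;
}
```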
I've attached a txt file with the regcoalescing output in case it's useful, as requested in the previous emails.<br><br>Thanks<br>