<div dir="ltr">I wonder if we can do this in a separate analysis MachineFunction SSA pass.<div>1) SelectionDAG will generate a pseudo instruction MutatingFMA. When it's generated it's allowed to have d = a * b + c form, where d doesn't have to be in {a, b, c}.</div><div>2) Later, the proposed pass uses an algorithm to decide for instruction MI: `%vreg0 = MutatingFMA %vreg1, %vreg2, %vreg3`, it should tie %vreg0 with, say %vreg2. Then it tells the scheduler (through ScheduleDAG?) "If you put instruction MI as the last use of %vreg2, the killed %vreg2 will save us a copy". The algorithm saves as many copies as possible, and those suggestions are guarantee to be compatible with each other, so that scheduler can choose to follow all of them.</div><div>3) In Two-Address instruction pass, all MutatingFMA instructions are still handled and copies get inserted. That's fine. No change is required.</div><div>4) Scheduler optionally follows our suggestions.</div><div>5) After Scheduling, we do a copy cleanup pass that change:</div><div> %vreg0<def> = COPY %vreg2<kill></div><div> %vreg0<def, tied2> = MutatingFMA %vreg1, %vreg0<tied0>, %vreg3</div><div> To:</div><div> %vreg2<def, tie2> = MutatingFMA %vreg1, %vreg2<tied0>, %vreg3</div><div> In the same block.</div><div>6) 1)~5) are target independent. Now the tied MutatingFMA instruction gets lowered to one of its real forms.</div><div><br></div><div>The only thing left is to formalize the problem and come up with an algorithm in 2):</div><div>Given a DAG of uninteresting nodes and interesting (e.g. MutatingFMA) nodes, for each interesting nodes, select zero or one of its operand uses (corresponding to an edge) to mark. Mark as many edges as possible.</div><div><br></div><div>I can't come up with an optimal algorithm for 2). Here's my current greedy solution:</div><div>1) For each vreg, calculate how many MutatingFMA instructions are using them, and put the result into ref_count. Set up a worklist that's holding all MutatingFMA instructions.</div><div>2) Try to get a MutatingFMA from worklist, as MI, such that, it has at least one operand %x that has ref_count 1. If we can't find one, go to 4).</div><div>3) Make a suggestion about MI and %x. Remove MI from active set and decrease the ref_count for each of its uses by 1.</div><div>4) If active set is empty, then we are done; otherwise remove a random (in a heuristic manner) MutatingFMA from worklist (by making no suggestions, aka keeping the copy) and go back to 2).</div><div><br></div><div>For small amount (e.g. 5) of instructions, we can also enumerate copy/no-copy decisions for each of them. That will take 2^5 = 32 different cases, which isn't that many.</div><div><br></div><div>Another concern is that ABI may require values to be put to designated physical registers (e.g. put return value to F1). It would be good to plumb a path from F1 as an input, finally to F1 as an output, for example:</div><div>f = a * b + c * d + e</div><div><br></div><div>it's better to do:</div><div>e = FMA c d e</div><div>a = FMA a b e</div><div>where the first will be an add form and the second will be a multiply form. The greedy algorithm isn't aware of this, but the exponential algorithm can pick this up.</div><div><br></div><div>The algorithm is unaware of any FMA related knowledge though, so it can be used for any mixed instructions with in-out registers.</div><br><div class="gmail_quote"><div dir="ltr">On Wed, Mar 2, 2016 at 3:40 PM Eric Christopher <<a href="mailto:echristo@gmail.com">echristo@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Relaying an IRC conversation here:<div><br></div><div>Currently we're going to disable the VSXFMAMutate pass in ToT and talk about ways to either rewrite or fix it to deal with more complicated cfgs.</div><div><br></div><div>Thanks!</div></div><div dir="ltr"><div><br></div><div>-eric</div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, Feb 29, 2016 at 2:21 PM Tim Shen <<a href="mailto:timshen@google.com" target="_blank">timshen@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Ping?</div><br><div class="gmail_quote"><div dir="ltr">On Mon, Feb 22, 2016 at 1:06 PM Tim Shen <<a href="mailto:timshen@google.com" target="_blank">timshen@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Fri, Feb 19, 2016 at 5:10 PM Tim Shen <<a href="mailto:timshen@google.com" target="_blank">timshen@google.com</a>> wrote:</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I wonder if we can fix this by making the transformation simpler, that is, instead of doing:</div></div></blockquote><div><br></div></div></div><div dir="ltr"><div dir="ltr"><div class="gmail_quote"><div>I wrote a prototype (see attach) for this idea, it actually improves some of the test cases (e.g. fma-assoc.ll: test_FMADD_ASSOC1), but pessimize several other cases (e.g. test_FMADD_ASSOC_EXT1).</div><div><br></div><div>I'm not sure what to do at this point, I have several options:</div><div>1) This pass simply omits the optimization for certain cases ("certain" needs to be defined).</div><div>2) This pass should never mutate anything out of the block, and may add a copy at the end of the block, if the value is live-out.</div><div>3) tune the transformation as I attempted, to make it simpler, working, and producing better result? It seems interesting to get the improvement as my prototype shown without hurting other test cases, but it needs a bit thinking and I'd like to fix the compiler crash soon.</div></div></div></div></blockquote></div>
</blockquote></div>
</blockquote></div></div>