[llvm-dev] [VSXFMAMutate] OldFMAReg may be wrongly rewritten

Tue Mar 22 17:13:40 PDT 2016

I think we can probably go ahead and throw this up on Phabricator for
review. I'd probably bring in Matthias for review as well.

Thanks!

-eric

On Wed, Mar 16, 2016 at 10:53 AM Tim Shen <timshen at google.com> wrote:

> I implemented a proof of concept of a new generic MachineFunction SSA
> pass. The code is not readable and not efficient yet, but it shows
> interesting results:
>
> In fma.ll @test_FMSUB2 (return dummy(A * B + C, A * B - D)):
> before:
>         fmr 0, 1
>         xsmaddadp 3, 0, 2
>         xsmsubmdp 0, 2, 4
>         fmr 1, 3
>         fmr 2, 0
>         bl dummy2
>
> after:
>         xsmsubadp 4, 1, 2
>         xsmaddmdp 1, 2, 3
>         fmr 2, 4
>         bl dummy2
>
> In fma-assoc.ll all 9 copies are eliminated, for example (A * B + C * D +
> E):
> before:
>         xsmaddmdp 3, 4, 5
>         xsmaddadp 3, 1, 2
>         fmr 1, 3
>
> after:
>         xsmaddadp 5, 3, 4
>         xsmaddmdp 1, 2, 5
>
> In fma-ext.ll, the only copy is also eliminated.
>
> The new pass in theory can support any kinds of instructions that may pick
> one among multiple its operands as an in-out register. VSX FMA instructions
> are just some examples.
>
> The drawback is this pass requires LiveIntervals analysis at SSA form and
> Machine MachineDominatorTree analysis. It definitely preserves the former,
> but not sure about the latter - it rearranges operands and re-tie them.
>
> If you think this result is good enough, I'll go ahead and do a bit
> cleanup and code optimization. It's in theory as fast as a topological sort.
>
> On Fri, Mar 4, 2016 at 5:14 PM Tim Shen <timshen at google.com> wrote:
>
>> In 5), by saying "In the same block" I mean we should replace uses of
>> vreg0 with vreg2 afterwards, but only limited in the same basicblock.
>>
>> In 6), we need another tiny piece to actually lower MutatingFMA to real
>> forms, which should be trivial.
>>
>> On Fri, Mar 4, 2016 at 5:09 PM Tim Shen <timshen at google.com> wrote:
>>
>>> I wonder if we can do this in a separate analysis MachineFunction SSA
>>> pass.
>>> 1) SelectionDAG will generate a pseudo instruction MutatingFMA. When
>>> it's generated it's allowed to have d = a * b + c form, where d doesn't
>>> have to be in {a, b, c}.
>>> 2) Later, the proposed pass uses an algorithm to decide for instruction
>>> MI: `%vreg0 = MutatingFMA %vreg1, %vreg2, %vreg3`, it should tie %vreg0
>>> with, say %vreg2. Then it tells the scheduler (through ScheduleDAG?) "If
>>> you put instruction MI as the last use of %vreg2, the killed %vreg2 will
>>> save us a copy". The algorithm saves as many copies as possible, and those
>>> suggestions are guarantee to be compatible with each other, so that
>>> scheduler can choose to follow all of them.
>>> 3) In Two-Address instruction pass, all MutatingFMA instructions are
>>> still handled and copies get inserted. That's fine. No change is required.
>>> 4) Scheduler optionally follows our suggestions.
>>> 5) After Scheduling, we do a copy cleanup pass that change:
>>>     %vreg0<def> = COPY %vreg2<kill>
>>>     %vreg0<def, tied2> = MutatingFMA %vreg1, %vreg0<tied0>, %vreg3
>>>     To:
>>>     %vreg2<def, tie2> = MutatingFMA %vreg1, %vreg2<tied0>, %vreg3
>>>     In the same block.
>>> 6) 1)~5) are target independent. Now the tied MutatingFMA instruction
>>> gets lowered to one of its real forms.
>>>
>>> The only thing left is to formalize the problem and come up with an
>>> algorithm in 2):
>>> Given a DAG of uninteresting nodes and interesting (e.g. MutatingFMA)
>>> nodes, for each interesting nodes, select zero or one of its operand uses
>>> (corresponding to an edge) to mark. Mark as many edges as possible.
>>>
>>> I can't come up with an optimal algorithm for 2). Here's my current
>>> greedy solution:
>>> 1) For each vreg, calculate how many MutatingFMA instructions are using
>>> them, and put the result into ref_count. Set up a worklist that's holding
>>> all MutatingFMA instructions.
>>> 2) Try to get a MutatingFMA from worklist, as MI, such that, it has at
>>> least one operand %x that has ref_count 1. If we can't find one, go to 4).
>>> 3) Make a suggestion about MI and %x. Remove MI from active set and
>>> decrease the ref_count for each of its uses by 1.
>>> 4) If active set is empty, then we are done; otherwise remove a random
>>> (in a heuristic manner) MutatingFMA from worklist (by making no
>>> suggestions, aka keeping the copy) and go back to 2).
>>>
>>> For small amount (e.g. 5) of instructions, we can also enumerate
>>> copy/no-copy decisions for each of them. That will take 2^5 = 32 different
>>> cases, which isn't that many.
>>>
>>> Another concern is that ABI may require values to be put to designated
>>> physical registers (e.g. put return value to F1). It would be good to plumb
>>> a path from F1 as an input, finally to F1 as an output, for example:
>>> f = a * b + c * d + e
>>>
>>> it's better to do:
>>> e = FMA c d e
>>> a = FMA a b e
>>> where the first will be an add form and the second will be a multiply
>>> form. The greedy algorithm isn't aware of this, but the exponential
>>> algorithm can pick this up.
>>>
>>> The algorithm is unaware of any FMA related knowledge though, so it can
>>> be used for any mixed instructions with in-out registers.
>>>
>>> On Wed, Mar 2, 2016 at 3:40 PM Eric Christopher <echristo at gmail.com>
>>> wrote:
>>>
>>>> Relaying an IRC conversation here:
>>>>
>>>> Currently we're going to disable the VSXFMAMutate pass in ToT and talk
>>>> about ways to either rewrite or fix it to deal with more complicated cfgs.
>>>>
>>>> Thanks!
>>>>
>>>> -eric
>>>>
>>>> On Mon, Feb 29, 2016 at 2:21 PM Tim Shen <timshen at google.com> wrote:
>>>>
>>>>> Ping?
>>>>>
>>>>> On Mon, Feb 22, 2016 at 1:06 PM Tim Shen <timshen at google.com> wrote:
>>>>>
>>>>>> On Fri, Feb 19, 2016 at 5:10 PM Tim Shen <timshen at google.com> wrote:
>>>>>>
>>>>>>> I wonder if we can fix this by making the transformation simpler,
>>>>>>> that is, instead of doing:
>>>>>>>
>>>>>>
>>>>>> I wrote a prototype (see attach) for this idea, it actually improves
>>>>>> some of the test cases (e.g. fma-assoc.ll: test_FMADD_ASSOC1), but
>>>>>> pessimize several other cases (e.g. test_FMADD_ASSOC_EXT1).
>>>>>>
>>>>>> I'm not sure what to do at this point, I have several options:
>>>>>> 1) This pass simply omits the optimization for certain cases
>>>>>> ("certain" needs to be defined).
>>>>>> 2) This pass should never mutate anything out of the block, and may
>>>>>> add a copy at the end of the block, if the value is live-out.
>>>>>> 3) tune the transformation as I attempted, to make it simpler,
>>>>>> working, and producing better result? It seems interesting to get the
>>>>>> improvement as my prototype shown without hurting other test cases, but it
>>>>>> needs a bit thinking and I'd like to fix the compiler crash soon.
>>>>>>
>>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160323/136da5b1/attachment-0001.html>