[PATCH] Improve performance of vector code on A15

Fri Feb 22 07:59:04 PST 2013

Hi,

Thanks all for the swift review. Tim, I think I've addressed all your
points. Thanks for the help today - it helps that you're two desks over
from me...

The attached patch has fixes for Quentin and Tim's comments.

Jakob,

> Passes like this are notoriously hard to get right
>
Don't I know it - Myself and Silviu have been working on this for ages,
paid careful attention to the impdef flags and still managed to cock
them up royally.

> It would be much easier to do this transformation before register
> allocation while the code is still in SSA form.

> James, could this pass work as a preRegAlloc pass instead? Or could it
> even be part of isel?
>
When I was creating this patch I did prototype both a pre-regalloc and a
post-regalloc patch. We have a pre-regalloc version in *some* sort of
readiness but it's really not up to scratch. The way I see it:

  + Pre-regalloc is inherently less buggy as the code is in SSA form and
implicit dependencies are much easier / don't exist.
  + Because we're pre-regalloc, finding scratch registers is trivial
(compared to postRA, where we have to register scavenge).
  - It's essentially an educated guessing game which instructions at the
preRA stage will form these sequences we're trying to avoid/work around.
INSERT_SUBREG, REG_SEQUENCE and some COPYs do it, but we end up putting
more palliation sequences than we need.
  - Because preRA passes run so early, the fixups we do are at the mercy
of a load of passes before assembly writing, meaning there's more chance
our analysis and work can be undone.

The basic gist is that we haven't managed to get a pre-RA solution to
work.

> X86 has the same problems with its 8-bit sub-registers, and we simply
> avoid generating code that uses 16-bit registers after writing the
> 8-bit sub-registers.

I've looked at what X86 does (and Swift too, given our situation is very
similar). Unfortunately the situation with the A15 is just slightly too
dissimilar to reuse the same mechanism.

There are legitimate code sequences for vector code that we just can't
avoid - the obvious example being a simple setting of one vector lane:

%rgba = insertelement <4 x float> %rgb, i32 3
store <4 x float>* %p, <4 x float> %rgba

There's no way we can get around that an QPR:ssub_3 needs to be written
to, and we need to detect that and fix up afterwards before that QPR is
used again.

I note in the comments that Swift converts most float operations to <2 x
float> to get around the similar problem it has - we considered that a
tad too heavyweight for A15 (you'd be doing twice the work on all ops,
and consider doing a divide or square root!)

> and it is likely to be a source of many future bugs related to
> sub-register liveness.

The good news:
  * We enabled this pass and made it insert its sequences in many more
situations than it normally would and ran the LNT test-suite over it. No
failures (after fixing many). This gives us some amount of confidence in
it.

The bad news:
  * All the implicit dependencies that Tim noticed were wrong didn't
result in any LNT fails. This removes some amount of confidence.

> I would actually like to get rid of all the implicit operands that are
> currently added by the register allocator, and instead revert to a
> more conservative liveness model for post-RA passes. The extra
> complexity added by the implicit operands don't buy us a lot, and it
> is just too hard to work with the post-RA code as it looks now.

I totally support this :)

Cheers,

James

On Fri, 2013-02-22 at 01:04 +0000, Jakob Stoklund Olesen wrote:
> On Feb 21, 2013, at 11:36 AM, Tim Northover <t.p.northover at gmail.com> wrote:
>
> > From my understanding, the sequence should be:
> >
> > (mark defining instruction as ImpDef<imp-def> if it's not already)
> > Scratch<def> = VDUP ImpDef, #1; DefReg<imp-use> (, PairedReg<imp-use>)
> > ImpDef<def> = VDUP ImpDef, #1; DefReg<imp-use> (, PairedReg<imp-use>)
> > ImpDef<def> = VEXT ImpDef, Scratch; DefReg<imp-def> (, PairedReg<imp-def>)
> >
> > The other sequences also seem a little iffy, but we can probably sort
> > them out together tomorrow if someone confirms I'm not off my rocker
> > in this most complex case.
>
> This looks more plausible. Adding <undef> flags on uses makes the verifier shut up, but only by lying to it.
>
> Passes like this are notoriously hard to get right, and it is likely to be a source of many future bugs related to sub-register liveness. It would be much easier to do this transformation before register allocation while the code is still in SSA form.
>
> I would actually like to get rid of all the implicit operands that are currently added by the register allocator, and instead revert to a more conservative liveness model for post-RA passes. The extra complexity added by the implicit operands don't buy us a lot, and it is just too hard to work with the post-RA code as it looks now.
>
> The conservative liveness model would only require that some part of a used register is live, the current model requires that the full register is live, which is why all the implicit operands are needed.
>
> James, could this pass work as a preRegAlloc pass instead? Or could it even be part of isel?
>
> X86 has the same problems with its 8-bit sub-registers, and we simply avoid generating code that uses 16-bit registers after writing the 8-bit sub-registers.
>
> /jakob
>
>
>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium.  Thank you.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: a15-sd-optimizer.diff
Type: text/x-patch
Size: 21357 bytes
Desc: a15-sd-optimizer.diff
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130222/2d1db0fa/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: implicit-ops.diff
Type: text/x-patch
Size: 1197 bytes
Desc: implicit-ops.diff
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130222/2d1db0fa/attachment-0001.bin>