[AArch64] FP load balancing pass for Cortex-A57

Tue Aug 5 08:18:27 PDT 2014

Hi Renato,

Thanks for the feedback! "More tests!" - gotcha. I'll add a bunch more. I'll sort all the formatting weirdnesses too.

> Is that comment about ties right?

No it isn't. It is a leftover from a previous implementation that got missed due to my tunnel vision. I'll remove it.

> +  // Tie-break equivalent sizes by sorting chains requiring fixups 
> + before  // those without fixups.
>
> I'd assume the opposite... focusing on simpler ones first, since...
>
> +    // If we'll need a fixup FMOV, don't bother. Testing has shown that this
> +    // happens infrequently and when it does it has at least a 50% chance of
> +    // slowing code down instead of speeding it up.

The reason we pick those requiring fixups first is exactly because of that comment. We can't change the parity of those chains, so it is best to look at them first, so the global "parity counter" has taken them into account when we get to the chains that we *can* recolor. This, I think, is the best way to achieve the most balanced parity overall.

> If I got it right, this pass will only check chains contained in the same basic block. If that's correct, what would be the added complexity of checking across multiple basic blocks?

Yes, this is on a basic block level. Crossing basic blocks I think would be difficult - as soon as you get a phi node you have multiple incoming blocks to worry about, and you have to worry about execution paths. I could think about it but I'm not sure it'd be easy!

New version will appear later.

Cheers,

James

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org] 
Sent: 05 August 2014 16:10
To: James Molloy
Cc: Tim Northover; llvm-commits
Subject: Re: [AArch64] FP load balancing pass for Cortex-A57

On 5 August 2014 13:51, James Molloy <james.molloy at arm.com> wrote:
> For best-case performance on Cortex-A57, we should try to use a 
> balanced mix of odd and even D-registers when performing a critical 
> sequence of independent, non-quadword FP/ASIMD floating-point multiply 
> or multiply-accumulate operations.

Hi James,

Great patch! Do you have some tests that could show us the improvements and make sure we're not regressing in the future?

If I got it right, this pass will only check chains contained in the same basic block. If that's correct, what would be the added complexity of checking across multiple basic blocks?

I had a good read and the algorithm looks right, though I have to say I didn't get everything with ultimate precision. It'd really be good to have some tests showing current coverage, future planned coverage and cases where we don't want to change due to conflicts.

I have some comments inline...

+  void add(MachineInstr *MI, unsigned Idx, Color C) {
+    LastInst = MI;
+    LastInstIdx = Idx;
+    Insts.insert(MI);
+
+    LastColor = C;
+  }

Weird formatting, insert(MI) as a last line, separated, would be more meaningful.

+  /// Return the preferred color of this chain. Ties break to even.
+  Color getPreferredColor() {
+    if (OverrideBalance != 0)
+      return OverrideBalance == 1 ? Color::Even : Color::Odd;
+    return LastColor;
+  }

Is that comment about ties right? This just seem to return the forced colour or the last colour... Unless the last colour breaks ties to even...

+  //   (a) That's rather heavyweight for only two colors.
+  //   (b) We expect multiple disjoint interference regions - in
practice the live
+  //   (c) range of chains is quite small and they are clustered between loads
+  //       and stores.

the formatting is weird, making it hard to get the last two topics

+  std::sort(V.begin(), V.end(), [](const std::vector<Chain*> &A,
+                                   const std::vector<Chain*> &B) {
+      return A.front()->startsBefore(B.front());
+    });

weird formatting...

+  // Tie-break equivalent sizes by sorting chains requiring fixups 
+ before  // those without fixups.

I'd assume the opposite... focusing on simpler ones first, since...

+    // If we'll need a fixup FMOV, don't bother. Testing has shown that this
+    // happens infrequently and when it does it has at least a 50% chance of
+    // slowing code down instead of speeding it up.

and that's it. :)

cheers,
--renato