[AArch64] FP load balancing pass for Cortex-A57

Tue Aug 5 08:09:35 PDT 2014

On 5 August 2014 13:51, James Molloy <james.molloy at arm.com> wrote:
> For best-case performance on Cortex-A57, we should try to use a balanced mix
> of odd and even D-registers when performing a critical sequence of
> independent, non-quadword FP/ASIMD floating-point multiply or
> multiply-accumulate operations.

Hi James,

Great patch! Do you have some tests that could show us the
improvements and make sure we're not regressing in the future?

If I got it right, this pass will only check chains contained in the
same basic block. If that's correct, what would be the added
complexity of checking across multiple basic blocks?

I had a good read and the algorithm looks right, though I have to say
I didn't get everything with ultimate precision. It'd really be good
to have some tests showing current coverage, future planned coverage
and cases where we don't want to change due to conflicts.

I have some comments inline...

+  void add(MachineInstr *MI, unsigned Idx, Color C) {
+    LastInst = MI;
+    LastInstIdx = Idx;
+    Insts.insert(MI);
+
+    LastColor = C;
+  }

Weird formatting, insert(MI) as a last line, separated, would be more
meaningful.

+  /// Return the preferred color of this chain. Ties break to even.
+  Color getPreferredColor() {
+    if (OverrideBalance != 0)
+      return OverrideBalance == 1 ? Color::Even : Color::Odd;
+    return LastColor;
+  }

Is that comment about ties right? This just seem to return the forced
colour or the last colour... Unless the last colour breaks ties to
even...

+  //   (a) That's rather heavyweight for only two colors.
+  //   (b) We expect multiple disjoint interference regions - in
practice the live
+  //   (c) range of chains is quite small and they are clustered between loads
+  //       and stores.

the formatting is weird, making it hard to get the last two topics

+  std::sort(V.begin(), V.end(), [](const std::vector<Chain*> &A,
+                                   const std::vector<Chain*> &B) {
+      return A.front()->startsBefore(B.front());
+    });

weird formatting...

+  // Tie-break equivalent sizes by sorting chains requiring fixups before
+  // those without fixups.

I'd assume the opposite... focusing on simpler ones first, since...

+    // If we'll need a fixup FMOV, don't bother. Testing has shown that this
+    // happens infrequently and when it does it has at least a 50% chance of
+    // slowing code down instead of speeding it up.

and that's it. :)

cheers,
--renato