[PATCH] [AtomicExpand] Set branch weights when new basic blocks are created

Akira Hatanaka ahatanak at gmail.com
Tue May 19 23:28:07 PDT 2015


In http://reviews.llvm.org/D7804#175083, @reames wrote:

> Seems like a reasonable plan.
>
> I'm going to list a couple of ideas for alternate approaches below, but it's been long enough I've completely lost context on this patch.  Feel free to ignore if my suggestions don't seem worthwhile.
>
> 1. We could add a piece of metadata to record how contended a given RMW operation is.  This could be as simple as "uncontended" or a more generic "contention_count(x, of y)".  With that in place, your change could be applied only for uncontended accesses.  (I'd also be open to switching the default to uncontended and then having the metadata for the contended case.  Your reported data made this sound plausible; you just need to make the argument on llvm-dev.)
> 2. You could investigate why the branch weights cause a slowdown in the contended cases.  Looking at the loop structure, I find it slightly odd that they have any impact at all, since they're not likely to affect code placement.  There might be an independent tweak that could be made here.


Judging from the code clang generates, machine block placement is the pass that is making the difference. The basic blocks are laid out in a way that appears to hurt performance in the contended cases (more backward and forward branches to non-consecutive blocks).

This is part of the program that ran slower because of my patch. You can probably see the difference if you compile it for arm64 or aarch64.

  #include <atomic>
  
  struct Node {
    Node(unsigned ii, Node *nn) : i(ii), next(nn) {}
    Node() : i(0), next(nullptr) {}
    unsigned i;
    Node *next;
  };
  
  struct List {
    std::atomic<Node *> head;
  };
  
  List list1;
  
  std::atomic<unsigned> flag;
  unsigned sum = 0;
  
  void addNodes(unsigned b, unsigned e) {
    while (b != e) {
      Node *n = new Node(b++, nullptr);
      Node *expected = list1.head;
  
      // Retry the weak CAS until this node becomes the new head. On
      // failure, `expected` is reloaded with the current head, so only
      // the link needs to be rewritten before the next attempt.
      do {
        n->next = expected;
      } while (!std::atomic_compare_exchange_weak(&list1.head, &expected, n));
    }
  }
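
For reference, a command along these lines (the target triple is just an example) makes the block layout visible in the generated assembly:

  clang++ -std=c++11 -O2 -S --target=aarch64-linux-gnu test.cpp -o test.s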



> Also, can you explain *why* you expected the branch weights to help in the first place?  (i.e. what part of the optimizer/micro-architecture were you trying to exploit?)  Maybe that will spark an idea for another approach.


I was assuming that real workloads typically have little contention, so that giving the "failure" branch a low weight and the "success" branch a high weight would let the optimization passes generate code that runs faster in the uncontended case. I wasn't trying to exploit any optimization pass in particular; I was simply expecting it to make a difference in the generated code.
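
At the source level, the bias I was hoping for is analogous to annotating the CAS loop's exit condition with Clang/GCC's __builtin_expect. This sketch is for illustration only; the patch itself attaches the weights in the AtomicExpand pass when it creates the new blocks, not in user code:

  #include <atomic>
  
  struct Node {
    Node *next = nullptr;
  };
  
  std::atomic<Node *> head{nullptr};
  
  void push(Node *n) {
    Node *expected = head.load();
    do {
      n->next = expected;
      // Hint that the CAS usually succeeds (the uncontended case), so
      // the optimizer can treat the retry edge as cold. This mirrors
      // the low-failure/high-success weights the patch sets on the
      // blocks created for the expanded cmpxchg loop.
    } while (__builtin_expect(
        !head.compare_exchange_weak(expected, n), false));
  }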


http://reviews.llvm.org/D7804
