[PATCH] D75145: [PassManager] adjust VectorCombine placement

Fri Mar 6 12:09:00 PST 2020

dmgreen added a comment.

My benchmarks were still running. D75757 <https://reviews.llvm.org/D75757> wasn't in review long enough for them to complete before it went in (they seem to be running a bit slowly, and Phabricator seems to be sending emails through in chunks).

It looks like it's made things (a lot) worse, not better. For "normal" code this time, not vectorised. The issue here with vector loops might be improved. It's hard to tell. There are so many other regressions that I can't really give you a quick answer. I mean, there are some improvements mixed in, but the total is definitely down. Not sure if this is an ARM issue again, or something more general. It doesn't affect (-Oz) codesize at all, or 6m, which might suggest that it's not just as simple as it disabling some analyses. I will see what I can find out, but we are going in the wrong direction here.

Adding some phase ordering tests for some of this sounds very useful. I'll see what I can add. With unrolling and vectorisation and the rest, they might get quite verbose. I'll see.

And you asked a question: the part of the assembly that was important for performance in this first case was this vector body:

  vldrh.u16       q0, [r0], #16
  subs.w  r12, r12, #8
  vqabs.s16       q0, q0
  vstrb.8 q0, [r1], #16
  bne     .LBB0_4

Which could be using an LE low-overhead loop instruction:

  vldrh.u16       q0, [r0], #16
  vqabs.s16       q0, q0
  vstrb.8 q0, [r1], #16
  le     lr, .LBB0_4

There is a pass in the IR part of the backend that looks for loops, finds the BETC (backedge-taken count) and adds hardware loop intrinsics for it. It's essentially a hardware loop, so you don't need to execute the subs or the bne on each iteration.
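
For context, a loop of roughly this shape is the kind of source that could vectorise to a body like the one above. This is a hypothetical sketch in C, not the actual benchmark: the names sat_abs16 and qabs_loop are made up for illustration, and the saturating behaviour is only assumed from the vqabs.s16 in the assembly. The point is that the trip count (n) is known on entry, which is what allows the loop counter to be set up once before the loop rather than maintained with a subs/bne pair on every iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: saturating absolute value of a 16-bit integer.
 * INT16_MIN has no positive counterpart in int16_t, so it saturates
 * to INT16_MAX (assumed to mirror what vqabs.s16 does per lane). */
static int16_t sat_abs16(int16_t x) {
    if (x == INT16_MIN)
        return INT16_MAX;
    return (int16_t)(x < 0 ? -x : x);
}

/* Hypothetical source loop: a simple counted loop whose trip count
 * is n, computable before the loop is entered. */
void qabs_loop(const int16_t *src, int16_t *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = sat_abs16(src[i]);
}
```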

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D75145/new/

https://reviews.llvm.org/D75145