[PATCH] D122829: [AArch64] Optimize SDIV with pow2 constant divisor

Fri Apr 1 09:49:23 PDT 2022

efriedma added a comment.

In D122829#3422022 <https://reviews.llvm.org/D122829#3422022>, @bcl5980 wrote:

> And one other point is:
> Case @dont_fold_srem_i16_smax saves 6 instructions with 3 extra add+shift
> Case @dont_fold_srem_power_of_two saves 9 instructions with 1 extra add+shift
> So maybe we can use the general path for vector case at least?

Where are the savings actually coming from here?  I don't think it's related to it being a vector; we're just unrolling it into scalar ops.

In D122829#3421545 <https://reviews.llvm.org/D122829#3421545>, @david-arm wrote:

> Is it also possible that in some contexts we may want to avoid setting flags where possible? i.e. for loops with control flow? The architecture only has one flags register, but has many GPRs.

Doesn't seem likely to me?  From what I've seen, flags don't normally have a long live range in practice.

In D122829#3421002 <https://reviews.llvm.org/D122829#3421002>, @bcl5980 wrote:

> I think AArch64's current BuildSDIVPow2 uses two AArch64ISD nodes, which means less opportunity to combine.

True; are there specific opportunities that matter?

> Add with shift can be implemented in 3 ways:
>
> 1. Split into two micro-ops, one add and one shift; this is always slower than independent add+cmp.
> 2. One pipeline stage to do the shift; generally it gets better IPC than independent add+cmp, but the worst case will be slower than independent add+cmp.
> 3. Direct three-input combinational logic circuit; this is always better than independent add+cmp.
>
> Can someone tell me which approach mainstream AArch64 processors use?
> I know some Arm processors use case 2.

You can download the software optimization guides for most Cortex cores from ARM.  Generally, add-with-shift is one micro-op, but it has two cycles of latency, where basic arithmetic has one.  So the current sequence saves a cycle of latency in most situations.
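
To make the comparison concrete, here is a rough C++ sketch of the two expansions of signed division by 2^k being discussed. The function names, the `asr` helper, and the AArch64 sequences in the comments are illustrative assumptions, not code taken from the patch; `k` is assumed to be in the range 1..30.

```cpp
#include <cstdint>

// Arithmetic shift right, written so it does not depend on how the
// compiler treats >> on negative values (hypothetical helper).
static int32_t asr(int32_t x, unsigned k) {
  return x < 0 ? ~(~x >> k) : (x >> k);
}

// Shape of the add/cmp/csel form: bias negative inputs by 2^k - 1,
// then shift. Roughly:
//   add w8, w0, #(2^k - 1); cmp w0, #0; csel w8, w8, w0, lt; asr w0, w8, #k
int32_t sdiv_pow2_select(int32_t x, unsigned k) {
  int32_t bias = (int32_t(1) << k) - 1;
  int32_t adjusted = x < 0 ? x + bias : x;  // no overflow: x < 0 here
  return asr(adjusted, k);
}

// Shape of the generic form: derive the bias from the sign bits and
// fold the logical shift into the add. Roughly:
//   asr w8, w0, #31; add w8, w0, w8, lsr #(32 - k); asr w0, w8, #k
int32_t sdiv_pow2_shift(int32_t x, unsigned k) {
  int32_t sign = asr(x, 31);                    // 0 or -1
  uint32_t bias = uint32_t(sign) >> (32 - k);   // 0 or 2^k - 1
  return asr(x + int32_t(bias), k);
}
```

Both functions round toward zero like `x / (1 << k)`. In the second form the add-with-shift sits on the critical path behind the first `asr`, whereas in the first form the `add` and `cmp` are independent of each other.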

> And if we confirm that add-with-shift really is slower, how about GCC's implementation of mod (2^N)?

That looks like an improvement, sure.
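
The thread doesn't spell out the GCC sequence, so the following is only a sketch of what I assume is meant: mask both the value and its negation, then conditionally negate the result. The function name and the instruction sequence in the comment are assumptions for illustration; `k` is assumed to be in the range 1..30.

```cpp
#include <cstdint>

// Signed remainder by 2^k via negate, mask, and conditional negate.
// Assumed AArch64 shape, roughly:
//   negs w8, w0; and w9, w0, #mask; and w8, w8, #mask; csneg w0, w9, w8, mi
int32_t srem_pow2(int32_t x, unsigned k) {
  uint32_t mask = (uint32_t(1) << k) - 1;
  uint32_t ux = uint32_t(x);
  uint32_t negated = 0u - ux;            // wrapping negate, safe for INT32_MIN
  uint32_t low_pos = ux & mask;          // remainder if x is positive
  uint32_t low_neg = negated & mask;     // magnitude of remainder otherwise
  return x > 0 ? int32_t(low_pos) : -int32_t(low_neg);
}
```

For example, `srem_pow2(-7, 2)` gives -3, matching `-7 % 4`. Neither `and` depends on the other, and there is no add-with-shift on the critical path.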

https://reviews.llvm.org/D122829