[PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.
mcrosier at codeaurora.org
Wed Jul 9 11:30:45 PDT 2014
>>! In D4438#10, @jmolloy wrote:
> This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?
> I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).
In the specific EEMBC cases the csel is not removed, so I don't think that explains the speedup. My understanding is that the add with a shifted operand is rather expensive (at least on A53). I don't know whether this would also be an improvement on A57 or other processors, but I hope so.
> Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:
>> add w0, X, 15
>> cmp X, wzr
>> b.lt 2f
>> 1:
>> ... continue basic block
>> ... end basic block
>> 2:
>> mov X, w0
>> b 1b
That seems reasonable, but I don't have any way of testing this.
> Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?
I have not tried such a sequence. This was measured on an A53 device, which is the only device I have available.
If anyone could check this on A57/Cyclone I would greatly appreciate it.
BTW, gcc performs the same transformation.
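For reference, here is a rough C model of the arithmetic behind the two lowerings being compared. It is only a sketch: the function names are made up for illustration, the "shift" form is my reading of what the previous add-with-shifted-operand sequence computes, and it assumes 1 <= k <= 30 and that >> on a negative int32_t is an arithmetic shift (true for gcc and clang, but implementation-defined in C).

```c
#include <assert.h>
#include <stdint.h>

/* Select form (add + select + shift): add the bias 2^k - 1 only when
 * x is negative, then shift right arithmetically.
 * Hypothetical helper name; assumes 1 <= k <= 30. */
static int32_t sdiv_pow2_select(int32_t x, unsigned k)
{
    int32_t bias = (int32_t)((UINT32_C(1) << k) - 1);
    int32_t t = (x < 0) ? x + bias : x;   /* models the csel */
    return t >> k;                        /* assumed arithmetic shift */
}

/* Shift form: derive the bias from the sign bits instead of selecting.
 * This models an add whose second operand is a shifted sign mask. */
static int32_t sdiv_pow2_shift(int32_t x, unsigned k)
{
    uint32_t sign = (uint32_t)(x >> 31);  /* all ones if x < 0, else 0 */
    uint32_t bias = sign >> (32 - k);     /* 2^k - 1 if x < 0, else 0 */
    return (x + (int32_t)bias) >> k;
}

/* Compare both forms against C's truncating division on a small range. */
static void check_sdiv_pow2(void)
{
    for (unsigned k = 1; k <= 8; ++k) {
        int32_t d = (int32_t)1 << k;
        for (int32_t x = -1000; x <= 1000; ++x) {
            assert(sdiv_pow2_select(x, k) == x / d);
            assert(sdiv_pow2_shift(x, k) == x / d);
        }
    }
}
```

Both forms round toward zero like sdiv; the bias is what keeps a negative X from rounding toward negative infinity under a plain arithmetic shift.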