[PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

Wed Jul 9 11:30:45 PDT 2014

James,

>>! In D4438#10, @jmolloy wrote:
> This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?
> 
> I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

In the specific EEMBC cases the csel is not removed, so I don't think that explains it. My understanding is that the add with a shifted operand is rather expensive (at least on A53). I don't know whether this would also be an improvement on A57 or other processors, but I hope so.
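
For reference, here is a minimal C sketch of what the add + select + shift form computes, assuming a divisor of 16. The function name is made up for illustration and is not from the patch. It also assumes that >> on a negative int32_t is an arithmetic shift, which holds for gcc and clang but is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): signed divide by 16 (2^4),
 * rounding toward zero, written as add + select + shift. */
static int32_t sdiv16_select(int32_t x) {
    /* add: bias by 2^4 - 1. Done in unsigned arithmetic so the add wraps
     * like the hardware instruction instead of overflowing a signed int. */
    int32_t biased = (int32_t)((uint32_t)x + 15u);
    /* select: use the biased value only when x is negative (csel on "lt"). */
    int32_t t = (x < 0) ? biased : x;
    /* shift: assumed to be an arithmetic right shift (asr #4). */
    return t >> 4;
}
```

The bias matters only for negative inputs: without it, the shift would round toward negative infinity, whereas sdiv rounds toward zero.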

> Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:
> 
>> add w0, X, 15
>> cmp X, wzr
>> b.lt 2f
>> 1:
>> ... continue basic block
>> ... end basic block
>> 
>> 2:
>> mov X, w0
>> b 1b

That seems reasonable, but I don't have any way of testing this.

> Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

I have not tried such a sequence.  This was measured on an A53 device, which is the only device I have available.

If anyone could check this on A57/Cyclone I would greatly appreciate it.

BTW, gcc performs the same transformation.

> Cheers,
> 
> James

Thanks, James.

http://reviews.llvm.org/D4438