[PATCH] D140087: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count

Noah Goldstein via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 21 09:17:27 PST 2022


goldstein.w.n added a comment.

In D140087#4010506 <https://reviews.llvm.org/D140087#4010506>, @pengfei wrote:

> In D140087#4008910 <https://reviews.llvm.org/D140087#4008910>, @goldstein.w.n wrote:
>
>> In D140087#4008802 <https://reviews.llvm.org/D140087#4008802>, @goldstein.w.n wrote:
>>
>>> In D140087#4008747 <https://reviews.llvm.org/D140087#4008747>, @craig.topper wrote:
>>>
>>>>>>    3,782,331      port0                                                          
>>>>>>    3,207,023      port1                                                          
>>>>>>    1,001,220      port23                                                         
>>>>>>    3,216,022      port5                                                          
>>>>>>    4,940,975      port6                                                          
>>>>>>   11,575,101      port49                                                         
>>>>
>>>> I'm having trouble accounting for these numbers. As far as I know
>>>> shlx is 1 uop
>>>> mov is 1 uop
>>>> shr is 1 uop
>>>> and with load+store is 4 uops
>>>> dec is 1 uop
>>>> bnz is 1uop  and could possibly be macrofused with the dec.
>>>>
>>>> so that's 9 or maybe 8 with macrofusion uops per iteration. what am I missing?
>>>>
>>>> You're also missing counts on port 7 and 8 which is where the store AGU uops should go. The port 4 and 9 would be the store data uops.
>>>
>>> See my other comment, but I think the benchmark was misleading and the numbers
>>> where dramatically skewed by uop replay (maybe something else, but thats the only
>>> thing I can think of at the moment).
>>>
>>> Think you and @lebedev.ri  are right and unless there is intense register pressure
>>> or `-Os` it doesn't win out.
>>
>> If its all the same, would still like guidance about how to implement it. Think it may
>> be useful at the very least for `AtomicExpansionKind::BitTestIntrinsic`.
>
> `AtomicExpansionKind::BitTestIntrinsic` is different, it is intended to replace the expensive `lock cmpxchg`. It may not be beneficial in non atomic cases.

Yeah, I realized I needed new definitions for `int_x86_atomic_bts` that take gpr arguments.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140087/new/

https://reviews.llvm.org/D140087



More information about the llvm-commits mailing list