[PATCH] D140087: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count
Noah Goldstein via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Dec 20 11:39:11 PST 2022
goldstein.w.n added a comment.
In D140087#4008693 <https://reviews.llvm.org/D140087#4008693>, @craig.topper wrote:
> In D140087#4008678 <https://reviews.llvm.org/D140087#4008678>, @goldstein.w.n wrote:
>
>> In D140087#4008536 <https://reviews.llvm.org/D140087#4008536>, @craig.topper wrote:
>>
>>> In D140087#4008447 <https://reviews.llvm.org/D140087#4008447>, @goldstein.w.n wrote:
>>>
>>>> @pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.
>>>>
>>>> I was looking to add a peephole to change something like:
>>>>
>>>> ptr[x / 32] |= (1 << (x % 32))
>>>>
>>>> Currently codegen is something like:
>>>>
>>>> mov $0x1,%gpr1
>>>> shlx %cnt,%gpr1,%mask
>>>> shr $0x5,%cnt
>>>> or %mask, (%ptr, %cnt, 4)
>>>>
>>>> And it could be as simple as:
>>>>
>>>> bts %cnt, (%ptr)
>>>>
>>>> (other pattern with `bt{s|r|c}` could also be improved)
>>>>
>>>> I saw `one_bit_patterns` in `X86InstrCompiler` but don't see a way to extend
>>>> the peephole s.t `addr` is a function of the inputs and not just one of the inputs.
>>>>
>>>> Any chance you could direct me as where I should look at add this type of
>>>> peephole?
>>>
>>> `bts %cnt, (%ptr)` is a 10 or 11 uop instruction. It might not be better than current code.
>>
>> I think that translates to worse throughput (so worse in a tight loop iff no carried
>> dependency (better latency so if carried dependency still preferable)) but outside
>> of that once case have to imagine its a win.
>>
>> 1. Better latency.
>> 2. Less register pressure
>> 3. Less code size.
>> 4. Less Backend resources(unless this is some bizarre program thats retirement bound)
>>
>> on ICX:
>> Loop using `shlx` method with hoisted `movl $1, %gpr`. 1,000,000 iterations (with a `decl; jne` for loop impl)
>>
>> 3,782,331 port0
>> 3,207,023 port1
>> 1,001,220 port23
>> 3,216,022 port5
>> 4,940,975 port6
>> 11,575,101 port49
>>
>> Same loop using `btr`
>>
>> 2,055,213 port0
>> 1,298,859 port1
>> 1,000,372 port23
>> 1,505,077 port5
>> 3,261,176 port6
>> 1,088,049 port49
>>
>> The loop:
>>
>> .global _start
>> .p2align 6
>> .text
>> _start:
>> movl $1, %eax
>> movl $123, %ecx
>> leaq (buf_start)(%rip), %rdi
>>
>> movl $1000000, %edx
>>
>> loop:
>> #if 0
>> btr %rcx, (%rdi)
>> #else
>> shlx %ecx, %eax, %ebx
>> movl %ecx, %esi
>> shr $5, %esi
>> andl %ebx, (%rdi, %rsi, 4)
>> #endif
>> decl %edx
>> jnz loop
>>
>> movl $60, %eax
>> xorl %edi, %edi
>> syscall
>>
>> .section .data
>> .balign 4096
>> buf_start: .space 4096
>> buf_end:
>
> Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.
No (was suprised too! thats why I added the code), although looking a bit more
into it I think the benchmark probably isn't so good and is misleading.
if you add some arbitrary padding the perf numbers changed (and the p49 for the `shlx` version
goes down 1/cycle as expected).
If I had to guess something is going awry with uop replay <https://stackoverflow.com/questions/59905395/are-load-ops-deallocated-from-the-rs-when-they-dispatch-complete-or-some-other>.
Changing the benchmark:
.global _start
.p2align 6
.text
_start:
movl $1, %eax
movl $128, %ecx
leaq (buf_start)(%rip), %rdi
xorl %ebp, %ebp
movl $1000000, %edx
loop:
#if 1
btr %rcx, (%rdi)
#else
shlx %ecx, %eax, %ebx
movl %ecx, %esi
shr $5, %esi
andl %ebx, (%rdi, %rsi, 4)
#endif
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
decl %edx
jnz loop
movl $60, %eax
xorl %edi, %edi
syscall
.section .data
.balign 4096
buf_start: .space 4096
buf_end:
The `shlx` version performs much better.
1,224,172 p0
272,089 p1
1,000,431 p23
527,889 p5
2,210,403 p6
1,000,316 p78
1,217,652 p49
Versus the `btr` version:
2,300,942 p0
1,001,087 p1
1,000,150 p23
1,000,933 p5
4,001,759 p6
1,000,100 p78
1,300,149 p49
Which for some reason ends up bottenecking on
`p6` uops although should be able to schedule
on `p0` too.
Think I was wrong.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D140087/new/
https://reviews.llvm.org/D140087
More information about the llvm-commits
mailing list