[PATCH] D140087: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count
Craig Topper via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Dec 20 11:13:12 PST 2022
craig.topper added a comment.
In D140087#4008678 <https://reviews.llvm.org/D140087#4008678>, @goldstein.w.n wrote:
> In D140087#4008536 <https://reviews.llvm.org/D140087#4008536>, @craig.topper wrote:
>
>> In D140087#4008447 <https://reviews.llvm.org/D140087#4008447>, @goldstein.w.n wrote:
>>
>>> @pengfei Somewhat unrelated, so if this is not the right place to ask, can you let me know where is?
>>>
>>> I was looking to add a peephole to change something like:
>>>
>>> ptr[x / 32] |= (1 << (x % 32))
>>>
>>> Currently codegen is something like:
>>>
>>> mov $0x1,%gpr1
>>> shlx %cnt,%gpr1,%mask
>>> shr $0x5,%cnt
>>> or %mask, (%ptr, %cnt, 4)
>>>
>>> And it could be as simple as:
>>>
>>> bts %cnt, (%ptr)
>>>
>>> (Other patterns with `bt{s|r|c}` could also be improved.)
>>>
>>> I saw `one_bit_patterns` in `X86InstrCompiler` but don't see a way to extend
>>> the peephole such that `addr` is a function of the inputs and not just one of the inputs.
>>>
>>> Any chance you could direct me as to where I should look to add this type of
>>> peephole?
>>
>> `bts %cnt, (%ptr)` is a 10 or 11 uop instruction. It might not be better than the current code.
>
> I think that translates to worse throughput, so it is worse in a tight loop only if there is
> no loop-carried dependency (latency is better, so with a carried dependency it is still
> preferable). Outside of that one case, I have to imagine it's a win.
>
> 1. Better latency.
> 2. Less register pressure.
> 3. Smaller code size.
> 4. Fewer backend resources (unless this is some bizarre program that's retirement bound).
>
> On ICX:
> Loop using the `shlx` method with a hoisted `movl $1, %gpr`, 1,000,000 iterations (the loop itself is a `decl; jne` pair):
>
> 3,782,331 port0
> 3,207,023 port1
> 1,001,220 port23
> 3,216,022 port5
> 4,940,975 port6
> 11,575,101 port49
>
> Same loop using `btr`:
>
> 2,055,213 port0
> 1,298,859 port1
> 1,000,372 port23
> 1,505,077 port5
> 3,261,176 port6
> 1,088,049 port49
>
> The loop:
>
>   .global _start
>   .p2align 6
>   .text
> _start:
>   movl $1, %eax
>   movl $123, %ecx
>   leaq (buf_start)(%rip), %rdi
>
>   movl $1000000, %edx
>
> loop:
> #if 0
>   btr %rcx, (%rdi)
> #else
>   shlx %ecx, %eax, %ebx
>   movl %ecx, %esi
>   shr $5, %esi
>   andl %ebx, (%rdi, %rsi, 4)
> #endif
>   decl %edx
>   jnz loop
>
>   movl $60, %eax
>   xorl %edi, %edi
>   syscall
>
>   .section .data
>   .balign 4096
> buf_start: .space 4096
> buf_end:
Is the 11,575,101 count on port49 for the shlx version a typo? It's roughly 10x larger than the btr count (1,088,049).
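
For reference, the source-level idiom discussed above, `ptr[x / 32] |= (1 << (x % 32))`, can be sketched in C roughly as follows. This is only an illustration of the pattern a peephole would have to recognize, not code from the patch; the helper names `set_bit`, `clear_bit`, and `test_bit` are hypothetical, and the index is assumed to be unsigned and within the bounds of the bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers; the names are not from LLVM or the patch. */

/* Set bit x of a bitmap stored as 32-bit words (the idiom from the thread).
 * x / 32 selects the word, x % 32 selects the bit inside that word. */
static void set_bit(uint32_t *ptr, uint32_t x) {
    assert(ptr != NULL);
    ptr[x / 32] |= (UINT32_C(1) << (x % 32));
}

/* Clear bit x: the same addressing, with an inverted mask and AND. */
static void clear_bit(uint32_t *ptr, uint32_t x) {
    assert(ptr != NULL);
    ptr[x / 32] &= ~(UINT32_C(1) << (x % 32));
}

/* Read bit x back as 0 or 1. */
static int test_bit(const uint32_t *ptr, uint32_t x) {
    assert(ptr != NULL);
    return (int)((ptr[x / 32] >> (x % 32)) & 1u);
}
```

With `x = 123`, as in the benchmark loop, `x / 32` is 3 and `x % 32` is 27, so `set_bit` ORs `0x08000000` into the fourth word. Note that the memory form of `bts`/`btr` with a register bit offset treats the offset as signed, so it only matches this unsigned-index sketch when the index is known to be non-negative.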
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D140087/new/
https://reviews.llvm.org/D140087