[PATCH] D140087: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count

Tue Dec 20 11:13:12 PST 2022

craig.topper added a comment.

In D140087#4008678 <https://reviews.llvm.org/D140087#4008678>, @goldstein.w.n wrote:

> In D140087#4008536 <https://reviews.llvm.org/D140087#4008536>, @craig.topper wrote:
>
>> In D140087#4008447 <https://reviews.llvm.org/D140087#4008447>, @goldstein.w.n wrote:
>>
>>> @pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.
>>>
>>> I was looking to add a peephole to change something like:
>>>
>>>   ptr[x / 32] |= (1 << (x % 32))
>>>
>>> Currently codegen is something like:
>>>
>>>   mov    $0x1,%gpr1
>>>   shlx   %cnt,%gpr1,%mask
>>>   shr    $0x5,%cnt
>>>   or  %mask, (%ptr, %cnt, 4)
>>>
>>> And it could be as simple as:
>>>
>>>   bts %cnt, (%ptr)
>>>
>>> (other pattern with `bt{s|r|c}` could also be improved)
>>>
>>> I saw `one_bit_patterns` in `X86InstrCompiler` but don't see a way to extend
>>> the peephole s.t `addr` is a function of the inputs and not just one of the inputs.
>>>
>>> Any chance you could direct me as where I should look at add this type of
>>> peephole?
>>
>> `bts %cnt, (%ptr)` is a 10 or 11 uop instruction. It might not be better than current code.
>
> I think that translates to worse throughput (so worse in a tight loop iff no carried
> dependency (better latency so if carried dependency still preferable)) but outside
> of that once case have to imagine its a win.
>
> 1. Better latency.
> 2. Less register pressure
> 3. Less code size.
> 4. Less Backend resources(unless this is some bizarre program thats retirement bound)
>
> on ICX:
> Loop using `shlx` method with hoisted `movl $1, %gpr`. 1,000,000 iterations (with a `decl; jne` for loop impl)
>
>    3,782,331      port0                                                          
>    3,207,023      port1                                                          
>    1,001,220      port23                                                         
>    3,216,022      port5                                                          
>    4,940,975      port6                                                          
>   11,575,101      port49                                                         
>
> Same loop using `btr`
>
>   2,055,213      port0                                                          
>   1,298,859      port1                                                          
>   1,000,372      port23                                                         
>   1,505,077      port5                                                          
>   3,261,176      port6                                                          
>   1,088,049      port49                                                         
>
> The loop:
>
>   	.global	_start
>   	.p2align 6
>   	.text
>   _start:
>   	movl	$1, %eax
>   	movl	$123, %ecx
>   	leaq	(buf_start)(%rip), %rdi
>   
>   	movl	$1000000, %edx
>   
>   loop:
>   #if 0
>   	btr	%rcx, (%rdi)
>   #else
>   	shlx	%ecx, %eax, %ebx
>   	movl	%ecx, %esi
>   	shr	$5, %esi
>   	andl	%ebx, (%rdi, %rsi, 4)
>   #endif
>   	decl	%edx
>   	jnz	loop
>   
>   	movl	$60, %eax
>   	xorl	%edi, %edi
>   	syscall
>   
>   	.section .data
>   	.balign	4096
>   buf_start:	.space 4096
>   buf_end:

Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140087/new/

https://reviews.llvm.org/D140087