[PATCH] D140087: [X86] Replace (31/63 -/^ X) with (NOT X) and ignore (32/64 ^ X) when computing shift count

Tue Dec 20 11:39:11 PST 2022

goldstein.w.n added a comment.

In D140087#4008693 <https://reviews.llvm.org/D140087#4008693>, @craig.topper wrote:

> In D140087#4008678 <https://reviews.llvm.org/D140087#4008678>, @goldstein.w.n wrote:
>
>> In D140087#4008536 <https://reviews.llvm.org/D140087#4008536>, @craig.topper wrote:
>>
>>> In D140087#4008447 <https://reviews.llvm.org/D140087#4008447>, @goldstein.w.n wrote:
>>>
>>>> @pengfei Somewhat unrelated so if this is not the right place the ask, can you let me know where is.
>>>>
>>>> I was looking to add a peephole to change something like:
>>>>
>>>>   ptr[x / 32] |= (1 << (x % 32))
>>>>
>>>> Currently codegen is something like:
>>>>
>>>>   mov    $0x1,%gpr1
>>>>   shlx   %cnt,%gpr1,%mask
>>>>   shr    $0x5,%cnt
>>>>   or  %mask, (%ptr, %cnt, 4)
>>>>
>>>> And it could be as simple as:
>>>>
>>>>   bts %cnt, (%ptr)
>>>>
>>>> (other pattern with `bt{s|r|c}` could also be improved)
>>>>
>>>> I saw `one_bit_patterns` in `X86InstrCompiler` but don't see a way to extend
>>>> the peephole s.t `addr` is a function of the inputs and not just one of the inputs.
>>>>
>>>> Any chance you could direct me as where I should look at add this type of
>>>> peephole?
>>>
>>> `bts %cnt, (%ptr)` is a 10 or 11 uop instruction. It might not be better than current code.
>>
>> I think that translates to worse throughput (so worse in a tight loop iff no carried
>> dependency (better latency so if carried dependency still preferable)) but outside
>> of that once case have to imagine its a win.
>>
>> 1. Better latency.
>> 2. Less register pressure
>> 3. Less code size.
>> 4. Less Backend resources(unless this is some bizarre program thats retirement bound)
>>
>> on ICX:
>> Loop using `shlx` method with hoisted `movl $1, %gpr`. 1,000,000 iterations (with a `decl; jne` for loop impl)
>>
>>    3,782,331      port0                                                          
>>    3,207,023      port1                                                          
>>    1,001,220      port23                                                         
>>    3,216,022      port5                                                          
>>    4,940,975      port6                                                          
>>   11,575,101      port49                                                         
>>
>> Same loop using `btr`
>>
>>   2,055,213      port0                                                          
>>   1,298,859      port1                                                          
>>   1,000,372      port23                                                         
>>   1,505,077      port5                                                          
>>   3,261,176      port6                                                          
>>   1,088,049      port49                                                         
>>
>> The loop:
>>
>>   	.global	_start
>>   	.p2align 6
>>   	.text
>>   _start:
>>   	movl	$1, %eax
>>   	movl	$123, %ecx
>>   	leaq	(buf_start)(%rip), %rdi
>>   
>>   	movl	$1000000, %edx
>>   
>>   loop:
>>   #if 0
>>   	btr	%rcx, (%rdi)
>>   #else
>>   	shlx	%ecx, %eax, %ebx
>>   	movl	%ecx, %esi
>>   	shr	$5, %esi
>>   	andl	%ebx, (%rdi, %rsi, 4)
>>   #endif
>>   	decl	%edx
>>   	jnz	loop
>>   
>>   	movl	$60, %eax
>>   	xorl	%edi, %edi
>>   	syscall
>>   
>>   	.section .data
>>   	.balign	4096
>>   buf_start:	.space 4096
>>   buf_end:
>
> Is the 11,575,101 for port49 for the shlx version a typo? It's 10x larger than the btr version.

No (was suprised too! thats why I added the code), although looking a bit more
into it I think the benchmark probably isn't so good and is misleading.

if you add some arbitrary padding the perf numbers changed (and the p49 for the `shlx` version
goes down 1/cycle as expected).

If I had to guess something is going awry with uop replay <https://stackoverflow.com/questions/59905395/are-load-ops-deallocated-from-the-rs-when-they-dispatch-complete-or-some-other>.

Changing the benchmark:

  	.global	_start
  	.p2align 6
  	.text
  _start:
  	movl	$1, %eax
  	movl	$128, %ecx
  	leaq	(buf_start)(%rip), %rdi
  	xorl	%ebp, %ebp
  	movl	$1000000, %edx

  loop:
  #if 1
  	btr	%rcx, (%rdi)
  #else
  	shlx	%ecx, %eax, %ebx
  	movl	%ecx, %esi
  	shr	$5, %esi
  	andl	%ebx, (%rdi, %rsi, 4)
  #endif
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop
  	nop

  	decl	%edx
  	jnz	loop

  	movl	$60, %eax
  	xorl	%edi, %edi
  	syscall

  	.section .data
  	.balign	4096
  buf_start:	.space 4096
  buf_end:

The `shlx` version performs much better.

  1,224,172      p0                                                          
    272,089      p1                                                          
  1,000,431      p23                                                         
    527,889      p5                                                          
  2,210,403      p6                                                          
  1,000,316      p78                                                         
  1,217,652      p49                                           

Versus the `btr` version:

  2,300,942      p0                                                          
  1,001,087      p1                                                          
  1,000,150      p23                                                         
  1,000,933      p5                                                          
  4,001,759      p6                                                          
  1,000,100      p78                                                         
  1,300,149      p49                                                         

Which for some reason ends up bottenecking on
`p6` uops although should be able to schedule
on `p0` too.

Think I was wrong.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140087/new/

https://reviews.llvm.org/D140087