[PATCH] D27695: Add Instruction number to LSR cost model (PR23384)

Thu Jan 5 15:09:36 PST 2017

>I attach a runnable testcase foo.cc which is extracted from an internal benchmark. Compiled with -O2, it shows 1.5% degradation with the patch on my sandybridge desktop while for the original benchmark it shows 3% degradation on sandybridge and 5% on ivybridge machine.

I'm able to reproduce a ~1.5% regression on HSW.
On Atom I even get a gain.

We can fix the regression if we add the number of complex address
calculations to the LSR cost model. For x86 (+sandybridge) we can increase
the instruction number by 1 for each, say, 4 complex addresses. With
this the test stays unchanged and therefore the regression goes away.
When (and if) we add the instruction number to the LSR cost model under an
option, we can file a PR on this and I'll fix it in a separate patch.
When (and if) we put the LSR cost comparison into the target part, we can
accept it for HSW and Atom now as is.
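
A rough sketch of that adjustment, only to illustrate the arithmetic. This is
not the patch's code: LoopCost, adjustedInsns, isLess and the constant 4 are
hypothetical names and values. The counts are read off the two loops quoted
below (16 instructions and no indexed addresses in the base loop, 15
instructions and 8 base+index*scale addresses with the patch).

```cpp
// Hypothetical stand-in for a per-solution LSR cost; not LLVM's real class.
struct LoopCost {
  unsigned NumInsns;        // instructions in the loop body
  unsigned NumComplexAddrs; // memory operands using base + index*scale
};

// Assumed tuning knob: one extra "instruction" per this many complex addresses.
constexpr unsigned ComplexAddrsPerPenalty = 4;

// Instruction count with the complex-address penalty folded in.
constexpr unsigned adjustedInsns(const LoopCost &C) {
  return C.NumInsns + C.NumComplexAddrs / ComplexAddrsPerPenalty;
}

// True if A should be preferred over B under the adjusted count.
constexpr bool isLess(const LoopCost &A, const LoopCost &B) {
  return adjustedInsns(A) < adjustedInsns(B);
}

// Counts taken from the two inner loops in the quoted mail.
constexpr LoopCost Base{16, 0};
constexpr LoopCost Patched{15, 8};

// Without the penalty the patched loop wins (15 < 16);
// with it the base loop wins (16 < 15 + 8/4 = 17).
static_assert(Patched.NumInsns < Base.NumInsns, "raw count prefers patched");
static_assert(isLess(Base, Patched), "adjusted count prefers base");
```

With these numbers the base loop is chosen again, which matches "the test
stays unchanged" above. The right divisor would have to be tuned per target.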

Thanks,
Evgeny

On Wed, Jan 4, 2017 at 4:18 PM, Wei Mi <wmi at google.com> wrote:
>
>
> On Wed, Jan 4, 2017 at 12:55 PM, Wei Mi <wmi at google.com> wrote:
>>
>>
>>
>> On Wed, Jan 4, 2017 at 11:51 AM, Evgeny Stupachenko via Phabricator
>> <reviews at reviews.llvm.org> wrote:
>>>
>>> evstupac added a comment.
>>>
>>> Quentin,
>>>
>>>   I've put first part in a separate review:
>>> https://reviews.llvm.org/D28307
>>>
>>> Wei,
>>>
>>>   Did you have a chance to test the patch performance on your benchmarks?
>>>
>>
>> Yes, I ran the patch through internal benchmarks. It is flat overall
>> except for two regressions. I looked into one and am trying to reduce a
>> testcase from it. The other one is probably from the same cause, but I will verify.
>>
>> Thanks,
>> Wei.
>>
>
>
> The two regressions mentioned above are from the same cause.
>
> I attach a runnable testcase foo.cc which is extracted from an internal
> benchmark. Compiled with -O2, it shows 1.5% degradation with the patch on my
> sandybridge desktop while for the original benchmark it shows 3% degradation
> on sandybridge and 5% on ivybridge machine.
>
> The instruction number for the testcase is actually reduced with the patch,
> but stalled-cycles-backend is increased significantly because the patch uses
> many more memory accesses with complex addressing mode.
>
> Base:
> .LBB0_12:                               #   Parent Loop BB0_2 Depth=1
>                                         # =>  This Inner Loop Header: Depth=2
>         movsd   -24(%rcx), %xmm1        # xmm1 = mem[0],zero
>         movsd   -16(%rcx), %xmm2        # xmm2 = mem[0],zero
>         mulsd   -24(%rdx), %xmm1
>         addsd   %xmm0, %xmm1
>         mulsd   -16(%rdx), %xmm2
>         addsd   %xmm1, %xmm2
>         movsd   -8(%rcx), %xmm1         # xmm1 = mem[0],zero
>         mulsd   -8(%rdx), %xmm1
>         addsd   %xmm2, %xmm1
>         movsd   (%rcx), %xmm0           # xmm0 = mem[0],zero
>         mulsd   (%rdx), %xmm0
>         addsd   %xmm1, %xmm0
>         addq    $32, %rdx
>         addq    $32, %rcx
>         addq    $-4, %rdi
>         jne     .LBB0_12
>
> With the patch:
> .LBB0_12:                               #   Parent Loop BB0_2 Depth=1
>                                         # =>  This Inner Loop Header: Depth=2
>         movsd   -24(%rdi,%rbx,8), %xmm1 # xmm1 = mem[0],zero
>         mulsd   -24(%rcx,%rbx,8), %xmm1
>         addsd   %xmm0, %xmm1
>         movsd   -16(%rdi,%rbx,8), %xmm0 # xmm0 = mem[0],zero
>         mulsd   -16(%rcx,%rbx,8), %xmm0
>         addsd   %xmm1, %xmm0
>         movsd   -8(%rdi,%rbx,8), %xmm1  # xmm1 = mem[0],zero
>         mulsd   -8(%rcx,%rbx,8), %xmm1
>         addsd   %xmm0, %xmm1
>         movsd   (%rdi,%rbx,8), %xmm0    # xmm0 = mem[0],zero
>         mulsd   (%rcx,%rbx,8), %xmm0
>         addsd   %xmm1, %xmm0
>         addq    $4, %rbx
>         cmpq    %rbx, %rdx
>         jne     .LBB0_12
>
> Thanks,
> Wei.
>
>>>
>>> Repository:
>>>   rL LLVM
>>>
>>> https://reviews.llvm.org/D27695
>>>
>>>
>>>
>>
>