[PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.

Smith, Kevin B kevin.b.smith at intel.com
Tue Apr 28 19:11:10 PDT 2015


OK.

-----Original Message-----
From: Quentin Colombet [mailto:qcolombet at apple.com] 
Sent: Tuesday, April 28, 2015 4:53 PM
To: Smith, Kevin B
Cc: reviews+D9209+public+d99c88a751bf019f at reviews.llvm.org; chandlerc at gmail.com; chisophugis at gmail.com; llvm-dev at redking.me.uk; Demikhovsky, Elena; Commit Messages and Patches for LLVM
Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.

The ExecDomainFixer made sense only for the xor, mov approach, which you proved is worthless :).

I now think a new pass is more suitable.

Q.

> On Apr 28, 2015, at 4:43 PM, Smith, Kevin B <kevin.b.smith at intel.com> wrote:
> 
> Quentin,
> 
> Thanks for pointing me at ExecDomainFixer.  I'll go take a look at that, see if I can do some work there to solve this.
> Others, please pipe in if you think this isn't a desirable direction, or think there is a better one.
> 
> Kevin
> 
> -----Original Message-----
> From: Quentin Colombet [mailto:qcolombet at apple.com] 
> Sent: Tuesday, April 28, 2015 4:40 PM
> To: Smith, Kevin B
> Cc: reviews+D9209+public+d99c88a751bf019f at reviews.llvm.org; chandlerc at gmail.com; chisophugis at gmail.com; llvm-dev at redking.me.uk; Demikhovsky, Elena; Commit Messages and Patches for LLVM
> Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
> 
> 
>> On Apr 28, 2015, at 4:36 PM, Smith, Kevin B <kevin.b.smith at intel.com> wrote:
>> 
>> For register-register moves, here are the ones that the Opt manual says are zero-cost (done through renaming):
>> 
>> MOV reg32, reg32
>> MOV reg64, reg64
>> MOVUPD/MOVAPD xmm, xmm
>> MOVUPD/MOVAPD ymm, ymm
>> MOVUPS/MOVAPS xmm, xmm
>> MOVUPS/MOVAPS ymm, ymm
>> MOVDQA/MOVDQU xmm, xmm
>> MOVDQA/MOVDQU ymm, ymm
>> MOVZX reg32, reg8 (if not AH/BH/CH/DH)
>> MOVZX reg64, reg8 (if not AH/BH/CH/DH)
>> 
>> So, the movzbl is covered.  Neither the 8- nor the 16-bit register-register moves are covered here, nor is the movzwl.  The 8- and
>> 16-bit moves are specifically called out as not zero cost.
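>> 
>> To illustrate (a rough sketch of mine, AT&T syntax): a 16-bit register-register move writes only the low 16 bits, so it must merge with, and therefore read, the old upper bits of the destination, whereas the movzwl form fully defines the destination:
>> 
>>   movw   %cx, %ax        # writes AX only; the merge reads the old EAX value
>>   movzwl %cx, %eax       # writes all of EAX; the old value is fully killed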
> 
> Hmm, I guess my recollection is bad; I thought plain mov, even the 8- and 16-bit forms, had faster throughput than movzx and co.
> 
> Thanks for checking.
> 
> Q.
> 
>> 
>> Kevin
>> 
>> -----Original Message-----
>> From: Quentin Colombet [mailto:qcolombet at apple.com] 
>> Sent: Tuesday, April 28, 2015 4:28 PM
>> To: Smith, Kevin B
>> Cc: reviews+D9209+public+d99c88a751bf019f at reviews.llvm.org; chandlerc at gmail.com; chisophugis at gmail.com; llvm-dev at redking.me.uk; Demikhovsky, Elena
>> Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
>> 
>> 
>>> On Apr 28, 2015, at 4:25 PM, Smith, Kevin B <kevin.b.smith at intel.com> wrote:
>>> 
>>> Code-size-wise, movzwl/movzbl is a win.  It takes 3 bytes vs 4 for the xor, movb sequence, and 3 bytes vs 5 for xor, movw.
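>>> 
>>> For instance (byte counts from my own check; AT&T syntax, register operands):
>>> 
>>>   movzbl %cl, %eax      # 0f b6 c1   3 bytes
>>> vs.
>>>   xorl   %eax, %eax     # 31 c0      2 bytes
>>>   movb   %cl, %al       # 88 c8      2 bytes  (4 total)
>>> and
>>>   movzwl %cx, %eax      # 0f b7 c1   3 bytes
>>> vs.
>>>   xorl   %eax, %eax     # 31 c0      2 bytes
>>>   movw   %cx, %ax       # 66 89 c8   3 bytes  (5 total)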
>> 
>> For code size, sure.
>> 
>>> 
>>> My expectation is that movz would be either faster or the same from an execution perspective (compared to xor, mov)
>>> across a broader range of processor implementations.
>> 
>> Not sure. The plain moves could probably be eliminated by register renaming; I am not sure that is the case for movz.
>> 
>> Q.
>> 
>>> 
>>> Kevin
>>> 
>>> -----Original Message-----
>>> From: Quentin Colombet [mailto:qcolombet at apple.com] 
>>> Sent: Tuesday, April 28, 2015 3:46 PM
>>> To: Smith, Kevin B
>>> Cc: reviews+D9209+public+d99c88a751bf019f at reviews.llvm.org; chandlerc at gmail.com; chisophugis at gmail.com; llvm-dev at redking.me.uk; Demikhovsky, Elena
>>> Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
>>> 
>>> 
>>>> On Apr 28, 2015, at 3:28 PM, Smith, Kevin B <kevin.b.smith at intel.com> wrote:
>>>> 
>>>> For ease of others, here is the comment I added to 23155
>>>> 
>>>> As with Agner's comments in 17113, I agree that the newer Intel architectures don't really suffer from partial register stalls in the sense that the Pentium Pro, Pentium 4, and older architectures did. As noted in 17113:
>>>> 
>>>> * There is no penalty on Haswell for partial register access. 
>>>> * On Sandy Bridge, the cost is a single uop that gets automatically inserted at the cost of 1 cycle latency.
>>>> * On Ivy Bridge there is no penalty except for the "high" byte subregs (AH, BH, etc.), in which case it behaves like Sandy Bridge.
>>>> 
>>>> However, whenever a partial register is the destination of an operation that doesn't otherwise need to read the register (as with movw and movb),
>>>> the partial write creates a read dependence on the upper portion of the register.  If
>>>> a movzbl or movzwl is used instead, then the destination register is fully killed, eliminating this "false" dependence on the upper portion of the register.  This issue affects both word and byte operations.  It is worth noting, though, that this only really matters in relatively tight loops where the false dependence arc becomes loop carried, because that loop-carried dependence keeps the out-of-order processor from overlapping multiple iterations of the loop.
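>>>> 
>>>> To make the loop-carried case concrete, here is a hypothetical tight loop (AT&T syntax):
>>>> 
>>>>   .Lloop:
>>>>     movb   (%rdi,%rcx), %al   # partial def of EAX: the implicit merge
>>>>                               # reads the EAX produced by the previous
>>>>                               # iteration, serializing the loads
>>>>     addb   %al, %dl
>>>>     incq   %rcx
>>>>     cmpq   %rsi, %rcx
>>>>     jne    .Lloop
>>>> 
>>>> Rewriting the movb as "movzbl (%rdi,%rcx), %eax" fully defines EAX on every iteration, so the cross-iteration arc disappears and the out-of-order engine can overlap iterations.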
>>>> 
>>>> From Chandler's comments in 22473:
>>>> We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.
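>>>> 
>>>> Roughly, such a pass would rewrite (my sketch):
>>>> 
>>>>   movb   (%rdi), %cl     # partial def; high bits of ECX never read
>>>>   movw   (%rsi), %dx     # likewise for EDX
>>>> 
>>>> into
>>>> 
>>>>   movzbl (%rdi), %ecx    # full def of ECX; false dependence removed
>>>>   movzwl (%rsi), %edx    # full def of EDX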
>>> 
>>> I haven’t actually checked the machine IR, but I am guessing we should have something like:
>>> vreg1 = movb ... ; vreg1 GR8
>>> 
>>> If we change that to something like:
>>> vreg0 = IMPLICIT_DEF ; vreg0 GR32
>>> vreg1 = movb … <undef, imp-use, tied>vreg0
>>> 
>>> We could rely on the logic that already removes the false dependency in the ExecDomainFixer (IIRC); that would generate:
>>> dstReg = xor dstReg, dstReg
>>> dstReg = movb …
>>> 
>>> The bottom line is that this may just work without any additional pass.
>>> That being said, one should check whether xor -> movb is better than movzbl… I believe it is, but I haven’t checked.
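>>> 
>>> That is, the comparison would be (sketch):
>>> 
>>>   xorl   %eax, %eax     # zeroing idiom, recognized as dependency-breaking
>>>   movb   (%rdi), %al    # partial write, but %eax was just fully defined
>>> 
>>> versus the single
>>> 
>>>   movzbl (%rdi), %eax   # one instruction that fully defines %eax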
>>> 
>>> Q.
>>> 
>>>> 
>>>> I am in agreement that this would be a good solution.  If you, Chandler, and Eric all like that direction, I will be willing to work on it.  I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.
>>>> 
>>>> Kevin
>>>> 
>>>> -----Original Message-----
>>>> From: Sanjay Patel [mailto:spatel at rotateright.com] 
>>>> Sent: Tuesday, April 28, 2015 3:18 PM
>>>> To: Smith, Kevin B; chandlerc at gmail.com; qcolombet at apple.com; chisophugis at gmail.com; llvm-dev at redking.me.uk; Demikhovsky, Elena; spatel at rotateright.com
>>>> Subject: Re: [PATCH] PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
>>>> 
>>>> Hi Kevin -
>>>> 
>>>> Roping in some other potentially interested reviewers based on past activity.
>>>> 
>>>> I also added some comments to https://llvm.org/bugs/show_bug.cgi?id=23155 and linked some other partial reg update bugs.
>>>> 
>>>> We need some clarification on what the expected behavior is wrt partial reg updates across the various micro-architectures. E.g., I'm unable to reproduce all of your Haswell perf results locally...which seems to line up with Agner's advice, but then we definitely see a perf hit on bzip2 in https://llvm.org/bugs/show_bug.cgi?id=22473 ...but maybe there are different factors in play there and we're confusing the issues?
>>>> 
>>>> 
>>>> REPOSITORY
>>>> rL LLVM
>>>> 
>>>> http://reviews.llvm.org/D9209
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 