[llvm-dev] RFC: Insertion of nops for performance stability

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Sun Nov 20 13:26:21 PST 2016


----- Original Message -----

> From: "Paparo Bivas, Omer" <omer.paparo.bivas at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvm-dev at lists.llvm.org
> Sent: Sunday, November 20, 2016 8:08:20 AM
> Subject: RE: [llvm-dev] RFC: Insertion of nops for performance
> stability

> Hi Hal,

> A pre-emit pass would indeed be preferable. I originally thought of
> it, too; however, I could not figure out how such a pass could have
> access to information on instruction sizes and block alignments.
> I know that for X86, at least, branch relaxation happens during the
> layout phase in the Assembler, which is where I plan to integrate
> the nop insertion: the new MCPerfNopFragment will just be yet
> another kind of fragment to handle when laying out all the
> fragments. Laying out a relaxable branch relaxes it if necessary,
> and laying out an MCPerfNopFragment will compute the number of nops
> it will contain.

> Can you please refer me to a pre-emit pass that does something
> similar?
Just as a quick example, the PowerPC branch-relaxation pass, lib/Target/PowerPC/PPCBranchSelector.cpp, uses TII->getInstSizeInBytes to determine instruction sizes, and combines that size information with the alignment of each block (MBB.getAlignment()) and of the containing function (MBB.getParent()->getAlignment()) to calculate the alignment padding that will be inserted. Perhaps I'm missing some complexity, but I'd think that nops could be inserted using a very similar algorithm. 
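To make the suggestion concrete, here is a toy sketch of the offset computation such a pre-emit pass would do, outside of LLVM. The `Inst` and `Block` structs and `computeBlockOffsets` are hypothetical stand-ins of my own: in LLVM the size would come from `TII->getInstSizeInBytes(MI)` and the (log2) alignment from `MBB.getAlignment()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins for MachineInstr / MachineBasicBlock (hypothetical; in
// LLVM the size comes from TII->getInstSizeInBytes(MI) and the log2
// alignment from MBB.getAlignment()).
struct Inst { unsigned SizeInBytes; };
struct Block {
  unsigned Log2Align;            // block alignment, as a power of two
  std::vector<Inst> Insts;
};

// Walk the function the way a pre-emit pass would: round each block's
// start offset up to its alignment, then accumulate instruction sizes.
// Returns the byte offset of every block's first instruction.
std::vector<uint64_t> computeBlockOffsets(const std::vector<Block> &Fn) {
  std::vector<uint64_t> Offsets;
  uint64_t Off = 0;
  for (const Block &B : Fn) {
    uint64_t Align = uint64_t(1) << B.Log2Align;
    Off = (Off + Align - 1) & ~(Align - 1); // alignment padding
    Offsets.push_back(Off);
    for (const Inst &I : B.Insts)
      Off += I.SizeInBytes;
  }
  return Offsets;
}
```

With these offsets in hand, a pass can decide where a hotspot crosses or crowds a 16/32-byte window and insert nops accordingly, much as PPCBranchSelector decides when a branch is out of range.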

Two other thoughts: 

1. You can "insert nops" in some (many?) relevant cases by splitting the block and adjusting the block's alignment. 
2. You might want to fill the slots with other instructions which can be speculated from successor blocks. 

Thanks again, 
Hal 

> Thanks,
> Omer

> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Friday, November 18, 2016 20:57
> To: Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
> Cc: llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] RFC: Insertion of nops for performance
> stability

> ----- Original Message -----

> > From: "Paparo Bivas, Omer via llvm-dev" < llvm-dev at lists.llvm.org >
> 
> > To: llvm-dev at lists.llvm.org
> 
> > Sent: Thursday, November 17, 2016 3:55:43 AM
> 
> > Subject: [llvm-dev] RFC: Insertion of nops for performance
> > stability
> 
> > Hi all,
> 

> > These days I am working on a feature designed to insert nops during
> > IA code generation in order to provide performance improvements and
> > performance stability. This feature will not affect other
> > architectures. It will, however, set up an infrastructure for other
> > architectures to do the same, if ever needed.
> 

> > Here are some examples of cases in which nops can improve
> > performance:
> 
> > 1. DSB (Decoded Stream Buffer) Thrashing.
> 
> > DSB is a cache for uops that have been decoded. It is limited to 3
> > ways per 32B window; 4 or more ways will cause a performance
> > degradation. This can be avoided by aligning with nops. See Zia
> > Ansari's presentation for more information:
> > http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf
> 
> > 2. Two or more branches with the same target in a single
> > instruction
> > window may cause poor branch prediction.
> 
> > Our recommendation to programmers is to avoid placing two branches
> > in the same 16B chunk if they both go to the same target. From the
> > manual:
> 
> > “Avoid putting two conditional branch instructions in a loop so
> > that
> > both have the same branch target address and, at the same time,
> > belong to (i.e. have their last bytes’ addresses within) the same
> > 16-byte aligned code block.”
> 
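The rule quoted above can be checked mechanically: two branches "belong to" the same 16-byte aligned block exactly when the addresses of their last bytes share the same value of address >> 4. A small sketch (the helper name is mine, not an LLVM API):

```cpp
#include <cassert>
#include <cstdint>

// A branch "belongs to" the 16-byte aligned block that contains the
// address of its last byte, i.e. block index = lastByte >> 4.
bool sameSixteenByteBlock(uint64_t BranchAOff, unsigned BranchASize,
                          uint64_t BranchBOff, unsigned BranchBSize) {
  uint64_t LastA = BranchAOff + BranchASize - 1;
  uint64_t LastB = BranchBOff + BranchBSize - 1;
  return (LastA >> 4) == (LastB >> 4);
}
```

In the example below, the 2-byte je at 0x20 and the 2-byte jg at 0x2B have last bytes 0x21 and 0x2C, both in the block [0x20, 0x2F], so the check fires; after a 4-byte nop pushes the jg to 0x2F, its last byte 0x30 falls in the next block and the check no longer fires.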

> > Let’s look at an example of the second case. Consider the following
> > code:
> 

> > int y;
> > int x;
> >
> > int foo(int i, int m, int p, int q, int *p1, int *p2)
> > {
> >   if ( i+*p1 == p || i-*p2 > m || i == q ) {
> >     y++;
> >     x++;
> >   }
> >   return 0;
> > }
> 

> > The last two || clauses of the if will be translated into two
> > conditional jumps to the same target. The generated code will look
> > like this:
> 

> > foo:
> >  0: 48 8b 44 24 28    movq 40(%rsp), %rax
> >  5: 8b 00             movl (%rax), %eax
> >  7: 01 c8             addl %ecx, %eax
> >  9: 44 39 c0          cmpl %r8d, %eax
> >  c: 75 0f             jne 15 <foo+0x1D>
> >  e: ff 05 00 00 00 00 incl (%rip)
> > 14: ff 05 00 00 00 00 incl (%rip)
> > 1a: 31 c0             xorl %eax, %eax
> > 1c: c3                retq
> > 1d: 44 39 c9          cmpl %r9d, %ecx
> > 20: 74 ec             je -20 <foo+0xE>
> > 22: 48 8b 44 24 30    movq 48(%rsp), %rax
> > 27: 2b 08             subl (%rax), %ecx
> > 29: 39 d1             cmpl %edx, %ecx
> > 2b: 7f e1             jg -31 <foo+0xE>
> > 2d: 31 c0             xorl %eax, %eax
> > 2f: c3                retq
> 

> > Note: the first if clause jump is the jne instruction at 0x0C, the
> > second if clause jump is the jg instruction at 0x2B, and the third
> > if clause jump is the je instruction at 0x20. Also note that the jg
> > and je share a 16-byte window, which is exactly the situation we
> > wish to avoid (consider the case in which foo is called from inside
> > a loop; this will cause a performance penalty).
> 

> > The generated code after my changes will look like so:
> 

> > foo:
> >  0: 48 8b 44 24 28    movq 40(%rsp), %rax
> >  5: 8b 00             movl (%rax), %eax
> >  7: 01 c8             addl %ecx, %eax
> >  9: 44 39 c0          cmpl %r8d, %eax
> >  c: 75 0f             jne 15 <foo+0x1D>
> >  e: ff 05 00 00 00 00 incl (%rip)
> > 14: ff 05 00 00 00 00 incl (%rip)
> > 1a: 31 c0             xorl %eax, %eax
> > 1c: c3                retq
> > 1d: 44 39 c9          cmpl %r9d, %ecx
> > 20: 74 ec             je -20 <foo+0xE>
> > 22: 48 8b 44 24 30    movq 48(%rsp), %rax
> > 27: 2b 08             subl (%rax), %ecx
> > 29: 39 d1             cmpl %edx, %ecx
> > 2b: 0f 1f 40 00       nopl (%rax)
> > 2f: 7f dd             jg -35 <foo+0xE>
> > 31: 31 c0             xorl %eax, %eax
> > 33: c3                retq
> 

> > The only change is the nopl at 0x2B, before the jg, which pushes
> > the jg instruction into the next 16-byte window and thus avoids the
> > undesired situation. Note that, in this case, in order to push an
> > instruction into the next window it is sufficient to push its last
> > byte into the next window.
> 
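The "push the last byte into the next window" computation above can be sketched directly. The function name is mine, for illustration only: given an instruction's offset and size, the minimum padding is whatever moves its last byte just past the next 16-byte boundary.

```cpp
#include <cassert>
#include <cstdint>

// Minimum number of padding bytes needed so that the last byte of an
// instruction currently at offset Off (Size bytes long) lands in the
// next 16-byte window; moving only the last byte over the boundary is
// sufficient.
unsigned paddingToNextWindow(uint64_t Off, unsigned Size) {
  uint64_t LastByte = Off + Size - 1;
  uint64_t NextWindow = (LastByte & ~uint64_t(0xf)) + 16;
  return unsigned(NextWindow - LastByte);
}
```

For the 2-byte jg at 0x2B, the last byte is 0x2C, the next window starts at 0x30, and the required padding is 4 bytes, which matches the 4-byte nopl in the listing.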

> > Due to the nature of those performance nops, which often rely on
> > the layout of the code (e.g. the number of instructions in a 16/32B
> > window), which is only known at assembly time, the insertion is
> > divided into two phases:
> 
> > 1. Marking "suspicious" places during encoding, i.e. hotspots that
> > may cause a performance penalty.
> 
> > 2. Computing the actual number (zero or greater) of nops required
> > to avoid a performance penalty.
> 

> > In order to achieve that, I am introducing a new kind of MCFragment
> > named MCPerfNopFragment. It will be the connecting link between the
> > two phases:
> 
> > 1. During encoding, the object streamer (MCObjectStreamer) will
> > query the backend about the need for an MCPerfNopFragment before
> > every encoded instruction. If required, an MCPerfNopFragment will
> > be created and inserted into the MCSection.
> 
> > 2. During the assembly phase:
> 
> > a. When computing the fragment size in
> > MCAssembler::computeFragmentSize, the Assembler will consult the
> > backend and provide it with the layout in order to determine the
> > required size of the fragment. Note that the computed size may
> > change from call to call, similarly to the process of computing the
> > size of an MCAlignFragment.
> 
> > b. When writing the fragment, the assembler will again use the
> > backend to write the required number of nops.
> 
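The two-phase scheme described above can be modeled in miniature: a perf-nop fragment's size depends on the layout, so the layout loop re-computes fragment sizes until they converge, much as the assembler already iterates for relaxable branches and MCAlignFragments. Everything below (the `Fragment` struct, the `NopPolicy` callback) is a deliberately simplified, hypothetical stand-in for the MC layer, not the proposed patch itself.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical miniature of the MC layout loop. A Data fragment has a
// fixed size; a PerfNop fragment asks a backend callback how many nop
// bytes it needs at its current offset (mirroring how the proposal has
// MCAssembler::computeFragmentSize consult the backend).
struct Fragment {
  enum Kind { Data, PerfNop } K;
  unsigned FixedSize = 0;    // for Data fragments
  unsigned ComputedSize = 0; // for PerfNop fragments
};

using NopPolicy = std::function<unsigned(uint64_t Off)>;

// Lay fragments out repeatedly until no PerfNop fragment changes size;
// each pass accumulates offsets and re-queries the policy.
void layout(std::vector<Fragment> &Frags, const NopPolicy &NopsAt) {
  bool Changed = true;
  while (Changed) {
    Changed = false;
    uint64_t Off = 0;
    for (Fragment &F : Frags) {
      if (F.K == Fragment::PerfNop) {
        unsigned NewSize = NopsAt(Off);
        if (NewSize != F.ComputedSize) {
          F.ComputedSize = NewSize;
          Changed = true;
        }
        Off += F.ComputedSize;
      } else {
        Off += F.FixedSize;
      }
    }
  }
}
```

A policy that pads the next instruction to a 16-byte boundary, for instance, would return `(16 - Off % 16) % 16`; the loop stabilizes once every perf-nop fragment's size stops changing, which is the same fixed-point behavior branch relaxation relies on.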

> > Comments and questions are welcome.
> 
> Do you need to insert the nops at such a late stage? We have other
> components that insert nops during post-RA scheduling, for example.
> Even later, nops could be inserted during a pre-emit pass (similar
> to where other backends do branch relaxation/selection). The branch
> relaxation process also relies on knowing the layout of the code,
> but gathers this information from the available information on
> instruction sizes, block alignments, etc.

> -Hal
> > Thanks,
> 
> > Omer
> 
> > ---------------------------------------------------------------------
> 
> > Intel Israel (74) Limited
> 
> > This e-mail and any attachments may contain confidential material
> > for
> 
> > the sole use of the intended recipient(s). Any review or
> > distribution
> 
> > by others is strictly prohibited. If you are not the intended
> 
> > recipient, please contact the sender and delete all copies.
> 

> > _______________________________________________
> 
> > LLVM Developers mailing list
> 
> > llvm-dev at lists.llvm.org
> 
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> --

> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 