[llvm-dev] RFC: Insertion of nops for performance stability

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Sun Nov 20 13:35:22 PST 2016


----- Original Message -----

> From: "Hal Finkel via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Paparo Bivas, Omer" <omer.paparo.bivas at intel.com>
> Cc: llvm-dev at lists.llvm.org
> Sent: Sunday, November 20, 2016 3:26:21 PM
> Subject: Re: [llvm-dev] RFC: Insertion of nops for performance
> stability

> ----- Original Message -----

> > From: "Paparo Bivas, Omer" <omer.paparo.bivas at intel.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvm-dev at lists.llvm.org
> > Sent: Sunday, November 20, 2016 8:08:20 AM
> > Subject: RE: [llvm-dev] RFC: Insertion of nops for performance
> > stability
> 
> > Hi Hal,

> > A pre-emit pass will indeed be preferable. I originally thought of
> > it too; however, I could not figure out how such a pass could have
> > access to information about instruction sizes and block alignments.
> > I know that, for X86 at least, branch relaxation happens during the
> > layout phase in the assembler, where I plan to integrate the nop
> > insertion so that the new MCPerfNopFragment will just be yet another
> > kind of fragment to handle when laying out all the fragments: layout
> > for a relaxable branch relaxes it if necessary, and layout for a
> > MCPerfNopFragment will compute the number of nops it will contain.
> 
> > Can you please refer me to a pre-emit pass that does something
> > similar?
> Just as a quick example, the PowerPC branch relaxation pass,
> lib/Target/PowerPC/PPCBranchSelector.cpp, uses
> TII->getInstSizeInBytes to determine instruction sizes, and it
> calculates the alignment padding that will be inserted by combining
> that instruction-size information with the alignment of each block
> (MBB.getAlignment()) and of the containing function
> (MBB.getParent()->getAlignment()). Perhaps I'm missing some
> complexity, but I'd think that nops could be inserted using a very
> similar algorithm.
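As a rough illustration of that algorithm (a standalone toy, not the actual PPCBranchSelector code; the `Block` struct and `instOffsets` name are invented here), each instruction's byte offset can be computed by walking the blocks, padding up to each block's alignment, and accumulating per-instruction sizes:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a pre-emit layout computation. AlignLog2 stands in for
// MBB.getAlignment() and InstSizes for TII->getInstSizeInBytes results.
struct Block {
  unsigned AlignLog2;
  std::vector<unsigned> InstSizes;
};

// Returns the byte offset of every instruction, accounting for the
// padding that block alignment will introduce.
std::vector<uint64_t> instOffsets(const std::vector<Block> &Blocks) {
  std::vector<uint64_t> Offsets;
  uint64_t Off = 0;
  for (const Block &B : Blocks) {
    uint64_t Align = 1ULL << B.AlignLog2;
    Off = (Off + Align - 1) & ~(Align - 1); // pad up to the block alignment
    for (unsigned Size : B.InstSizes) {
      Offsets.push_back(Off);
      Off += Size;
    }
  }
  return Offsets;
}
```

Once every instruction's offset is known, nops can be inserted wherever an offset puts a problematic instruction into an undesirable window.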

> Two other thoughts:

> 1. You can "insert nops" in some (many?) relevant cases by splitting
> the block and adjusting the block's alignment.
> 2. You might want to fill the slots with other instructions which can
> be speculated from successor blocks.

Regarding this second point, I recall Philip telling me that the Java Hotspot compiler has some code-placement optimization which does this by design. Philip, am I recalling correctly? 

-Hal 

> Thanks again,
> Hal

> > Thanks,
> > Omer

> > From: Hal Finkel [mailto:hfinkel at anl.gov]
> > Sent: Friday, November 18, 2016 20:57
> > To: Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
> > Cc: llvm-dev at lists.llvm.org
> > Subject: Re: [llvm-dev] RFC: Insertion of nops for performance
> > stability
> 
> > ----- Original Message -----
> > > From: "Paparo Bivas, Omer via llvm-dev" <llvm-dev at lists.llvm.org>
> > > To: llvm-dev at lists.llvm.org
> > > Sent: Thursday, November 17, 2016 3:55:43 AM
> > > Subject: [llvm-dev] RFC: Insertion of nops for performance
> > > stability
> 
> > > Hi all,

> > > These days I am working on a feature that inserts nops in IA
> > > code generation in order to provide performance improvements and
> > > performance stability. This feature will not affect other
> > > architectures. It will, however, set up an infrastructure for
> > > other architectures to do the same, if ever needed.
> 
> > > Here are some examples of cases in which nops can improve
> > > performance:
> > > 1. DSB (Decoded Stream Buffer) thrashing.
> 
> > > The DSB is a cache for uops that have already been decoded. It is
> > > limited to 3 ways per 32-byte window; 4 or more ways will cause a
> > > performance degradation. This can be avoided by aligning the code
> > > with nops. See Zia Ansari's presentation for more information:
> > > http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf
> 
> > > 2. Two or more branches with the same target in a single
> > > instruction window may cause poor branch prediction.
> 
> > > Our recommendation to programmers is to keep two branches out of
> > > the same 16-byte chunk if they both go to the same target. From
> > > the optimization manual:
> 
> > > “Avoid putting two conditional branch instructions in a loop so
> > > that both have the same branch target address and, at the same
> > > time, belong to (i.e. have their last bytes’ addresses within)
> > > the same 16-byte aligned code block.”
> 
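The condition in that guideline can be sketched as a small helper (hypothetical code, for illustration only): two branches violate it exactly when the addresses of their last bytes land in the same 16-byte aligned chunk.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the last bytes of two branches fall into the same
// 16-byte aligned code block, the situation the guideline warns about.
bool sameAlignedChunk(uint64_t StartA, unsigned SizeA,
                      uint64_t StartB, unsigned SizeB) {
  uint64_t LastA = StartA + SizeA - 1;
  uint64_t LastB = StartB + SizeB - 1;
  return (LastA / 16) == (LastB / 16);
}
```

In the example later in this mail, the je at 0x20 and the jg at 0x2B (both 2 bytes long) have last bytes 0x21 and 0x2C, both inside the chunk [0x20, 0x2F], so they conflict.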

> > > Let’s look at an example of the second case. Consider the
> > > following code:

> > > int y;
> > > int x;
> > >
> > > int foo(int i, int m, int p, int q, int *p1, int *p2)
> > > {
> > >   if ( i+*p1 == p || i-*p2 > m || i == q ) {
> > >     y++;
> > >     x++;
> > >   }
> > >   return 0;
> > > }

> > > The last two || clauses of the if will be translated into two
> > > conditional jumps to the same target. The generated code will
> > > look like this:

> > > foo:
> > >  0: 48 8b 44 24 28       movq 40(%rsp), %rax
> > >  5: 8b 00                movl (%rax), %eax
> > >  7: 01 c8                addl %ecx, %eax
> > >  9: 44 39 c0             cmpl %r8d, %eax
> > >  c: 75 0f                jne 15 <foo+0x1D>
> > >  e: ff 05 00 00 00 00    incl (%rip)
> > > 14: ff 05 00 00 00 00    incl (%rip)
> > > 1a: 31 c0                xorl %eax, %eax
> > > 1c: c3                   retq
> > > 1d: 44 39 c9             cmpl %r9d, %ecx
> > > 20: 74 ec                je -20 <foo+0xE>
> > > 22: 48 8b 44 24 30       movq 48(%rsp), %rax
> > > 27: 2b 08                subl (%rax), %ecx
> > > 29: 39 d1                cmpl %edx, %ecx
> > > 2b: 7f e1                jg -31 <foo+0xE>
> > > 2d: 31 c0                xorl %eax, %eax
> > > 2f: c3                   retq

> > > Note: the first if-clause jump is the jne instruction at 0x0C,
> > > the second is the jg instruction at 0x2B, and the third is the je
> > > instruction at 0x20. Also note that the jg and the je share a
> > > 16-byte window, which is exactly the situation we wish to avoid
> > > (consider the case in which foo is called from inside a loop;
> > > this will cause a performance penalty).

> > > The generated code after my changes will look like this:

> > > foo:
> > >  0: 48 8b 44 24 28       movq 40(%rsp), %rax
> > >  5: 8b 00                movl (%rax), %eax
> > >  7: 01 c8                addl %ecx, %eax
> > >  9: 44 39 c0             cmpl %r8d, %eax
> > >  c: 75 0f                jne 15 <foo+0x1D>
> > >  e: ff 05 00 00 00 00    incl (%rip)
> > > 14: ff 05 00 00 00 00    incl (%rip)
> > > 1a: 31 c0                xorl %eax, %eax
> > > 1c: c3                   retq
> > > 1d: 44 39 c9             cmpl %r9d, %ecx
> > > 20: 74 ec                je -20 <foo+0xE>
> > > 22: 48 8b 44 24 30       movq 48(%rsp), %rax
> > > 27: 2b 08                subl (%rax), %ecx
> > > 29: 39 d1                cmpl %edx, %ecx
> > > 2b: 0f 1f 40 00          nopl (%rax)
> > > 2f: 7f dd                jg -35 <foo+0xE>
> > > 31: 31 c0                xorl %eax, %eax
> > > 33: c3                   retq

> > > The only change is the nopl at 0x2B, before the jg, which pushes
> > > the jg instruction into the next 16-byte window and thus avoids
> > > the undesired situation. Note that, in this case, in order to
> > > push an instruction into the next window it is sufficient to push
> > > its last byte into the next window.
> 
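The padding amount follows directly from that observation. A hypothetical helper (not the proposed implementation) that computes the minimal number of nop bytes needed to push a branch's last byte into the next 16-byte window:

```cpp
#include <cassert>
#include <cstdint>

// Minimal padding, in bytes, inserted before the branch so that its last
// byte lands at or past the start of the next 16-byte window.
unsigned nopBytesToNextWindow(uint64_t BranchStart, unsigned BranchSize) {
  uint64_t LastByte = BranchStart + BranchSize - 1;
  uint64_t NextWindow = (LastByte / 16 + 1) * 16; // first byte of next window
  return (unsigned)(NextWindow - LastByte);
}
```

For the jg above (start 0x2B, size 2, last byte 0x2C) this yields 4 bytes, matching the 4-byte nopl that was inserted.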

> > > Because these performance nops often rely on the layout of the
> > > code (e.g. the number of instructions in a 16/32-byte window),
> > > which is only known at assembly time, the insertion is divided
> > > into two phases:
> 
> > > 1. During encoding, marking "suspicious" places, i.e. hotspots
> > > that may cause a performance penalty.
> 
> > > 2. During layout, computing the actual number (zero or more) of
> > > nops required to avoid the penalty.

> > > To achieve that, I am introducing a new kind of MCFragment named
> > > MCPerfNopFragment. It will be the connecting link between the two
> > > phases:
> 
> > > 1. During encoding, the object streamer (MCObjectStreamer) will
> > > query the backend about the need for a MCPerfNopFragment before
> > > every encoded instruction. If required, a MCPerfNopFragment will
> > > be created and inserted into the MCSection.
> 
> > > 2. During the assembly phase:
> 
> > > a. When computing the fragment size in
> > > MCAssembler::computeFragmentSize, the assembler will consult the
> > > backend, providing it the layout, in order to determine the
> > > required size of the fragment. Note that the computed size may
> > > change from call to call, similarly to the process of computing
> > > the size of a MCAlignFragment.
> 
> > > b. When writing the fragment, the assembler will again use the
> > > backend to write the required number of nops.
> 
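A toy model of this scheme (invented names; the real MCAssembler/MCFragment interfaces differ) may make the fixed-point nature of step 2a concrete: perf-nop sizes are recomputed from the current layout on every pass until nothing changes, much as alignment fragments are handled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy fragment: instructions have fixed sizes; perf-nop fragments were
// marked during encoding (phase 1) and get their size from layout (phase 2).
struct Fragment {
  bool IsPerfNop;
  unsigned Size;
};

// Hypothetical padding policy for the sketch: make the next fragment
// start on a 16-byte boundary.
static unsigned computePerfNopSize(uint64_t Offset) {
  return (unsigned)((16 - Offset % 16) % 16);
}

// One layout pass; returns true if any perf-nop size changed.
bool layoutOnce(std::vector<Fragment> &Frags) {
  bool Changed = false;
  uint64_t Offset = 0;
  for (Fragment &F : Frags) {
    if (F.IsPerfNop) {
      unsigned NewSize = computePerfNopSize(Offset); // consult the layout
      if (NewSize != F.Size) { F.Size = NewSize; Changed = true; }
    }
    Offset += F.Size;
  }
  return Changed;
}

// Iterate to a fixed point, mirroring the assembler's relaxation loop.
void layout(std::vector<Fragment> &Frags) {
  while (layoutOnce(Frags)) {}
}
```

After layout, writing the section simply emits the computed number of nop bytes wherever a perf-nop fragment sits.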

> > > Comments and questions are welcome.
> 
> > Do you need to insert the nops at such a late stage? We have other
> > components that insert nops during post-RA scheduling, for example.
> > Even later, nops could be inserted during a pre-emit pass (similar
> > to where other backends do branch relaxation/selection). The branch
> > relaxation process also relies on knowing the layout of the code,
> > but it gathers this information from the available information on
> > instruction sizes, block alignments, etc.

> > -Hal
> 
> > > Thanks,
> > > Omer
> 
> > > ---------------------------------------------------------------------
> > > Intel Israel (74) Limited
> > > This e-mail and any attachments may contain confidential material for
> > > the sole use of the intended recipient(s). Any review or distribution
> > > by others is strictly prohibited. If you are not the intended
> > > recipient, please contact the sender and delete all copies.

> > > _______________________________________________
> > > LLVM Developers mailing list
> > > llvm-dev at lists.llvm.org
> > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> > --

> > Hal Finkel
> > Lead, Compiler Technology and Programming Languages
> > Leadership Computing Facility
> > Argonne National Laboratory
> 


-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 

