[llvm-dev] RFC: Insertion of nops for performance stability

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Fri Nov 18 10:57:03 PST 2016


----- Original Message -----

> From: "Paparo Bivas, Omer via llvm-dev" <llvm-dev at lists.llvm.org>
> To: llvm-dev at lists.llvm.org
> Sent: Thursday, November 17, 2016 3:55:43 AM
> Subject: [llvm-dev] RFC: Insertion of nops for performance stability

> Hi all,

> I am currently working on a feature that inserts nops during IA
> (Intel Architecture) code generation in order to improve performance
> and performance stability. This feature will not affect other
> architectures. It will, however, set up an infrastructure that other
> architectures can use to do the same, if ever needed.

> Here are some examples of cases in which nops can improve
> performance:
> 1. DSB (Decoded Stream Buffer) thrashing.
> The DSB is a cache for decoded uops. It is limited to 3 ways per
> 32-byte window; mapping 4 or more ways to the same window causes a
> performance degradation. This can be avoided by aligning the code
> with nops. See Zia Ansari's presentation for more information:
> http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf
> 2. Two or more branches with the same target in a single instruction
> window may cause poor branch prediction.
> Our recommendation to programmers is to keep two branches out of the
> same 16-byte chunk if they both go to the same target. From the
> manual:
> “Avoid putting two conditional branch instructions in a loop so that
> both have the same branch target address and, at the same time,
> belong to (i.e. have their last bytes’ addresses within) the same
> 16-byte aligned code block.”

> Let’s look at an example of the second case. Consider the following
> code:

> int y;
> int x;

> int foo(int i, int m, int p, int q, int *p1, int *p2)
> {
>   if (i + *p1 == p || i - *p2 > m || i == q) {
>     y++;
>     x++;
>   }

>   return 0;
> }

> The last two || clauses of the if will be translated into two
> conditional jumps to the same target. The generated code will look
> like this:

> foo:
> 0: 48 8b 44 24 28 movq 40(%rsp), %rax
> 5: 8b 00 movl (%rax), %eax
> 7: 01 c8 addl %ecx, %eax
> 9: 44 39 c0 cmpl %r8d, %eax
> c: 75 0f jne 15 <foo+0x1D>
> e: ff 05 00 00 00 00 incl (%rip)
> 14: ff 05 00 00 00 00 incl (%rip)
> 1a: 31 c0 xorl %eax, %eax
> 1c: c3 retq
> 1d: 44 39 c9 cmpl %r9d, %ecx
> 20: 74 ec je -20 <foo+0xE>
> 22: 48 8b 44 24 30 movq 48(%rsp), %rax
> 27: 2b 08 subl (%rax), %ecx
> 29: 39 d1 cmpl %edx, %ecx
> 2b: 7f e1 jg -31 <foo+0xE>
> 2d: 31 c0 xorl %eax, %eax
> 2f: c3 retq

> Note: the jump for the first if clause is the jne instruction at
> 0x0C, the jump for the second clause is the jg at 0x2B, and the jump
> for the third clause is the je at 0x20. Also note that the jg and je
> share a 16-byte window, which is exactly the situation we wish to
> avoid (consider the case in which foo is called from inside a loop;
> this will cause a performance penalty).

> The generated code after my changes will look like this:

> foo:
> 0: 48 8b 44 24 28 movq 40(%rsp), %rax
> 5: 8b 00 movl (%rax), %eax
> 7: 01 c8 addl %ecx, %eax
> 9: 44 39 c0 cmpl %r8d, %eax
> c: 75 0f jne 15 <foo+0x1D>
> e: ff 05 00 00 00 00 incl (%rip)
> 14: ff 05 00 00 00 00 incl (%rip)
> 1a: 31 c0 xorl %eax, %eax
> 1c: c3 retq
> 1d: 44 39 c9 cmpl %r9d, %ecx
> 20: 74 ec je -20 <foo+0xE>
> 22: 48 8b 44 24 30 movq 48(%rsp), %rax
> 27: 2b 08 subl (%rax), %ecx
> 29: 39 d1 cmpl %edx, %ecx
> 2b: 0f 1f 40 00 nopl (%rax)
> 2f: 7f dd jg -35 <foo+0xE>
> 31: 31 c0 xorl %eax, %eax
> 33: c3 retq

> The only change is the nopl at 0x2B, just before the jg, which pushes
> the jg instruction into the next 16-byte window and thus avoids the
> undesired situation. Note that, in this case, it is sufficient to
> push only the instruction's last byte into the next window.

> Because these performance nops often depend on the layout of the code
> (e.g. the number of instructions in a 16/32-byte window), which is
> only known at assembly time, the insertion is divided into two
> phases:
> 1. During encoding, marking "suspicious" places, i.e. hotspots that
> may cause a performance penalty.
> 2. During assembly, computing the actual number (zero or greater) of
> nops required to avoid the penalty.

> In order to achieve that, I am introducing a new kind of MCFragment
> named MCPerfNopFragment. It will be the connecting link between the
> two phases:
> 1. During encoding, the object streamer (MCObjectStreamer) will query
> the backend about whether an MCPerfNopFragment is needed before each
> encoded instruction. If one is required, an MCPerfNopFragment will be
> created and inserted into the MCSection.
> 2. During the assembly phase:
> a. When computing the fragment size in
> MCAssembler::computeFragmentSize, the assembler will consult the
> backend, providing it the layout, in order to determine the required
> size of the fragment. Note that the computed size may change from
> call to call, similarly to the way the size of an MCAlignFragment is
> computed.
> b. When writing the fragment, the assembler will again use the
> backend to write the required number of nops.
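
> To make the division of work concrete, here is a minimal,
> self-contained sketch of the two backend hooks described above. The
> hook names (needsPerfNopFragment, computePerfNopSize) and the
> stand-in structs are illustrative only, not actual or proposed LLVM
> API, and the heuristic shown covers only the same-target-branch case
> from the example:

> #include <cstdint>
>
> // Stand-ins for the real MC classes, so the sketch compiles on its own.
> struct InstStub {
>   bool IsCondBranch = false;   // would come from the instruction descriptor
> };
> struct LayoutStub {
>   uint64_t Offset = 0;         // offset the layout assigns to the fragment
> };
>
> struct PerfNopBackendSketch {
>   // Phase 1 (encoding): mark a "suspicious" place. Here every conditional
>   // branch is a candidate, since its final placement is unknown until layout.
>   bool needsPerfNopFragment(const InstStub &Inst) const {
>     return Inst.IsCondBranch;
>   }
>
>   // Phase 2 (layout): number of nop bytes needed so that the last byte of
>   // the following branch lands in the next 16-byte window. Whether the
>   // branch really conflicts with another branch to the same target would
>   // itself be derived from the layout; here it is passed as a flag to keep
>   // the sketch short.
>   uint64_t computePerfNopSize(const LayoutStub &Layout, uint64_t BranchSize,
>                               bool ConflictsInWindow) const {
>     if (!ConflictsInWindow)
>       return 0;
>     uint64_t LastByte = Layout.Offset + BranchSize - 1;
>     uint64_t NextWindow = (LastByte / 16 + 1) * 16;
>     return NextWindow - LastByte;   // minimal push into the next window
>   }
> };

> For the jg in the example (2 bytes, fragment offset 0x2B, so its last
> byte is at 0x2C), this returns 0x30 - 0x2C = 4, matching the 4-byte
> nopl in the second listing.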

> Comments and questions are welcome.
Do you need to insert the nops at such a late stage? We have other components that insert nops during post-RA scheduling, for example. Even later, nops could be inserted during a pre-emit pass (similar to where other backends do branch relaxation/selection). The branch relaxation process also relies on knowing the layout of the code, but it gathers this from the already-available information on instruction sizes, block alignments, etc.
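
As a rough illustration of the kind of bookkeeping such a pre-emit pass could do, here is a self-contained sketch (made-up names, not actual LLVM code): walk the blocks, derive byte offsets from the known instruction sizes and block alignments, and flag two same-target conditional branches whose last bytes land in the same 16-byte window.

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct InstRec {
  unsigned Size = 0;         // byte size, as the target would report it
  bool IsCondBranch = false;
  int TargetBlock = -1;      // index of the target block, if a branch
};

struct BlockRec {
  unsigned Align = 1;        // block alignment in bytes
  std::vector<InstRec> Insts;
};

// Count pairs of same-target conditional branches whose last bytes share a
// 16-byte window; a real pass would insert nops instead of just counting.
static unsigned countSameWindowBranches(const std::vector<BlockRec> &Blocks) {
  unsigned Conflicts = 0;
  uint64_t Offset = 0;
  std::map<std::pair<uint64_t, int>, unsigned> SeenInWindow;
  for (const BlockRec &BB : Blocks) {
    Offset = (Offset + BB.Align - 1) / BB.Align * BB.Align;  // block alignment
    for (const InstRec &I : BB.Insts) {
      uint64_t LastByte = Offset + I.Size - 1;
      if (I.IsCondBranch && SeenInWindow[{LastByte / 16, I.TargetBlock}]++)
        ++Conflicts;
      Offset += I.Size;
    }
  }
  return Conflicts;
}

A real pass would presumably insert the nops at the flagged spots rather than count them, and would need to redo the bookkeeping (or iterate), since inserted nops shift everything that follows.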

-Hal 

> Thanks,
> Omer
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 