[llvm-dev] RFC: Insertion of nops for performance stability

Xin Tong via llvm-dev llvm-dev at lists.llvm.org
Wed Oct 4 18:12:27 PDT 2017


Hi Paparo.

I know it has been almost a year, but I would like to know whether you
still plan to contribute this back. We have some performance instability
at Facebook that would benefit from a pass like this.
Thanks.
-Xin

On Mon, Nov 21, 2016 at 11:58 PM, Hal Finkel via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
> ------------------------------
>
> *From: *"Paparo Bivas, Omer" <omer.paparo.bivas at intel.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *llvm-dev at lists.llvm.org
> *Sent: *Monday, November 21, 2016 6:16:35 AM
> *Subject: *RE: [llvm-dev] RFC: Insertion of nops for performance stability
>
> Hi Hal,
>
>
>
> Thanks for the reference. I’ve looked at PPCBranchSelector and the
> PowerPC backend. It is very different from the X86 architecture, and
> unfortunately the way branch relaxation and alignment-related issues are
> handled in PPC cannot be copied to X86. This is because:
>
> 1.      PPC instructions have a fixed length, while X86 instructions are
> of variable length, and their length can change depending on the features
> the target machine has. Not only that, but a relaxed X86 instruction will
> differ in size from its non-relaxed version. The function
> TargetInstrInfo::getInstSizeInBytes is not, and can’t be, implemented for
> X86. It follows that:
>
> 2.      Target addresses of branches cannot be computed before code
> emission in X86. That’s why X86 relies on the layout process in
> MCAssembler, which occurs after the encoding is done and (among other
> things) relaxes instructions that need relaxation.
>
> For the same reasons, the nop insertion process also cannot occur until
> after the encoding phase.
>
> Okay, interesting. Thanks for explaining.
>
>
>
> Regarding splitting blocks of code, that is essentially what I’m
> proposing, but at the MC level (since this is the level I’m operating in).
> When needed, an MCPerfNopFragment will be a separator between two
> MCFragments that contain code (e.g. MCDataFragments). However, an
> MCPerfNopFragment is not an MCAlignFragment, and the block that follows it
> will not (necessarily) be aligned, as that could be wasteful in terms of
> code size. Consider the example I provided: aligning the fragment after
> the nops to 16B would cost one additional byte of nops in code size, and
> aligning it to 32B would cost 17 additional bytes. There is no benefit in
> these extra bytes.
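>
> For concreteness, a quick compile-time check of those byte counts, using
> the offsets from the example further down (the padding there starts at
> offset 0x2b and four nop bytes are enough); this is purely illustrative:
>
> // The 16B/32B cost claim above, checked at compile time.
> constexpr unsigned PadStart = 0x2b; // where the nop padding begins
> constexpr unsigned MinNops  = 4;    // nops needed just to reach the next 16B window
> static_assert(16 - PadStart % 16 - MinNops == 1,  "aligning to 16B wastes 1 extra byte");
> static_assert(32 - PadStart % 32 - MinNops == 17, "aligning to 32B wastes 17 extra bytes");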
>
> Makes sense.
>
> As for filling the slots with other instructions, I do not think it would
> be right, because:
>
> 1.      It would require some sort of branch prediction mechanism that
> works at compile time. Branch prediction is done at run time by the
> machine and involves many parameters; in my opinion, trying to imitate
> such a mechanism will produce either very complex code or inaccurate
> predictions.
>
> 2.      Execution of those other instructions can be much more expensive
> than execution of the nops. This means that a wrong (compile-time) branch
> prediction will result in a performance penalty that may even overshadow
> the benefit of the spacing those instructions provide.
>
> We do have "compile-time branch prediction" in the form of static
> heuristics and, even better when available, profiling data. However,
> speculating in between two branches to the same destination seems
> unlikely, so only the first situation (DSB thrashing) would be relevant.
> There, just picking a different instruction ordering seems like a more
> interesting option than speculation. Inserting nops, however, seems like a
> reasonable general strategy regardless.
>
> 3.      As I mentioned, this process has to be done in the Assembler,
> after all the passes and the code emission. In my opinion, the Assembler
> phase is not a good time to change locations of instructions.
>
> I agree with this. We shouldn't try to reschedule instructions in the
> assembler.
>
>  -Hal
>
>
>
> Thanks,
>
> Omer
>
>
>
> *From:* Hal Finkel [mailto:hfinkel at anl.gov]
> *Sent:* Sunday, November 20, 2016 23:26
> *To:* Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] RFC: Insertion of nops for performance stability
>
>
>
>
> ------------------------------
>
> *From: *"Paparo Bivas, Omer" <omer.paparo.bivas at intel.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *llvm-dev at lists.llvm.org
> *Sent: *Sunday, November 20, 2016 8:08:20 AM
> *Subject: *RE: [llvm-dev] RFC: Insertion of nops for performance stability
>
> Hi Hal,
>
>
>
> A pre-emit pass would indeed be preferable. I originally thought of that,
> too; however, I could not figure out how such a pass could have access to
> information about instruction sizes and block alignments.
>
> I know that for X86, at least, branch relaxation happens during the
> layout phase in the Assembler, which is where I plan to integrate the nop
> insertion, such that the new MCPerfNopFragment will just be yet another
> kind of fragment to handle when laying out all the fragments: laying out a
> relaxable branch means relaxing it if necessary, and laying out an
> MCPerfNopFragment will mean computing the number of nops it will contain.
>
>
>
> Can you please refer me to a pre-emit pass that does something similar?
>
> Just as a quick example, the PowerPC branch relaxation pass,
> lib/Target/PowerPC/PPCBranchSelector.cpp, uses TII->getInstSizeInBytes to
> determine instruction sizes, and it calculates the alignment-based padding
> that will be inserted by combining that instruction-size information with
> the alignment of each block (MBB.getAlignment()) and of the containing
> function (MBB.getParent()->getAlignment()). Perhaps I'm missing some
> complexity, but I'd think that nops could be inserted using a very similar
> algorithm.
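>
> Not the actual PPC code, but the shape of that bookkeeping is roughly the
> following (a standalone sketch: "Block" and its fields stand in for
> MachineBasicBlock and the TII size hook, so treat the names as
> placeholders):
>
> #include <cstdint>
> #include <vector>
>
> // Walk the blocks, accumulate instruction sizes, and account for each
> // block's alignment, much as PPCBranchSelector does with
> // TII->getInstSizeInBytes() and MBB.getAlignment().
> struct Block {
>   unsigned alignLog2;               // block alignment as log2 (0 = none)
>   std::vector<unsigned> instSizes;  // per-instruction sizes in bytes
> };
>
> static uint64_t alignTo(uint64_t offset, unsigned alignLog2) {
>   uint64_t align = 1ull << alignLog2;
>   return (offset + align - 1) & ~(align - 1);
> }
>
> // Returns the starting offset of every block within the function.
> std::vector<uint64_t> computeBlockOffsets(const std::vector<Block> &blocks) {
>   std::vector<uint64_t> offsets;
>   uint64_t offset = 0;
>   for (const Block &b : blocks) {
>     offset = alignTo(offset, b.alignLog2);  // padding the emitter would add
>     offsets.push_back(offset);
>     for (unsigned size : b.instSizes)
>       offset += size;
>   }
>   return offsets;
> }
>
> With per-block offsets like these (plus the containing function's
> alignment), a pre-emit pass could in principle decide where nops are
> needed; the catch, as discussed above, is that exact instruction sizes are
> not available this way on X86.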
>
> Two other thoughts:
>
>  1. You can "insert nops" in some (many?) relevant cases by splitting the
> block and adjusting the block's alignment.
>  2. You might want to fill the slots with other instructions which can be
> speculated from successor blocks.
>
> Thanks again,
> Hal
>
>
>
> Thanks,
>
> Omer
>
>
>
> *From:* Hal Finkel [mailto:hfinkel at anl.gov]
> *Sent:* Friday, November 18, 2016 20:57
> *To:* Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] RFC: Insertion of nops for performance stability
>
>
>
>
> ------------------------------
>
> *From: *"Paparo Bivas, Omer via llvm-dev" <llvm-dev at lists.llvm.org>
> *To: *llvm-dev at lists.llvm.org
> *Sent: *Thursday, November 17, 2016 3:55:43 AM
> *Subject: *[llvm-dev] RFC: Insertion of nops for performance stability
>
> Hi all,
>
>
>
> These days I am working on a feature for IA code generation that inserts
> nops in order to provide performance improvements and performance
> stability. This feature will not affect other architectures; it will,
> however, set up an infrastructure for other architectures to do the same,
> if ever needed.
>
>
>
> Here are some examples of cases in which nops can improve performance:
>
> 1.      DSB (Decoded Stream Buffer) Thrashing.
>
> DSB is a cache for uops that have been decoded. It is limited to 3 ways
> per 32B window; 4 or more ways will cause a performance degradation. This
> can be avoided by aligning with nops. See Zia Ansari's presentation for
> more information:
> http://schd.ws/hosted_files/llvmdevelopersmeetingbay2016/d9/LLVM-Conference-US-2016-Code-Alignment.pdf
>
> 2.      Two or more branches with the same target in a single instruction
> window may cause poor branch prediction.
>
> Our recommendation to programmers is to try to keep two branches out of
> the same 16B chunk if they both go to the same target. From the manual:
> “Avoid putting two conditional branch instructions in a loop so that both
> have the same branch target address and, at the same time, belong to (i.e.
> have their last bytes’ addresses within) the same 16-byte aligned code
> block.”
>
>
>
> Let’s see an example for the second case. Consider the following code:
>
>
>
> int y;
> int x;
>
> int foo(int i, int m, int p, int q, int *p1, int *p2)
> {
>   if (i+*p1 == p || i-*p2 > m || i == q) {
>     y++;
>     x++;
>   }
>
>   return 0;
> }
>
>
>
> The last two || clauses of the if will be translated into two conditional
> jumps to the same target. The generated code will look like this:
>
>
>
> foo:
>        0:  48 8b 44 24 28        movq 40(%rsp), %rax
>        5:  8b 00                 movl (%rax), %eax
>        7:  01 c8                 addl %ecx, %eax
>        9:  44 39 c0              cmpl %r8d, %eax
>        c:  75 0f                 jne  15 <foo+0x1D>
>        e:  ff 05 00 00 00 00     incl (%rip)
>       14:  ff 05 00 00 00 00     incl (%rip)
>       1a:  31 c0                 xorl %eax, %eax
>       1c:  c3                    retq
>       1d:  44 39 c9              cmpl %r9d, %ecx
>       20:  74 ec                 je   -20 <foo+0xE>
>       22:  48 8b 44 24 30        movq 48(%rsp), %rax
>       27:  2b 08                 subl (%rax), %ecx
>       29:  39 d1                 cmpl %edx, %ecx
>       2b:  7f e1                 jg   -31 <foo+0xE>
>       2d:  31 c0                 xorl %eax, %eax
>       2f:  c3                    retq
>
>
>
> Note: the first if-clause jump is the jne instruction at 0x0C, the second
> if-clause jump is the jg instruction at 0x2B, and the third if-clause jump
> is the je instruction at 0x20. Also note that the jg and je share a
> 16-byte window, which is exactly the situation we wish to avoid (consider
> the case in which foo is called from inside a loop; this will cause a
> performance penalty).
>
>
>
> The generated code after my changes will look like so:
>
>
>
> foo:
>        0:  48 8b 44 24 28        movq 40(%rsp), %rax
>        5:  8b 00                 movl (%rax), %eax
>        7:  01 c8                 addl %ecx, %eax
>        9:  44 39 c0              cmpl %r8d, %eax
>        c:  75 0f                 jne  15 <foo+0x1D>
>        e:  ff 05 00 00 00 00     incl (%rip)
>       14:  ff 05 00 00 00 00     incl (%rip)
>       1a:  31 c0                 xorl %eax, %eax
>       1c:  c3                    retq
>       1d:  44 39 c9              cmpl %r9d, %ecx
>       20:  74 ec                 je   -20 <foo+0xE>
>       22:  48 8b 44 24 30        movq 48(%rsp), %rax
>       27:  2b 08                 subl (%rax), %ecx
>       29:  39 d1                 cmpl %edx, %ecx
>       2b:  0f 1f 40 00           nopl (%rax)
>       2f:  7f dd                 jg   -35 <foo+0xE>
>       31:  31 c0                 xorl %eax, %eax
>       33:  c3                    retq
>
>
>
> The only change is the nopl at 0x2B, before the jg, which pushes the jg
> instruction into the next 16-byte window and thus avoids the undesired
> situation. Note that, in this case, in order to push an instruction into
> the next window it is sufficient to push its last byte into the next
> window.
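>
> The padding computation this implies is small. As a sketch (assuming the
> offset of the branch's last byte is known once the code is laid out;
> nopsToNextWindow is just an illustrative name):
>
> #include <cstdint>
>
> // Minimal number of padding bytes needed ahead of a branch so that its
> // last byte lands in the next 16-byte window. In the example above the
> // jg's last byte is at 0x2c, so 16 - (0x2c % 16) = 4 nop bytes, matching
> // the 4-byte nopl that was inserted.
> uint64_t nopsToNextWindow(uint64_t lastByteOffset) {
>   return 16 - (lastByteOffset % 16);
> }
>
> A real pass would only apply this when another branch to the same target
> has its last byte in the same 16-byte window, as described above.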
>
>
>
>
>
> Due to the nature of these performance nops, which often rely on the
> layout of the code (e.g. the number of instructions in a 16/32B window)
> that is only known at assembly time, the insertion is divided into two
> phases:
>
> 1.      Marking "suspicious" places during encoding, i.e. hotspots that
> may cause a performance penalty.
>
> 2.      Computing the actual number (zero or greater) of nops required to
> avoid a performance penalty.
>
>
>
> In order to achieve that, I am introducing a new kind of MCFragment named
> MCPerfNopFragment. It will be the connecting link between the two phases:
>
> 1.      During encoding, the object streamer (MCObjectStreamer) will
> query the backend about the need for an MCPerfNopFragment before every
> encoded instruction. If required, an MCPerfNopFragment will be created and
> inserted into the MCSection.
>
> 2.      During the assembly phase:
>
> a.      When computing the fragment size in MCAssembler::computeFragmentSize,
> the Assembler will consult the backend and provide it with the layout in
> order to determine the required size of the fragment. Note that the
> computed size may change from call to call, similarly to the process of
> computing the size of an MCAlignFragment.
>
> b.      When writing the fragment, the assembler will again use the
> backend in order to write the required number of nops (a rough sketch of
> this flow follows below).
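>
> As a very rough standalone model of that flow (plain structs here; the
> fragment type and the two hooks are placeholders for the real
> streamer/backend interaction, not actual MC API):
>
> #include <cstdint>
> #include <vector>
>
> enum class FragKind { Data, PerfNop };
>
> struct Frag {
>   FragKind kind;
>   uint64_t size;  // bytes; for PerfNop this is filled in during layout
> };
>
> // Phase 1 (encoding): before each encoded instruction, ask whether this
> // spot is "suspicious"; if so, insert a placeholder fragment.
> void emitInstruction(std::vector<Frag> &section, uint64_t instSize,
>                      bool wantsPerfNopBefore) {
>   if (wantsPerfNopBefore)
>     section.push_back({FragKind::PerfNop, 0});
>   section.push_back({FragKind::Data, instSize});
> }
>
> // Phase 2 (layout): with offsets known, let a backend callback decide how
> // many nop bytes each placeholder needs. Iterate until sizes settle, since
> // growing one fragment shifts the offsets of everything after it (a real
> // implementation would bound the iteration).
> template <typename SizeFn>  // SizeFn: uint64_t(uint64_t fragmentOffset)
> void layoutPerfNops(std::vector<Frag> &section, SizeFn computePerfNopSize) {
>   bool changed = true;
>   while (changed) {
>     changed = false;
>     uint64_t offset = 0;
>     for (Frag &f : section) {
>       if (f.kind == FragKind::PerfNop) {
>         uint64_t newSize = computePerfNopSize(offset);
>         changed |= (newSize != f.size);
>         f.size = newSize;
>       }
>       offset += f.size;
>     }
>   }
> }
>
> With a computePerfNopSize that implements, say, the 16-byte-window rule
> from the example above, each placeholder either collapses to zero bytes or
> becomes the few nops needed at that spot.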
>
>
>
> Comments and questions are welcome.
>
> Do you need to insert the nops at such a late stage? We have other
> components that insert nops during post-RA scheduling, for example. Even
> later, nops could be inserted during a pre-emit pass (similar to where
> other backends do branch relaxation/selection). The branch relaxation
> process also relies on knowing the layout of the code, but gathers this
> information from the available information on instruction sizes, block
> alignments, etc.
>
>  -Hal
>
>
>
> Thanks,
>
> Omer
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> --
>
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
>
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>


-- 
Software Engineer - Compiler Toolchain
Employee of Facebook Inc.