[llvm-dev] RFC: Insertion of nops for performance stability

Paparo Bivas, Omer via llvm-dev llvm-dev at lists.llvm.org
Sun Nov 20 05:54:46 PST 2016


Hi Stephen,

I'm actually not sure myself whether it's better to insert extra instructions than to increase the length of existing ones. However, with the current design, at least, lengthening existing instructions would complicate things.
In my solution the insertion of the nops is integrated into the layout process in the Assembler, in which every MCFragment is processed in turn and its layout is determined. When it's time for a MCFragment to be laid out we can rely on the layout of the MCFragments prior to it, but we can't rely on the layout of the MCFragments that follow it, since they haven't been laid out yet and thus their layout is not yet known.
For a MCFragment of the new kind MCPerfNopFragment, "laying out" means computing the number of nops it will contain. Luckily enough, none of the cases that need nops require later instruction data, only earlier instruction data, which is already set and laid out (keep in mind that a MCPerfNopFragment will always appear right before a "potentially hazardous" instruction and will be aware of that instruction).
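
To make the one-directional dependency concrete, here is a minimal Python sketch of such a single-pass layout. It is an illustration only, not the actual patch: the Fragment and PerfNopFragment classes and the layout function are hypothetical stand-ins, and the rule that a branch belongs to the 16-byte window containing its last byte is an assumption made for the example. The sizes are taken from the foo listing quoted further down.

```python
from dataclasses import dataclass

WINDOW = 16  # window size in bytes, as in the quoted RFC text


@dataclass
class Fragment:
    """Hypothetical stand-in for an MCFragment."""
    name: str
    size: int = 0
    offset: int = -1  # unknown until the fragment is laid out


@dataclass
class PerfNopFragment(Fragment):
    """Hypothetical stand-in for MCPerfNopFragment."""
    hazard_size: int = 0  # size of the instruction right after the nops
    avoid: str = ""       # earlier fragment the hazard must be kept apart from


def window_of_last_byte(offset, size):
    # Assumed rule: an instruction belongs to the window of its last byte.
    return (offset + size - 1) // WINDOW


def layout(fragments):
    """Lay fragments out front to back, reading only earlier offsets."""
    laid_out = {}
    offset = 0
    for frag in fragments:
        frag.offset = offset
        if isinstance(frag, PerfNopFragment):
            earlier = laid_out[frag.avoid]  # already final: it precedes us
            earlier_window = window_of_last_byte(earlier.offset, earlier.size)
            first_ok_last_byte = (earlier_window + 1) * WINDOW
            hazard_last_byte = offset + frag.hazard_size - 1
            # Number of nops needed to push the hazard into the next window.
            frag.size = max(0, first_ok_last_byte - hazard_last_byte)
        laid_out[frag.name] = frag
        offset += frag.size
    return fragments


frags = layout([
    Fragment("prologue", 0x20),     # everything before the je at 0x20
    Fragment("je", 2),
    Fragment("mov_sub_cmp", 9),     # the 5 + 2 + 2 bytes between je and jg
    PerfNopFragment("nops", hazard_size=2, avoid="je"),
    Fragment("jg", 2),
])
for f in frags:
    print(f"{f.name:12s} offset=0x{f.offset:02x} size={f.size}")
```

Under these assumptions the nop fragment gets 4 bytes and jg moves from 0x2b to 0x2f, and no fragment after the nops had to be consulted to decide that.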
In order to change instructions to another encoding of themselves that takes up more bytes, we would have to do one of two things:
1. Rely on later instruction data when laying out the fragments that contain those instructions. As I mentioned, we can't do this accurately, as the layout hasn't been set for the fragments that follow.
2. Retroactively change the layout of the fragments that contain those instructions once we encounter a MCPerfNopFragment. This can cause complications, as it requires changing the layout while performing a layout. Some fragments in between the MCPerfNopFragment currently being computed and the MCFragment whose layout we wish to change may rely on the already determined layout. MCAlignFragments, for example, strongly rely on the layout of their predecessors. The entire fragment layout process would then become very complex.
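
The problem in option 2 can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Assembler's real code: fragments are plain (name, kind, value) tuples, the layout function is hypothetical, and the sizes and the 8-byte alignment are made up for the example. It shows an alignment fragment sitting between the widened instruction and the instruction we want to move.

```python
def layout(fragments):
    """Assign offsets front to back.

    Each fragment is (name, kind, value): for kind "data", value is the size
    in bytes; for kind "align", value is the alignment and the padding size
    is derived from the offset reached by the predecessors.
    """
    offsets, sizes, offset = {}, {}, 0
    for name, kind, value in fragments:
        size = (-offset) % value if kind == "align" else value
        offsets[name], sizes[name] = offset, size
        offset += size
    return offsets, sizes


def fragments_with_je_size(je_size):
    # Made-up fragment sequence with an alignment request before jg.
    return [
        ("code", "data", 8),
        ("je", "data", je_size),
        ("align", "align", 8),
        ("jg", "data", 2),
    ]


before_offsets, before_sizes = layout(fragments_with_je_size(2))  # je rel8
after_offsets, after_sizes = layout(fragments_with_je_size(6))    # je rel32

print("padding:", before_sizes["align"], "->", after_sizes["align"])
print("jg at:  ", before_offsets["jg"], "->", after_offsets["jg"])
```

In this toy case the 4 bytes added to je are absorbed by the alignment padding, which shrinks from 6 bytes to 2, and jg stays at offset 16. A retroactive change therefore cannot just shift everything after it by a fixed amount; every dependent fragment in between has to be laid out again.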

Please tell me if you have any thoughts on how to get around these issues.

Thanks,
Omer

-----Original Message-----
From: Stephen Checkoway [mailto:s at pahtak.org] 
Sent: Thursday, November 17, 2016 18:51
To: Paparo Bivas, Omer <omer.paparo.bivas at intel.com>
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] RFC: Insertion of nops for performance stability

Hi Omer,

> On Nov 17, 2016, at 03:55, Paparo Bivas, Omer via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> The two last || clauses of the if will be translated to two conditional jumps to the same target. The generated code will look like so:
>  
> foo:
>        0:  48 8b 44 24 28 movq 40(%rsp), %rax
>        5:  8b 00      movl (%rax), %eax
>        7:  01 c8      addl %ecx, %eax
>        9:  44 39 c0   cmpl %r8d, %eax
>        c:  75 0f      jne  15 <foo+0x1D>
>        e:  ff 05 00 00 00 00     incl (%rip)
>       14:  ff 05 00 00 00 00     incl (%rip)
>       1a:  31 c0      xorl %eax, %eax
>       1c:  c3   retq
>       1d:  44 39 c9   cmpl %r9d, %ecx
>       20:  74 ec      je   -20 <foo+0xE>
>       22:  48 8b 44 24 30 movq 48(%rsp), %rax
>       27:  2b 08      subl (%rax), %ecx
>       29:  39 d1      cmpl %edx, %ecx
>       2b:  7f e1      jg   -31 <foo+0xE>
>       2d:  31 c0      xorl %eax, %eax
>       2f:  c3   retq
>  
> Note: the first if clause jump is the jne instruction at 0x0C, the second if clause jump is the jg instruction at 0x2B and the third if clause jump is the je instruction at 0x20. Also note that the jg and je share a 16-byte window, which is exactly the situation we wish to avoid (consider the case in which foo is called from inside a loop; this will cause a performance penalty).

Rather than inserting a nop, would it be better to change the instruction encoding to use a different form? The JE at offset 0x20 could use the JE rel32 (0f 84 0f 00 00 00) form. Similarly, the MOV at offset 0x22 could use MOV r64, r/m64 with a 32-bit displacement (48 8b 84 24 30 00 00 00). The latter adds 3 bytes, which is insufficient in this case, but the former adds the required 4 bytes.
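
The byte counts can be checked with a short Python sketch. The instruction sizes come from the listing above and from the two longer encodings just mentioned; the helper names are hypothetical, and the rule that a branch is attributed to the 16-byte window holding its last byte is an assumption chosen because it reproduces the "3 is not enough, 4 is" arithmetic, not something stated in the thread.

```python
WINDOW = 16  # window size in bytes

JE_OFFSET = 0x20
JE_REL8, JE_REL32 = 2, 6      # 74 cb  vs  0f 84 cd
MOV_DISP8, MOV_DISP32 = 5, 8  # 48 8b 44 24 30  vs  48 8b 84 24 30 00 00 00
SUB_SIZE = CMP_SIZE = JG_SIZE = 2


def last_byte_window(offset, size):
    # Assumed rule: an instruction belongs to the window of its last byte.
    return (offset + size - 1) // WINDOW


def branch_windows(je_size, mov_size):
    """Return (je window, jg window) for the chosen je and mov encodings."""
    jg_offset = JE_OFFSET + je_size + mov_size + SUB_SIZE + CMP_SIZE
    return (last_byte_window(JE_OFFSET, je_size),
            last_byte_window(jg_offset, JG_SIZE))


original = branch_windows(JE_REL8, MOV_DISP8)
longer_mov = branch_windows(JE_REL8, MOV_DISP32)
longer_je = branch_windows(JE_REL32, MOV_DISP8)

for label, (je_w, jg_w) in [("original", original),
                            ("mov disp32 (+3)", longer_mov),
                            ("je rel32 (+4)", longer_je)]:
    print(f"{label:16s} je window={je_w} jg window={jg_w} shared={je_w == jg_w}")
```

With these assumptions, jg still ends inside the je's window after the 3-byte MOV change (last byte at 0x2f), while the 4-byte JE change pushes jg's last byte to 0x30, into the next window.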

I have no idea if it's better to insert extra instructions rather than increase the length of existing ones, but my intuition is that it's better to decode and retire fewer instructions. I'd assume the same is true when trying to align basic blocks.


-- 
Stephen Checkoway




