[PATCH] D70157: Align branches within 32-Byte boundary

Fri Nov 15 07:24:44 PST 2019

jyknight added a comment.

In D70157#1747551 <https://reviews.llvm.org/D70157#1747551>, @davezarzycki wrote:

> In D70157#1747510 <https://reviews.llvm.org/D70157#1747510>, @xbolva00 wrote:
>
> > > Even though core2 isn't affected by the erratum, core2 code can run on CPUs that do have the bug (and core2 is a popular target for code that needs to run "everywhere"), therefore all target CPUs that predate a hardware fix really
> >
> > So perf of all code/generic codegen will go down. This is not very acceptable. -1 for making this enabled by default for generic codegen.
>
>
> I'm okay with this not being the default if we document somewhere that LLVM assumes that machines have up to date microcode installed.

So, my understanding is:

1. There is a CPU bug, that in very rare but unspecified circumstances, can cause branches of all kinds (conditional-jump, fused-conditional-jump, unconditional jump, call, return, indirect jumps and calls?) to go to the wrong place, if the instruction crosses a 64-byte cacheline boundary.
2. In order to fix that bug, the new CPU firmware disables the faster decoded-icache mechanism, falling back to the legacy decoder for any 32-byte block of code which ends with one of those jumps (because it works in 32-byte blocks, not 64-byte blocks). Falling back to the legacy decoder can have a sometimes-significant performance cost. Of course, this is not the _only_ reason that the CPU falls back to the legacy decoder, but it's a significant new one.

This patch, then, spends some memory and icache space (and potentially decode-time) in order to try to get back the performance lost by the above fix. This can be worthwhile because if you can cause there not to be any 32-byte code sections that have a branch at the end, then the fallback to the legacy decoder won't trigger, and the performance will hopefully return to something comparable to the initial state. Unfortunately, while this ought to increase performance impact on all of the affected processors, it will have only negative effects on unaffected processors, since then you're spending memory and icache space, and not getting anything back in return.

Similarly, if you're doing this extra alignment on code that isn't executed repeatedly out of the icache, it's also going to be useless.

So, I'd say it's _not at all_ clear that it should be enabled by default.

I'm also a bit confused as to why the default is set as it is. Why is the default only fused+jcc+jmp, not all branch types?

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157