[PATCH] D70157: Align branches within 32-Byte boundary

Fri Nov 15 08:00:56 PST 2019

davezarzycki added a comment.

In D70157#1747640 <https://reviews.llvm.org/D70157#1747640>, @jyknight wrote:

> In D70157#1747551 <https://reviews.llvm.org/D70157#1747551>, @davezarzycki wrote:
>
> > In D70157#1747510 <https://reviews.llvm.org/D70157#1747510>, @xbolva00 wrote:
> >
> > > > Even though core2 isn't affected by the erratum, core2 code can run on CPUs that do have the bug (and core2 is a popular target for code that needs to run "everywhere"), therefore all target CPUs that predate a hardware fix really
> > >
> > > So perf of all code/generic codegen will go down. This is not very acceptable. -1 for making this enabled by default for generic codegen.
> >
> >
> > I'm okay with this not being the default if we document somewhere that LLVM assumes that machines have up to date microcode installed.
>
>
> So, my understanding is:
>
> 1. There is a CPU bug, that in very rare but unspecified circumstances, can cause branches of all kinds (conditional-jump, fused-conditional-jump, unconditional jump, call, return, indirect jumps and calls?) to go to the wrong place, if the instruction crosses a 64-byte cacheline boundary.
> 2. In order to fix that bug, the new CPU firmware disables the faster decoded-icache mechanism, falling back to the legacy decoder for any 32-byte block of code which ends with one of those jumps (because it works in 32-byte blocks, not 64-byte blocks). Falling back to the legacy decoder can have a sometimes-significant performance cost. Of course, this is not the _only_ reason that the CPU falls back to the legacy decoder, but it's a significant new one.
>
>   This patch, then, spends some memory and icache space (and potentially decode-time) in order to try to get back the performance lost by the above fix. This can be worthwhile because if you can cause there not to be any 32-byte code sections that have a branch at the end, then the fallback to the legacy decoder won't trigger, and the performance will hopefully return to something comparable to the initial state. Unfortunately, while this ought to increase performance impact on all of the affected processors, it will have only negative effects on unaffected processors, since then you're spending memory and icache space, and not getting anything back in return.
>
>   Similarly, if you're doing this extra alignment on code that isn't executed repeatedly out of the icache, it's also going to be useless.
>
>   So, I'd say it's _not at all_ clear that it should be enabled by default.
>
>   I'm also a bit confused as to why the default is set as it is. Why is the default only fused+jcc+jmp, not all branch types?

Strictly speaking, Intel doesn't say what happens if this bug occurs, nor do they say that branches themselves are the problem, just "branch instruction bytes" and that "unpredictable system behavior may occur". Compare and contrast with the subsequent erratum where they explicitly say that x86, AVX, and integer divide instructions (not instruction bytes) can have unpredictable behavior. See SKX102 and SKX103: https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-spec-update.html

But we're digressing. I think the default policy discussion might be better had on llvm-dev than a Phab review.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157