[PATCH] D70157: Align branches within 32-Byte boundary

Thu Nov 21 15:43:12 PST 2019

craig.topper added a comment.

In D70157#1755927 <https://reviews.llvm.org/D70157#1755927>, @jyknight wrote:

> Thanks for the comments, they help a little. But it's still somewhat confusing, so let me write down what seems to be happening:
>
> - Before emitting every instruction, a new MCMachineDependentFragment is now emitted, of one of the multiple types:
>   - For most instructions, that'll be BranchPrefix.
>   - For things that need branch-alignment, it'll be BranchPadding, unless there's a fused conditional before, in which case it's BranchSplit
>   - For fused conditionals, it'll be FusedJccPadding.
> - After emitting an instruction that needs branch-alignment, all of those previously-emitted MCMachineDependentFragment are updated to point to the branch's fragment.
> - Thus, every MCDataFragment now only contains a single instruction (this property is depended upon for getInstSize, at least).
>
>   All the MCMachineDependentFragments in a region bounded by a branch at the end and either a branch or a fragment-type which is not type in {FT_Data, FT_MachineDependent, FT_Relaxable, FT_CompactEncodedInst} at the beginning, will reference the ending branch instruction's fragment.
>
>   Then, when it comes time to do relaxation, every one of those machine-dependent-fragments has the opportunity to grow its instruction a little bit. The first instruction in a "block" will grow up to 5 segment prefixes (via modifying the BranchPrefix fragment), and then if more is needed, more prefixes will be added to the next instruction, and so on. Until you run out of instructions in the region. At which point the BranchPadding or FusedJccPadding types (right before the branch/fused-branch) will be able to emit nops to achieve the desired alignment.
>
>   An alternative would be to simply emit NOPs before branches as needed. That would be substantially simpler, since it would only require special handling for a branch or a fused-branch. I assume things were done this substantially-more-complex way in order to reduce performance cost of inserting NOP instructions? Are there numbers for how much better it is to use segment prefixes, vs a separate nop instruction? It seems a little bit surprising to me that it would be that important, but I don't know...

I don't have any numbers myself. I was only involved in some of the code review internally. My understanding is that NOP instructions would place extra nop uops into the DSB(the decoded uop buffer) and that limits the performance that can be recovered. By using redundant prefixes no extra uops are generated and more performance is recovered.

> I'll note that the method here has the semantic issue of making it effectively impossible to ever evaluate an expression like ".if . - symbol == 24" (assuming we're emitting instructions), since every instruction can now change size. I suspect that will make it impossible to turn this on by default without breaking a lot of assembly code. Previously, only certain instructions, like branches or arithmetic ops with constant arguments of unknown value, could change size.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D70157/new/

https://reviews.llvm.org/D70157