[PATCH] D42453: Use branch funnels for virtual calls when retpoline mitigation is enabled.

Tue Jan 30 22:44:41 PST 2018

chandlerc added a comment.

Sorry for the delay, took me a bit to internalize what you were proposing.

================
Comment at: llvm/lib/Transforms/IPO/LowerTypeTests.cpp:1106-1108
+    AsmOS << "leaq ${" << Args.size() << ":c}+" << T.Offset
+          << "(%rip), %r11\n";
+    Args.push_back(CombinedGlobalAddr);
----------------
pcc wrote:
> chandlerc wrote:
> > pcc wrote:
> > > chandlerc wrote:
> > > > pcc wrote:
> > > > > chandlerc wrote:
> > > > > > Emitting all of this with (I assume) inline assembly seems like a really messy design. It makes all of this entirely x86-specific, but *this* part is completely independent of architecture.
> > > > > > 
> > > > > > Why not emit LLVM IR? Is there no pattern of IR that actually lowers to reasonable branches? I find that a little bit surprising, but maybe we should just fix the lowering in that case?
> > > > > The main reason is that we need to make sure that whatever code we generate conforms with the calling convention used for the virtual call. At this point there is no guarantee that we have correct information about the calling convention to be used here because the "source of truth" for the calling convention is the call site itself, and with ThinLTO there may not be a call site in the current module. We cannot even rely on the function prototype with ThinLTO because we may have discarded the prototype by this point. Furthermore there doesn't seem to be a point in preserving the calling convention because no matter what it is, we will always want the same code. That is what led me to the conclusion that what we need here is an IR construct that represents the notion of a branch funnel call to a function with an unknown prototype -- that is what the `llvm.icall.jumptable` intrinsic is.
> > > > > 
> > > > > I considered putting the implementation of this intrinsic in the backend like a regular intrinsic, but one of the problems with that is that one of the things that we need to be able to do is stitch together multiple global variables into a single global in order to implement the branch funnel as a binary search, and it is too late to do that in the backend. The LowerTypeTests pass was a convenient place to do it (since it already needs to know how to do it in order to implement `llvm.type.test`) but admittedly it may not be the best place. It occurred to me that it may be possible to implement the stitching together of globals in a pre-isel pass and the rest of `llvm.icall.jumptable` in the backend, but I think that's something that's best considered for a followup patch since it may have implications for how we implement `llvm.type.test` as well.
> > > > The more I think about this the more I think that implementing this with inline asm is just the wrong design.
> > > > 
> > > > I understand the challenges you're hitting here, but I think we should solve them by lowering in the backend, not in the IR. The IR really can't model the kinds of things you're doing (and it shouldn't!).
> > > > 
> > > > My suggestion would be to change the `llvm.icall.<mumble>` intrinsic or add a during-lowering intrinsic that lets you prepare the global variable stuff as necessary in the IR pass, and hand that prepared information cleanly (and abstractly) to the backend where you can effectively lower it to branch instructions. Since you are (notionally) lowering a call, you may even be able to use the intrinsic all the way into the code generator and just lower this manually with an MI pass.
> > > > 
> > > > Another aspect that this will improve is that you shouldn't need to embed things into naked functions at all. This should be something that you can just expand wherever it is needed into the code.
> > > I'm not sure that I understand your proposal. Are you proposing that in the WholeProgramDevirt pass we would transform each virtual call site into an intrinsic call with the list of possible targets, and then in the backend we would lower the intrinsic call into a call to a branch funnel "function" together with its definition?
> > I'm imagining in the backend, we would lower the intrinsic call into an *inline* branch funnel across the destinations.
> > 
> > (I can also imagine using a pseudo instead of call to an intrinsic, but as you want to capture a call with some specific calling convention / signature / etc, using a call to an intrinsic seems a reasonable representation... if needed, you could even put the potential destinations into an operand bundle so that the formal argument list *is* actually the argument list which should be used for calling the target function.)
> Are you sure you want an inline branch funnel? It will make the code more "branchy", which doesn't seem great for code size.
> 
> i.e. the two options are
> ```
> call branchfunnel
> 
> branchfunnel:
> cmp ...
> jb target1
> je target2
> jmp target3
> ```
> and
> ```
> cmp ...
> jb .Ltmp1
> je .Ltmp2
> call target3
> jmp .Ltmp3
> .Ltmp1:
> call target1
> jmp .Ltmp3
> .Ltmp2:
> call target2
> .Ltmp3:
> ```
> .
> 
> Putting that aside, there are problems with the idea of putting the list of targets in every intrinsic. Most importantly, it complicates matters significantly for ThinLTO. It means that each backend job would need to know the list of targets as well as the layout of the combined global that stores the vtables. We don't need that for any other purpose, so we would need to include that information in the summary just to support this. Because we are including the information in the summary, it means that any change to the class hierarchy would cause a ThinLTO cache miss. (Right now, in many cases, we can avoid a cache miss as a result of careful summary design.) In other words, this proposal would be making the code more complicated and less efficient. So it doesn't seem like a good direction to me.
> 
> I think we can both agree that lowering to inline asm isn't great. But I see a different way for this code to evolve so that it is properly layered. Essentially we would invent a new top-level entity (like a function or a global variable) that represents a thunk. There is already a need for such a top-level entity to represent CFI jump tables (currently we represent them in the same hackish way using inline asm), so to start with there would be two kinds of thunks: jump tables and branch funnels. Target specific code in the backend would lower the thunks to MI or MCInstrs.
> 
> This would have a few advantages which would be shared with your intrinsic proposal:
> - the IR would be somewhat platform independent until the backend,
> - thus it can be easily optimized by midlevel passes,
> - the backend would be able to choose whether to emit the branch funnel inline or outline,
> and most importantly for ThinLTO:
> - it would support separate compilation.
> 
> Please let me know what you think.
Having separate inline branch funnels may actually be nice when there are very few targets as they may be better predicted. There is a size/speed trade-off here.

However, the separate compilation issue is very compelling. I think you can use an intrinsic to implement your proposal and get the advantages you describe without any new top-level construct:

You could define / declare a thunk IR function which contains just this intrinsic. Only one TU gets the definition which calls the intrinsic and needs the global information. The others just get a declaration. No new top-level construct, and you lower it exactly as you describe.

I'm not sure if this will directly map to the jump table case, but I'm happy for that to be dealt with in a follow-up as that is already in tree.

If this idea doesn't work, let's chat about what alternatives would work or how to model the top-level entity. I feel pretty strongly that the current code with the current amount of inline assembly is really not up to the quality that should go into the tree even as a temporary thing. There doesn't seem to be such urgency here that we can't take the time to engineer this the right way.

Also:
1) We didn't actually clean up the inline asm for the CFI jump tables despite those being in-tree for a long time, so I think it is reasonable to insist we don't make things worse.
2) This is a *much* more significant usage of inline asm, and so it seems much less reasonable as a temporary solution.

https://reviews.llvm.org/D42453