[PATCH] D42453: Use branch funnels for virtual calls when retpoline mitigation is enabled.

Wed Jan 24 19:56:20 PST 2018

pcc added inline comments.

================
Comment at: llvm/lib/Transforms/IPO/LowerTypeTests.cpp:1106-1108
+    AsmOS << "leaq ${" << Args.size() << ":c}+" << T.Offset
+          << "(%rip), %r11\n";
+    Args.push_back(CombinedGlobalAddr);
----------------
chandlerc wrote:
> pcc wrote:
> > chandlerc wrote:
> > > pcc wrote:
> > > > chandlerc wrote:
> > > > > Emitting all of this with (I assume) inline assembly seems like a really messy design. It makes all of this entirely x86-specific, but *this* part is completely independent of architecture.
> > > > > 
> > > > > Why not emit LLVM IR? Is there no pattern of IR that actually lowers to reasonable branches? I find that a little bit surprising, but maybe we should just fix the lowering in that case?
> > > > The main reason is that we need to make sure that whatever code we generate conforms with the calling convention used for the virtual call. At this point there is no guarantee that we have correct information about the calling convention to be used here because the "source of truth" for the calling convention is the call site itself, and with ThinLTO there may not be a call site in the current module. We cannot even rely on the function prototype with ThinLTO because we may have discarded the prototype by this point. Furthermore there doesn't seem to be a point in preserving the calling convention because no matter what it is, we will always want the same code. That is what led me to the conclusion that what we need here is an IR construct that represents the notion of a branch funnel call to a function with an unknown prototype -- that is what the `llvm.icall.jumptable` intrinsic is.
> > > > 
> > > > I considered putting the implementation of this intrinsic in the backend like a regular intrinsic, but one of the problems with that is that one of the things that we need to be able to do is stitch together multiple global variables into a single global in order to implement the branch funnel as a binary search, and it is too late to do that in the backend. The LowerTypeTests pass was a convenient place to do it (since it already needs to know how to do it in order to implement `llvm.type.test`) but admittedly it may not be the best place. It occurred to me that it may be possible to implement the stitching together of globals in a pre-isel pass and the rest of `llvm.icall.jumptable` in the backend, but I think that's something that's best considered for a followup patch since it may have implications for how we implement `llvm.type.test` as well.
> > > The more I think about this the more I think that implementing this with inline asm is just the wrong design.
> > > 
> > > I understand the challenges you're hitting here, but I think we should solve them by lowering in the backend, not in the IR. The IR really can't model the kinds of things you're doing (and it shouldn't!).
> > > 
> > > My suggestion would be to change the `llvm.icall.<mumble>` intrinsic or add a during-lowering intrinsic that lets you prepare the global variable stuff as necessary in the IR pass, and hand that prepared information cleanly (and abstractly) to the backend where you can effectively lower it to branch instructions. Since you are (notionally) lowering a call, you may even be able to use the intrinsic all the way into the code generator and just lower this manually with an MI pass.
> > > 
> > > Another aspect that this will improve is that you shouldn't need to embed things into naked functions at all. This should be something that you can just expand wherever it is needed into the code.
> > I'm not sure that I understand your proposal. Are you proposing that in the WholeProgramDevirt pass we would transform each virtual call site into an intrinsic call with the list of possible targets, and then in the backend we would lower the intrinsic call into a call to a branch funnel "function" together with its definition?
> I'm imagining in the backend, we would lower the intrinsic call into an *inline* branch funnel across the destinations.
> 
> (I can also imagine using a pseudo instead of call to an intrinsic, but as you want to capture a call with some specific calling convention / signature / etc, using a call to an intrinsic seems a reasonable representation... if needed, you could even put the potential destinations into an operand bundle so that the formal argument list *is* actually the argument list which should be used for calling the target function.)
Are you sure you want an inline branch funnel? It will make the code more "branchy", which doesn't seem great for code size.

i.e. the two options are
```
call branchfunnel

branchfunnel:
cmp ...
jb target1
je target2
jmp target3
```
and
```
cmp ...
jb .Ltmp1
je .Ltmp2
call target3
jmp .Ltmp3
.Ltmp1:
call target1
jmp .Ltmp3
.Ltmp2:
call target2
.Ltmp3:
```
.

Putting that aside, there are problems with the idea of putting the list of targets in every intrinsic. Most importantly, it complicates matters significantly for ThinLTO. It means that each backend job would need to know the list of targets as well as the layout of the combined global that stores the vtables. We don't need that for any other purpose, so we would need to include that information in the summary just to support this. Because we are including the information in the summary, it means that any change to the class hierarchy would cause a ThinLTO cache miss. (Right now, in many cases, we can avoid a cache miss as a result of careful summary design.) In other words, this proposal would be making the code more complicated and less efficient. So it doesn't seem like a good direction to me.

I think we can both agree that lowering to inline asm isn't great. But I see a different way for this code to evolve so that it is properly layered. Essentially we would invent a new top-level entity (like a function or a global variable) that represents a thunk. There is already a need for such a top-level entity to represent CFI jump tables (currently we represent them in the same hackish way using inline asm), so to start with there would be two kinds of thunks: jump tables and branch funnels. Target specific code in the backend would lower the thunks to MI or MCInstrs.

This would have a few advantages which would be shared with your intrinsic proposal:
- the IR would be somewhat platform independent until the backend,
- thus it can be easily optimized by midlevel passes,
- the backend would be able to choose whether to emit the branch funnel inline or outline,
and most importantly for ThinLTO:
- it would support separate compilation.

Please let me know what you think.

https://reviews.llvm.org/D42453