<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Fri, Jan 5, 2018 at 7:16 AM, Chandler Carruth via Phabricator <span dir="ltr"><<a href="mailto:reviews@reviews.llvm.org" target="_blank">reviews@reviews.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span>chandlerc marked an inline comment as done.<br>
</span>chandlerc added a comment.<br>
<span><br>
In <a href="https://reviews.llvm.org/D41723#967580" rel="noreferrer" target="_blank">https://reviews.llvm.org/D4172<wbr>3#967580</a>, @jyknight wrote:<br>
<br>
> In <a href="https://reviews.llvm.org/D41723#967554" rel="noreferrer" target="_blank">https://reviews.llvm.org/D4172<wbr>3#967554</a>, @chandlerc wrote:<br>
><br>
> > > That should be easy to support -- just don't do the thunk-emission -- but it does then mean the need to standardize on the names and semantics of the required thunks.<br>
> ><br>
> > I don't think this is true. We need to document LLVM's semantics and the thunks required. If they match GCCs, cool. If not and a user provides a flag to request an external thunk, then they have to give LLVM one that matches LLVM's semantics. Since they control the name, they can even have thunks with each compiler's semantics and different names. While in practice, I expect the semantics to match, I don't think we should be in the business of forcing this into the ABI. We already have waaaay to much in the ABI. If we come up with a better way to do retpoline in the future, we should rapidly switch to that without needing to coordinate about names and semantics across compilers.<br>
><br>
><br>
> Right, I shouldn't have said "standardize", I did actually mean "document".<br>
><br>
> >> And it would be good if the same function-names were used as GCC. Fine to do as a followup patch, but maybe at least have an idea what the command-line UI will be for it.<br>
> ><br>
> > I strongly disagree here. Again, this should *not* be part of the ABI and it should not be something that we have to engineer to match exactly with other compilers.<br>
><br>
> No, I agree with you, it does not _need_ to match. However, if the semantics are the same as GCC's thunks (which they do appear to be), it would be _good_ to coordinate with eachother, simply in order to make users' lives easier. Which is the same reason to coordinate the command-line UI. Neither are required, but both would be good.<br>
><br>
> > We should even be free to add `__llvm_retpoline2_foo()` thunks in the future with new semantics with zero compatibility concerns.<br>
><br>
> It wouldn't have _zero_ concerns due to externally-provided thunks -- they'll need to provide new ones to use the new compiler version. But probably that's okay.<br>
<br>
<br>
</span>Sure, sounds like we're generally in agreement here. I'll take a stab at supporting external thunks and skimming emission, shouldn't be too hard.<br>
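>
> For concreteness, the contract to document is tiny. A minimal sketch of what an external thunk would have to implement, based on the `%r11` thunk in this patch (the label names are illustrative, not part of the contract): transfer control to the address in `%r11` without ever executing an indirect branch.
>
> ```
> __llvm_retpoline_r11:
>   callq .Lset_up_target    # pushes .Lcapture_spec as the return address
> .Lcapture_spec:
>   pause                    # speculative execution of the ret is trapped here
>   jmp .Lcapture_spec
> .Lset_up_target:
>   movq %r11, (%rsp)        # overwrite the return address with the real target
>   retq                     # architecturally "returns" to the address in %r11
> ```
>
> Anything honoring that contract works; how the thunk does it is entirely up to whoever provides it.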
>
> ================
> Comment at: lld/ELF/Arch/X86.cpp:491
> +      0x83, 0xc4, 0x04, // next: add $4, %esp
> +      0x87, 0x04, 0x24, //       xchg %eax, (%esp)
> +      0xc3,             //       ret
> ----------------
> ruiu wrote:
> > rnk wrote:
> > > chandlerc wrote:
> > > > Does it make sense to use something like the `pushl` sequence Reid came up with here? In the non-PLT case it looks like:
> > > >
> > > > ```
> > > > addl $4, %esp
> > > > pushl 4(%esp)
> > > > pushl 4(%esp)
> > > > popl 8(%esp)
> > > > popl (%esp)
> > > > ```
> > > >
> > > > So it would potentially need to be done a touch differently to work here, but maybe something like that rather than `xchg`?
> > > >
> > > > Even if the alternative is a lot more instructions: `xchg` with a memory operand is a locked instruction on x86, so this will actually create synchronization overhead on the cache line holding the top of the stack.
> > > This is a real concern; I checked the Intel manual, and it says:
> > >
> > > > Exchanges the contents of the destination (first) and source (second) operands. The operands can be two general-purpose
> > > > registers or a register and a memory location. If a memory operand is referenced, the processor's locking
> > > > protocol is automatically implemented for the duration of the exchange operation, regardless of the presence or
> > > > absence of the LOCK prefix or of the value of the IOPL.
> > >
> > > One question is, do we want to try to avoid `PUSH/POP MEM` instructions? LLVM has x86 subtarget features that suggest these instructions are slow on some CPU models.
> > >
> > > To avoid them completely, we could use this code sequence:
> > > ```
> > > movl %ecx, (%esp)   # save ECX over useless RA
> > > movl 4(%esp), %ecx  # load original EAX to ECX
> > > movl %eax, 4(%esp)  # store callee over saved EAX
> > > movl %ecx, %eax     # restore EAX
> > > popl %ecx           # restore ECX
> > > retl                # return to callee
> > > ```
> > >
> > > When it comes down to it, this just implements `xchg` with a scratch register.
> > >
> > > On reflection, I think the code sequence above is the best we can do. The PUSH sequence you describe is 8 memory operations vs. 4 if we use ECX as scratch.
> > I didn't know that xchg was so slow. Then there's no reason not to use push/pop instructions to swap the word at the stack top with a register. Since this is a PLT header (and not a PLT entry), the size of the instructions doesn't really matter. Do you want me to update my patch?
> Yeah, if you can email me a delta on top of this, I'll fold it in. Thanks!
>
> (Suggesting that in part to make backports a bit easier...)
>
> ================
> Comment at: lld/ELF/Arch/X86_64.cpp:517-525
> +      0x4c, 0x8b, 0x1d, 0, 0, 0, 0, // mov foo@GOTPLT(%rip), %r11
> +      0xe8, 0x04, 0x00, 0x00, 0x00, // callq next
> +      0xf3, 0x90,                   // loop: pause
> +      0xeb, 0xfc,                   // jmp loop; .align 16
> +      0x4c, 0x89, 0x1c, 0x24,       // next: mov %r11, (%rsp)
> +      0xc3,                         // ret
> +      0x68, 0, 0, 0, 0,             // pushq <relocation index>
> ----------------
> ruiu wrote:
> > Chandler,
> >
> > I also noticed we can improve the instructions here. We can use the following sequence instead, so that the jump target used to lazy-resolve a PLT entry is aligned to 16 bytes. I can make the change now if you want.
> >
> >   mov foo@GOTPLT(%rip), %r11
> >   callq next
> >   loop: pause
> >   jmp plt+32; .align 16
> >   pushq <relocation index>  // lazy-resolve a PLT entry
> >   jmpq plt[0]
> Will this break the 16-byte property when combined with -z now? That seems more relevant. Given that the lazy binding is intended to run only once, I don't think it's worth burning much space to speed it up.

No, since for `-z now` we use a completely different code sequence for the PLT entry, and the new code sequence shouldn't be larger than the current one. I'm not sure how important it is to align the retpoline jump target to 16 bytes, though.
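
(Concretely, the `-z now` entry I have in mind is just the sequence quoted above with the lazy-resolution tail dropped -- a sketch; the exact encoding may differ:)

  mov foo@GOTPLT(%rip), %r11  // GOT slot is fully resolved under -z now
  callq next
  loop: pause                 // speculation trap for the ret below
  jmp loop
  next: mov %r11, (%rsp)      // overwrite the return address with the real target
  ret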

> ================
> Comment at: llvm/lib/CodeGen/IndirectBrExpandPass.cpp:113
> +      // index.
> +      if (BBIndex == -1) {
> +        BBIndex = BBs.size();
> ----------------
> efriedma wrote:
> > blockaddresses are uniqued, so no block should ever have more than one blockaddress user. So this should probably be an assertion.
> I just didn't want to hard-code that assumption, but I can if you prefer.
>
> ================
> Comment at: llvm/lib/CodeGen/IndirectBrExpandPass.cpp:134
> +  // Now rewrite each indirectbr to cast its loaded pointer to an integer and
> +  // switch on it using the integer map from above.
> +  for (auto *IBr : IndirectBrs) {
> ----------------
> sanjoy wrote:
> > Do we care about inline assembly here? The LangRef says "Finally, some targets may provide defined semantics when using the value as the operand to an inline assembly, but that is target specific."
> I mean, yes, but also no. ;]
>
> It would be nice to preserve inline asm uses of blockaddr and not any others, and then force people not to launder their blockaddr usage through inline asm and mix that with `-mretpoline`. That would allow the common usage I'm aware of to remain (showing label addresses in crash dumps in things like kernels), and really would allow almost any usage wholly contained within inline asm to continue working perfectly.
>
> But it seemed reasonable for a follow-up. That said, maybe it's not too complex to add now...
>
> ================
> Comment at: llvm/lib/Target/X86/X86InstrCompiler.td:1160
>            (TCRETURNdi tglobaladdr:$dst, imm:$off)>,
> -          Requires<[NotLP64]>;
> +          Requires<[NotLP64, NotUseRetpoline]>;
> ----------------
> AndreiGrischenko wrote:
> > Hi Chandler,
> > Do you really want to use "NotUseRetpoline" here? The pattern will then fall through to RETPOLINE_TCRETURN32 even though this is not an indirect branch.
> > I guess the following test case will crash llc:
> >
> > ```
> > target triple = "i386-unknown-linux-gnu"
> >
> > define void @FOO() {
> >   ret void
> > }
> >
> > define void @FOO1() {
> > entry:
> >   tail call void @FOO()
> >   ret void
> > }
> > ```
> Reid, could you take a look?
>
> ================
> Comment at: llvm/lib/Target/X86/X86RetpolineThunks.cpp:121
> +
> +    createThunk(M, "r11", X86::R11);
> +  } else {
> ----------------
> AndreiGrischenko wrote:
> > You will create the thunk function even when it is not necessary? For example, for `tail call void @FOO()`?
> There is a FIXME above about being more careful when creating these.
>
>
> https://reviews.llvm.org/D41723