On Fri, Feb 9, 2018 at 12:26 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Fri, 2018-02-09 at 02:21 +0000, David Woodhouse wrote:
> > On Fri, 2018-02-09 at 01:18 +0000, David Woodhouse wrote:
> > >
> > >
> > > For now I'm just going to attempt to work around it like this in the
> > > kernel, so I can concentrate on the retpoline bits:
> > > http://david.woodhou.se/clang-percpu-hack.patch
> >
> > 32-bit doesn't boot. Built without CONFIG_RETPOLINE and with Clang 5.0
> > (and the above patch) it does. I'm rebuilding a Release build of
> > llvm/clang so that experimental kernel builds hopefully take less than
> > an hour, and will prod further in the morning.
>
> What is the intended ABI of __x86_indirect_thunk, which I have been
> calling the "ret-equivalent" retpoline? I see this happening
> (I ♥ 'qemu -d in_asm')...
>
> ----------------
> IN:
> 0xc136feea:  89 d8           movl    %ebx, %eax
> 0xc136feec:  89 f2           movl    %esi, %edx
> 0xc136feee:  8b 75 f0        movl    -0x10(%ebp), %esi
> 0xc136fef1:  89 f1           movl    %esi, %ecx
> 0xc136fef3:  ff 75 e0        pushl   -0x20(%ebp)
> 0xc136fef6:  e8 c5 f3 58 00  calll   0xc18ff2c0   # __x86_indirect_thunk
>
> ----------------
> IN:
> 0xc18ff2c0:  c3              retl    # Early boot, so it hasn't been turned into a proper retpoline yet
>
> ----------------
> IN:
> 0xc136fefb:  8d 34 7e        leal    (%esi, %edi, 2), %esi
>
>
> (gdb) list *0xc136fef6
> 0xc136fef6 is in sort (lib/sort.c:87).
> 82                      if (c < n - size &&
> 83                          cmp_func(base + c, base + c + size) < 0)
> 84                              c += size;
> 85                      if (cmp_func(base + r, base + c) >= 0)
> 86                              break;
> 87                      swap_func(base + r, base + c, size);
> 88              }
> 89      }
> 90
> 91      /* sort */
>
> You're pushing the target (-0x20(%ebp)) onto the stack and then
> *calling* __x86_indirect_thunk. So it looks like you're expecting
> __x86_indirect_thunk to do something like
>
>         call *4(%esp)
>         ret
>
> ... except that final 'ret' still leaves the target address on the
> stack, so there would also need to be a complicated dance, without
> using any registers, to pop that too.

Yeah, we expect a complicated dance that re-orders the stack to get the correct return address into the right place.

You can see the sequence in the comments here:
https://github.com/llvm-project/llvm-project-20170507/blob/master/llvm/lib/Target/X86/X86RetpolineThunks.cpp#L179-L194
>
> I expected the emitted code for a *call* using the thunk to look more
> like
>
>         jmp 2f
> 1:      pushl -0x20(%ebp)        # cmp_func
>         jmp __x86_indirect_thunk # jmp, not call
> 2:      call 1b                  # set up address for cmp_func to return to

Yeah, the specific goal was to minimize the code-size footprint at the call site, even though it means a few more instructions in the thunk. Our pattern also gives a minor reduction in the dynamic branches taken, at the cost of the push/pop churn.

There was briefly a discussion of a different instruction sequence to minimize the push/pop churn, but it didn't end up happening.

Anyway, it appears we have the first case where my suspicions were borne out: we have somewhat reasonably different ABIs for some of the thunks.

How should we name them to distinguish things?