<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Fri, Feb 9, 2018 at 12:26 AM David Woodhouse <<a href="mailto:dwmw2@infradead.org">dwmw2@infradead.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

On Fri, 2018-02-09 at 02:21 +0000, David Woodhouse wrote:<br>

> On Fri, 2018-02-09 at 01:18 +0000, David Woodhouse wrote:<br>

> ><br>

> ><br>

> > For now I'm just going to attempt to work around it like this in the<br>

> > kernel, so I can concentrate on the retpoline bits:<br>

> >  <a href="http://david.woodhou.se/clang-percpu-hack.patch" rel="noreferrer" target="_blank">http://david.woodhou.se/clang-percpu-hack.patch</a><br>

><br>

> 32-bit doesn't boot. Built without CONFIG_RETPOLINE and with Clang 5.0<br>

> (and the above patch) it does. I'm rebuilding a Release build of<br>

> llvm/clang so that experimental kernel builds hopefully take less than<br>

> an hour, and will prod further in the morning.<br>

<br>

What is the intended ABI of __x86_indirect_thunk which I have been<br>

calling the "ret-equivalent" retpoline? I see this happening<br>

(I ♥ 'qemu -d in_asm')...<br>

<br>

----------------<br>

IN: <br>

0xc136feea:  89 d8                    movl     %ebx, %eax<br>

0xc136feec:  89 f2                    movl     %esi, %edx<br>

0xc136feee:  8b 75 f0                 movl     -0x10(%ebp), %esi<br>

0xc136fef1:  89 f1                    movl     %esi, %ecx<br>

0xc136fef3:  ff 75 e0                 pushl    -0x20(%ebp)<br>

0xc136fef6:  e8 c5 f3 58 00           calll    0xc18ff2c0 # __x86_indirect_thunk<br>

<br>

----------------<br>

IN: <br>

0xc18ff2c0:  c3                       retl     # Early boot, so it hasn't been turned into a proper retpoline yet<br>

<br>

----------------<br>

IN: <br>

0xc136fefb:  8d 34 7e                 leal     (%esi, %edi, 2), %esi<br>

<br>

<br>

(gdb) list *0xc136fef6<br>

0xc136fef6 is in sort (lib/sort.c:87).<br>

82                              if (c < n - size &&<br>

83                                              cmp_func(base + c, base + c + size) < 0)<br>

84                                      c += size;<br>

85                              if (cmp_func(base + r, base + c) >= 0)<br>

86                                      break;<br>

87                              swap_func(base + r, base + c, size);<br>

88                      }<br>

89              }<br>

90<br>

91              /* sort */<br>

<br>

You're pushing the target (-0x20(%ebp)) onto the stack and then<br>

*calling* __x86_indirect_thunk. So it looks like you're expecting<br>

__x86_indirect_thunk to do something like<br>

<br>

  call *4(%esp)<br>

  ret<br>

<br>

... except that final 'ret' still leaves the target address on the<br>

stack, so there would also need to be a complicated dance, without<br>

using any registers, to pop that too.<br></blockquote><div><br></div><div>Yeah, we expect the thunk to do a complicated dance that re-orders the stack so that the correct return address ends up in the correct place.</div><div><br></div><div>You can see the sequence in the comments here:</div><div><a href="https://github.com/llvm-project/llvm-project-20170507/blob/master/llvm/lib/Target/X86/X86RetpolineThunks.cpp#L179-L194">https://github.com/llvm-project/llvm-project-20170507/blob/master/llvm/lib/Target/X86/X86RetpolineThunks.cpp#L179-L194</a></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I expected the emitted code for a *call* using the thunk to look more<br>

like<br>

<br>

   jmp 2f <br>

1: pushl -0x20(%ebp)        # cmp_func<br>

   jmp __x86_indirect_thunk # jmp, not call<br>

2: call 1b                  # set up address for cmp_func to return to<br></blockquote><div><br></div><div>Yeah, the specific goal was to minimize the code size at each call site, even though it means a few more instructions in the thunk. Our pattern also slightly reduces the number of dynamic branches taken, at the cost of the push/pop churn.</div><div><br></div><div>There was briefly a discussion of a different instruction sequence to minimize the push/pop churn, but it didn't end up happening.</div><div><br></div><div><br></div><div>Anyway, this appears to be the first case where my suspicions were borne out: we have ended up with fairly different ABIs for some of the thunks.</div><div><br></div><div>How should we name them to tell them apart?</div></div></div>
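<div><br></div><div>To make the ABI mismatch concrete, here is a rough sketch in C of what the stack looks like around the thunk call. It is only a toy model under stated assumptions, not the thunk LLVM emits and not kernel code: the struct, the function names, and the target address used in the example are all hypothetical, and the speculation-capture loop of a real retpoline is left out entirely. It models the call site from the trace (push the target, then call the thunk), a placeholder thunk that is a bare retl, and a thunk that re-orders the top two stack slots before returning.</div>

```c
/* Toy model only. Every name below is hypothetical and chosen for
 * illustration; none of this is taken from LLVM or the kernel. */

enum { STACK_SLOTS = 16 };

struct machine {
    unsigned long stack[STACK_SLOTS];
    int sp;            /* index of the top of stack; the stack grows downward */
    unsigned long pc;  /* where execution continues next */
};

static void push(struct machine *m, unsigned long v)
{
    m->stack[--m->sp] = v;
}

static unsigned long pop(struct machine *m)
{
    return m->stack[m->sp++];
}

/* Call site as seen in the trace: pushl target; calll thunk.
 * The calll itself pushes the return address on top of the target. */
static void call_site(struct machine *m, unsigned long target,
                      unsigned long ret_addr)
{
    push(m, target);
    push(m, ret_addr);
}

/* Placeholder thunk that is a bare retl: execution resumes at the call
 * site's return address, the target is never called, and the target is
 * still sitting on the stack afterwards. */
static void thunk_bare_ret(struct machine *m)
{
    m->pc = pop(m);
}

/* Thunk matching the push-then-call ABI: swap the top two slots so the
 * target is on top, then "ret" into it. The target then starts with the
 * caller's return address on top of the stack. A real thunk has to do
 * this without clobbering live registers. */
static void thunk_reorder(struct machine *m)
{
    unsigned long ret_addr = m->stack[m->sp];

    m->stack[m->sp] = m->stack[m->sp + 1];
    m->stack[m->sp + 1] = ret_addr;
    m->pc = pop(m);
}
```

<div>Starting from an empty stack (sp set to STACK_SLOTS), call_site followed by thunk_bare_ret leaves pc at the return address with the target still on the stack, whereas call_site followed by thunk_reorder leaves pc at the target with the return address on top, which is what a normally called function expects on entry.</div>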