From libc-dev at lists.llvm.org Wed Apr 1 10:40:33 2020 From: libc-dev at lists.llvm.org (Siva Chandra via libc-dev) Date: Wed, 1 Apr 2020 10:40:33 -0700 Subject: [libc-dev] Platform and architecture abstraction layers In-Reply-To: <0392857d-cd99-fe9b-ade8-883daa1e3bb3@theravensnest.org> References: <0392857d-cd99-fe9b-ade8-883daa1e3bb3@theravensnest.org> Message-ID: On Tue, Mar 31, 2020 at 7:42 AM David Chisnall via libc-dev < libc-dev at lists.llvm.org> wrote: > Hello libc people, > > When llvm-libc was approved for incorporation into LLVM, the stated goal > was to provide a portable libc implementation. I'm quite concerned that > we are increasingly seeing a load of Linux-specific code being > committed. This is all in a linux directory, which is fine in theory, > but I don't want us to end up with an entire libc in linux/x86-64/* and > nothing elsewhere. It is much easier to build portable software from > the start than it is to retrofit portability. We need to start putting > in platform and architecture abstraction layers from the start. > I totally agree. We have to start somewhere, and we have started with x86_64 on linux as that is the most easily accessible platform for us working on libc. As the structure gets more complicated, and as we start adding more architectures (which we will), we will definitely pull out the common code and build abstraction layers. The memcpy implementation already has a small example of abstraction (perhaps trivial at this point) for handling i386 and x86_64. > Most of the code committed in linux/x86_64/start.cpp is either > identical or almost identical on all architectures and all System-V ABI > platforms. As the code is currently structured, this is going to be > copied and pasted repeatedly. > The same points as above hold for this as well. But, I take your point and do understand what you are saying. FWIW, no one wants copy-pasted code. As I said above, we have to start somewhere and this is where we started. A few observations, which are probably not related to the topic of this thread: From my side personally, I have been very particular about getting the code structure right, both in my own patches and when doing code reviews. At the same time, I also do not want the quest for perfection to hinder progress. A nice side effect of allowing progress is that it allows us to learn. And, there have been a few instances wherein we reversed our initial decisions as we realized they were not good enough. Also, for a project as vast as a libc, trying and experimenting should be encouraged even if not all of the experiments succeed (personally, I am OK if most of them fail as long as they teach us something). Thanks, Siva Chandra -------------- next part -------------- An HTML attachment was scrubbed... URL: From libc-dev at lists.llvm.org Wed Apr 15 21:48:44 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Thu, 16 Apr 2020 06:48:44 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200415225539.GL11469@brightrain.aerifal.cx> (Rich Felker's message of "Wed, 15 Apr 2020 18:55:39 -0400") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> Message-ID: <87k12gf32r.fsf@mid.deneb.enyo.de> * Rich Felker: > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention.
The i386 mechanism has received some criticism because it provides an effective means to redirect execution flow to anyone who can write to the TCB. I am not sure if it makes sense to copy it. From libc-dev at lists.llvm.org Thu Apr 16 09:42:32 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Thu, 16 Apr 2020 18:42:32 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416153509.GT11469@brightrain.aerifal.cx> (Rich Felker's message of "Thu, 16 Apr 2020 11:35:09 -0400") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> Message-ID: <87sgh3e613.fsf@mid.deneb.enyo.de> * Rich Felker: > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. >> >> The i386 mechanism has received some criticism because it provides an >> effective means to redirect execution flow to anyone who can write to >> the TCB. I am not sure if it makes sense to copy it. > > Indeed that's a good point. Do you have ideas for making it equally > efficient without use of a function pointer in the TCB? We could add a shared non-writable mapping at a 64K offset from the thread pointer and store the function pointer or the code there. Then it would be safe. However, since this is apparently tied to POWER9 and we already have a POWER9 multilib, and assuming that we are going to backport the kernel change, I would tweak the selection criterion for that multilib to include the new HWCAP2 flag. If a user runs this glibc on a kernel which does not have support, they will get the baseline (POWER8) multilib, which still works. This way, outside the dynamic loader, no run-time dispatch is needed at all. I guess this is not at all the answer you were looking for. 8-) If a single binary is needed, I would perhaps follow what Arm did for -moutline-atomics: lay out the code so that it's easy to execute for the non-POWER9 case, assuming that POWER9 machines will be better at predicting things than their predecessors. Or you could also put the function pointer into a RELRO segment. Then there's overlap with the __libc_single_threaded discussion, where people objected to this kind of optimization (although I did not propose to change the TCB ABI, that would be required for __libc_single_threaded because it's an external interface).
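A rough C sketch of the HWCAP2-based selection and read-only dispatch slot being discussed (an illustrative sketch, not glibc or kernel code; the stub bodies, names, and initialization point are hypothetical, and the fallback value given for PPC_FEATURE2_SCV should be checked against the kernel headers):

    #include <errno.h>
    #include <sys/auxv.h>

    #ifndef PPC_FEATURE2_SCV
    #define PPC_FEATURE2_SCV 0x00100000   /* assumed value; check <asm/cputable.h> */
    #endif

    /* Placeholder stubs: a real libc would put the 'sc' and 'scv 0' entry
       sequences here; only the selection logic matters for this sketch. */
    static long syscall_via_sc(long nr, long a0, long a1, long a2)
    {
        (void)nr; (void)a0; (void)a1; (void)a2;
        return -ENOSYS;
    }

    static long syscall_via_scv(long nr, long a0, long a1, long a2)
    {
        (void)nr; (void)a0; (void)a1; (void)a2;
        return -ENOSYS;
    }

    /* Resolved once at startup.  Per the concern above, this slot should not
       stay writable afterwards (e.g. keep it in RELRO or a read-only mapping),
       so that overwriting the TCB or heap cannot redirect every system call. */
    static long (*syscall_entry)(long, long, long, long) = syscall_via_sc;

    static void init_syscall_entry(void)
    {
        if (getauxval(AT_HWCAP2) & PPC_FEATURE2_SCV)
            syscall_entry = syscall_via_scv;
    }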
From libc-dev at lists.llvm.org Thu Apr 16 11:12:19 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Thu, 16 Apr 2020 20:12:19 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416165257.GY11469@brightrain.aerifal.cx> (Rich Felker's message of "Thu, 16 Apr 2020 12:52:57 -0400") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> <20200416165257.GY11469@brightrain.aerifal.cx> Message-ID: <87ftd3e1vg.fsf@mid.deneb.enyo.de> * Rich Felker: > On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: >> * Rich Felker: >> >> > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: >> >> * Rich Felker: >> >> >> >> > My preference would be that it work just like the i386 AT_SYSINFO >> >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> >> > provides a stub in the vdso that performs either scv or the old >> >> > mechanism with the same calling convention. >> >> >> >> The i386 mechanism has received some criticism because it provides an >> >> effective means to redirect execution flow to anyone who can write to >> >> the TCB. I am not sure if it makes sense to copy it. >> > >> > Indeed that's a good point. Do you have ideas for making it equally >> > efficient without use of a function pointer in the TCB? >> >> We could add a shared non-writable mapping at a 64K offset from the >> thread pointer and store the function pointer or the code there. Then >> it would be safe. >> >> However, since this is apparently tied to POWER9 and we already have a >> POWER9 multilib, and assuming that we are going to backport the kernel >> change, I would tweak the selection criterion for that multilib to >> include the new HWCAP2 flag. If a user runs this glibc on a kernel >> which does not have support, they will get set baseline (POWER8) >> multilib, which still works. This way, outside the dynamic loader, no >> run-time dispatch is needed at all. I guess this is not at all the >> answer you were looking for. 8-) > > How does this work with -static? :-) -static is not supported. 8-) (If you use the unsupported static libraries, you get POWER8 code.) (Just to be clear, in case someone doesn't get the joke: This is about a potential approach for a heavily constrained, vertically integrated environment. It does not reflect general glibc recommendations.) >> If a single binary is needed, I would perhaps follow what Arm did for >> -moutline-atomics: lay out the code so that its easy to execute for >> the non-POWER9 case, assuming that POWER9 machines will be better at >> predicting things than their predecessors. >> >> Or you could also put the function pointer into a RELRO segment. Then >> there's overlap with the __libc_single_threaded discussion, where >> people objected to this kind of optimization (although I did not >> propose to change the TCB ABI, that would be required for >> __libc_single_threaded because it's an external interface). > > Of course you can use a normal global, but now every call point needs > to setup a TOC pointer (= two entry points and more icache lines for > otherwise trivial functions). 
> > I think my choice would be just making the inline syscall be a single > call insn to an asm source file that out-of-lines the loading of TOC > pointer and call through it or branch based on hwcap so that it's not > repeated all over the place. I don't know how problematic control flow out of an inline asm is on POWER. But this is basically the -moutline-atomics approach. > Alternatively, it would perhaps work to just put hwcap in the TCB and > branch on it rather than making an indirect call to a function pointer > in the TCB, so that the worst you could do by clobbering it is execute > the wrong syscall insn and thereby get SIGILL. The HWCAP is already in the TCB. I expect this is what generic glibc builds are going to use (perhaps with a bit of tweaking favorable to POWER8 implementations, but we'll see). From libc-dev at lists.llvm.org Thu Apr 16 13:18:18 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Thu, 16 Apr 2020 22:18:18 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587004907.ioxh0bxsln.astroid@bobo.none> (Nicholas Piggin via Libc-alpha's message of "Thu, 16 Apr 2020 12:53:31 +1000") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> <1587002854.f0slo0111r.astroid@bobo.none> <20200416023542.GP11469@brightrain.aerifal.cx> <1587004907.ioxh0bxsln.astroid@bobo.none> Message-ID: <87wo6fchh1.fsf@mid.deneb.enyo.de> * Nicholas Piggin via Libc-alpha: > We may or may not be getting a new ABI that will use instructions not > supported by old processors. > > https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html > > Current ABI continues to work of course and be the default for some > time, but building for new one would give some opportunity to drop > such support for old procs, at least for glibc. If I recall correctly, during last year's GNU Tools Cauldron, I think it was pretty clear that this was only to be used for intra-DSO ABIs, not cross-DSO optimization. Relocatable object files have an ABI, too, of course, so that's why there's a ABI documentation needed. For cross-DSO optimization, the link editor would look at the DSO being linked in, check if it uses the -mfuture ABI, and apply some shortcuts. But at that point, if the DSO is swapped back to a version built without -mfuture, it no longer works with those newly linked binaries against the -mfuture version. Such a thing is a clear ABI bump, and based what I remember from Cauldron, that is not the plan here. (I don't have any insider knowledge—I just don't want people to read this think: gosh, yet another POWER ABI bump. But the PCREL stuff *is* exciting!) From libc-dev at lists.llvm.org Wed Apr 15 14:45:09 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Thu, 16 Apr 2020 07:45:09 +1000 Subject: [libc-dev] Powerpc Linux 'scv' system call ABI proposal take 2 Message-ID: <1586931450.ub4c8cq8dj.astroid@bobo.none> I would like to enable Linux support for the powerpc 'scv' instruction, as a faster system call instruction. This requires two things to be defined: Firstly a way to advertise to userspace that kernel supports scv, and a way to allocate and advertise support for individual scv vectors. Secondly, a calling convention ABI for this new instruction. 
Thanks to those who commented last time; since then I have removed my answered questions and unpopular alternatives, but you can find them here: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html Let me try one more time with a wider cc list, and then we'll get something merged. Any questions or counter-opinions are welcome. System Call Vectored (scv) ABI ============================== The scv instruction is introduced with POWER9 / ISA3; it comes with an rfscv counterpart. The benefit of these instructions is performance (trading slower SRR0/1 for faster LR/CTR registers, and entering the kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR updates). The scv instruction has 128 interrupt entry points (not enough to cover the Linux system call space). The proposal is to assign scv numbers very conservatively and allocate them as individual HWCAP features as we add support for more. The zero vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. Advertisement Linux has not enabled FSCR[SCV] yet, so the instruction will cause a SIGILL in current environments. Linux has defined a HWCAP2 bit PPC_FEATURE2_SCV for SCV support, but does not set it. When scv instruction support and the scv 0 vector for system calls are added, PPC_FEATURE2_SCV will indicate support for these. Other vectors should not be used without future HWCAP bits indicating support, which is how we will allocate them. (Should unallocated ones generate SIGILL, or return -ENOSYS in r3?) Calling convention The proposal is for scv 0 to provide the standard Linux system call ABI with the following differences from the sc convention[1]: - LR is to be volatile across scv calls. This is necessary because the scv instruction clobbers LR. From previous discussion, this should be possible to deal with in GCC clobbers and CFI. - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the kernel system call exit to avoid restoring the CR register (although we probably still would anyway to avoid information leak). - Error handling: I think the consensus has been to move to using a negative return value in r3 rather than CR0[SO]=1 to indicate error, which matches most other architectures and is closer to a function call. The number of scratch registers (r9-r12) at kernel entry seems sufficient that we don't have any costly spilling; the patch is here[2]. [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html From libc-dev at lists.llvm.org Wed Apr 15 15:55:39 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Wed, 15 Apr 2020 18:55:39 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1586931450.ub4c8cq8dj.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> Message-ID: <20200415225539.GL11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > I would like to enable Linux support for the powerpc 'scv' instruction, > as a faster system call instruction. > > This requires two things to be defined: Firstly a way to advertise to > userspace that kernel supports scv, and a way to allocate and advertise > support for individual scv vectors. Secondly, a calling convention ABI > for this new instruction.
> > Thanks to those who commented last time, since then I have removed my > answered questions and unpopular alternatives but you can find them > here > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > > Let me try one more with a wider cc list, and then we'll get something > merged. Any questions or counter-opinions are welcome. > > System Call Vectored (scv) ABI > ============================== > > The scv instruction is introduced with POWER9 / ISA3, it comes with an > rfscv counter-part. The benefit of these instructions is performance > (trading slower SRR0/1 with faster LR/CTR registers, and entering the > kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > updates. The scv instruction has 128 interrupt entry points (not enough > to cover the Linux system call space). > > The proposal is to assign scv numbers very conservatively and allocate > them as individual HWCAP features as we add support for more. The zero > vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. > > Advertisement > > Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > SIGILL in current environments. Linux has defined a HWCAP2 bit > PPC_FEATURE2_SCV for SCV support, but does not set it. > > When scv instruction support and the scv 0 vector for system calls are > added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > should not be used without future HWCAP bits indicating support, which is > how we will allocate them. (Should unallocated ones generate SIGILL, or > return -ENOSYS in r3?) > > Calling convention > > The proposal is for scv 0 to provide the standard Linux system call ABI > with the following differences from sc convention[1]: > > - LR is to be volatile across scv calls. This is necessary because the > scv instruction clobbers LR. From previous discussion, this should be > possible to deal with in GCC clobbers and CFI. > > - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > kernel system call exit to avoid restoring the CR register (although > we probably still would anyway to avoid information leak). > > - Error handling: I think the consensus has been to move to using negative > return value in r3 rather than CR0[SO]=1 to indicate error, which matches > most other architectures and is closer to a function call. > > The number of scratch registers (r9-r12) at kernel entry seems > sufficient that we don't have any costly spilling, patch is here[2]. > > [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html My preference would be that it work just like the i386 AT_SYSINFO where you just replace "int $128" with "call *%%gs:16" and the kernel provides a stub in the vdso that performs either scv or the old mechanism with the same calling convention. Then if the kernel doesn't provide it (because the kernel is too old) libc would have to provide its own stub that uses the legacy method and matches the calling convention of the one the kernel is expected to provide. Note that any libc that actually makes use of the new functionality is not going to be able to make clobbers conditional on support for it; branching around different clobbers is going to defeat any gains vs always just treating anything clobbered by either method as clobbered. 
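As a concrete illustration of "treat anything clobbered by either method as clobbered", a ppc64 wrapper in the style of existing libc 'sc' sequences could simply widen its clobber list to also cover what the proposed scv convention makes volatile (LR, CTR, CR1, CR5-CR7). This is a sketch under those assumptions, not actual musl or glibc code, and it assumes the toolchain accepts "lr" and "ctr" in an asm clobber list, which the proposal says should be workable:

    static inline long syscall3_common_clobbers(long n, long a, long b, long c)
    {
        register long r0 __asm__("r0") = n;
        register long r3 __asm__("r3") = a;
        register long r4 __asm__("r4") = b;
        register long r5 __asm__("r5") = c;
        __asm__ __volatile__(
            "sc\n\t"          /* an 'scv 0' variant would use that instruction here */
            "bns+ 1f\n\t"     /* sc convention: CR0[SO] set on error, so...          */
            "neg %1, %1\n"    /* ...fold it into a negative return value             */
            "1:"
            : "+r"(r0), "+r"(r3), "+r"(r4), "+r"(r5)
            :
            : "memory", "cr0",
              /* extra clobbers so the same statement is also valid under the
                 proposed scv convention (illustrative, not exhaustive): */
              "lr", "ctr", "xer", "cr1", "cr5", "cr6", "cr7",
              "r6", "r7", "r8", "r9", "r10", "r11", "r12");
        return r3;
    }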
Likewise, it's not useful to have different error return mechanisms because the caller just has to branch to support both (or the kernel-provided stub just has to emulate one for it; that could work if you really want to change the bad existing convention). Thoughts? Rich From libc-dev at lists.llvm.org Wed Apr 15 16:11:42 2020 From: libc-dev at lists.llvm.org (Tanya Lattner via libc-dev) Date: Wed, 15 Apr 2020 16:11:42 -0700 Subject: [libc-dev] 2020 US LLVM Developers' Meeting - September 28-29 Message-ID: <7E8E10D5-9C22-4185-9DB5-C09B84A45D30@llvm.org> The LLVM Foundation is pleased to announce the 14th annual LLVM Developers’ Meeting in the Bay Area on September 28-29 in San Jose, CA. We will have additional events on September 27th in the afternoon/evening which will be announced when available. The LLVM Developers' Meeting is a bi-annual, 2-day gathering of the entire LLVM Project community. The conference is organized by the LLVM Foundation and many volunteers within the LLVM community. Developers and users of LLVM, Clang, and related subprojects will enjoy attending interesting talks, impromptu discussions, and networking with the many members of our community. Whether you are new to the LLVM project or a long-time member, there is something for each attendee. Given the current situation regarding COVID-19, we feel it is best to be totally transparent with our planning process. We are closely monitoring the news regarding restrictions on travel and large gatherings and also following the World Health Organization's advice. It takes about 9-12 months of planning for our developers’ meetings, and given we do not know the situation in September, we are moving forward with the hope that it will be safe to host our event. What can you expect at an LLVM Developers' Meeting? Technical Talks: These 20-30 minute talks cover all topics, from core infrastructure to projects using LLVM's infrastructure. Attendees will take away technical information that could be pertinent to their project or of general interest. Tutorials: Tutorials are 50 minute sessions that dive deep into a technical topic. Expect in-depth examples and explanations. Lightning Talks: These are fast 5 minute talks that give you a taste of a project or topic. Attendees will hear a wide range of topics and will probably leave wanting to learn more. Panels: Panel sessions are guided discussions about a specific topic. The panel consists of ~3 developers who discuss a topic through prepared questions from a moderator. The audience is also given the opportunity to ask questions of the panel. Birds of a Feather: Large round table discussions with a more formal, directed discussion. Student Research Competition: Students present their research using LLVM or related subprojects. These are usually 20 minute technical presentations with Q&A. The audience will vote at the end for the winning presentation and paper. Poster Session: An hour-long session where selected posters are on display for attendees to ask questions and discuss. Round Table Discussions: Informal and impromptu discussions on a specific topic. During the conference there are set time slots where groups can organize to discuss a problem or topic. Evening Reception (September 28): After a full day of technical talks and discussions, join your fellow attendees for an evening reception to continue the conversation and meet even more attendees. What types of people attend?
• Active developers of projects in the LLVM Umbrella (LLVM core, Clang, LLDB, libc++, compiler_rt, klee, lld, etc). • Anyone interested in using these as part of another project. • Students and Researchers • Compiler, programming language, and runtime enthusiasts. • Those interested in using compiler and toolchain technology in novel and interesting ways. More information regarding call for papers, registration, travel grants,etc, will be coming in the next month. For future announcements or questions: Please sign up for the LLVM Developers’ Meeting mailing list . -------------- next part -------------- An HTML attachment was scrubbed... URL: From libc-dev at lists.llvm.org Wed Apr 15 17:16:54 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Thu, 16 Apr 2020 10:16:54 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200415225539.GL11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> Message-ID: <1586994952.nnxigedbu2.astroid@bobo.none> Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> ============================== >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. 
From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. I'm not sure if that's necessary. That's done on x86-32 because they select different sequences to use based on the CPU running and if the host kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP bits and select the right sequence in libc as well I suppose. > Note that any libc that actually makes use of the new functionality is > not going to be able to make clobbers conditional on support for it; > branching around different clobbers is going to defeat any gains vs > always just treating anything clobbered by either method as clobbered. Well it would have to test HWCAP and patch in or branch to two completely different sequences including register save/restores yes. You could have the same asm and matching clobbers to put the sequence inline and then you could patch the one sc/scv instruction I suppose. A bit of logic to select between them doesn't defeat gains though, it's about 90 cycle improvement which is a handful of branch mispredicts so it really is an improvement. Eventually userspace will stop supporting the old variant too. > Likewise, it's not useful to have different error return mechanisms > because the caller just has to branch to support both (or the > kernel-provided stub just has to emulate one for it; that could work > if you really want to change the bad existing convention). > > Thoughts? The existing convention has to change somewhat because of the clobbers, so I thought we could change the error return at the same time. I'm open to not changing it and using CR0[SO], but others liked the idea. Pro: it matches sc and vsyscall. Con: it's different from other common archs. Performnce-wise it would really be a wash -- cost of conditional branch is not the cmp but the mispredict. 
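For concreteness, this is what the proposed negative-return convention means for the C caller, assuming the usual Linux rule that the last 4095 values are error codes (a sketch with a hypothetical helper name, not libc code; under the existing sc convention the CR0[SO] bit has to be folded in first, e.g. with the "bns; neg" trick, before the same check applies):

    #include <errno.h>

    /* Proposed scv 0 convention: r3 already holds -errno on failure. */
    static inline long syscall_ret(long r3)
    {
        if ((unsigned long)r3 > -4096UL) {
            errno = (int)-r3;
            return -1;
        }
        return r3;
    }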
Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 15 17:48:43 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Wed, 15 Apr 2020 20:48:43 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1586994952.nnxigedbu2.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> Message-ID: <20200416004843.GO11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> > >> Thanks to those who commented last time, since then I have removed my > >> answered questions and unpopular alternatives but you can find them > >> here > >> > >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html > >> > >> Let me try one more with a wider cc list, and then we'll get something > >> merged. Any questions or counter-opinions are welcome. > >> > >> System Call Vectored (scv) ABI > >> ============================== > >> > >> The scv instruction is introduced with POWER9 / ISA3, it comes with an > >> rfscv counter-part. The benefit of these instructions is performance > >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the > >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR > >> updates. The scv instruction has 128 interrupt entry points (not enough > >> to cover the Linux system call space). > >> > >> The proposal is to assign scv numbers very conservatively and allocate > >> them as individual HWCAP features as we add support for more. The zero > >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. > >> > >> Advertisement > >> > >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a > >> SIGILL in current environments. Linux has defined a HWCAP2 bit > >> PPC_FEATURE2_SCV for SCV support, but does not set it. > >> > >> When scv instruction support and the scv 0 vector for system calls are > >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors > >> should not be used without future HWCAP bits indicating support, which is > >> how we will allocate them. (Should unallocated ones generate SIGILL, or > >> return -ENOSYS in r3?) > >> > >> Calling convention > >> > >> The proposal is for scv 0 to provide the standard Linux system call ABI > >> with the following differences from sc convention[1]: > >> > >> - LR is to be volatile across scv calls. This is necessary because the > >> scv instruction clobbers LR. From previous discussion, this should be > >> possible to deal with in GCC clobbers and CFI. > >> > >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the > >> kernel system call exit to avoid restoring the CR register (although > >> we probably still would anyway to avoid information leak). 
> >> > >> - Error handling: I think the consensus has been to move to using negative > >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches > >> most other architectures and is closer to a function call. > >> > >> The number of scratch registers (r9-r12) at kernel entry seems > >> sufficient that we don't have any costly spilling, patch is here[2]. > >> > >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst > >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html > > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > I'm not sure if that's necessary. That's done on x86-32 because they > select different sequences to use based on the CPU running and if the host > kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP > bits and select the right sequence in libc as well I suppose. It's not just a HWCAP. It's a contract between the kernel and userspace to support a particular calling convention that's not exposed except as the public entry point the kernel exports via AT_SYSINFO. > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. > > A bit of logic to select between them doesn't defeat gains though, > it's about 90 cycle improvement which is a handful of branch mispredicts > so it really is an improvement. Eventually userspace will stop > supporting the old variant too. Oh, I didn't mean it would neutralize the benefit of svc. Rather, I meant it would be worse to do: if (hwcap & X) { __asm__(... with some clobbers); } else { __asm__(... with different clobbers); } instead of just __asm__("indirect call" ... with common clobbers); where the indirect call is to an address ideally provided like on i386, or otherwise initialized to one of two or more code addresses in libc based on hwcap bits. > > Likewise, it's not useful to have different error return mechanisms > > because the caller just has to branch to support both (or the > > kernel-provided stub just has to emulate one for it; that could work > > if you really want to change the bad existing convention). > > > > Thoughts? > > The existing convention has to change somewhat because of the clobbers, > so I thought we could change the error return at the same time. I'm > open to not changing it and using CR0[SO], but others liked the idea. > Pro: it matches sc and vsyscall. Con: it's different from other common > archs. Performnce-wise it would really be a wash -- cost of conditional > branch is not the cmp but the mispredict. 
If you do the branch on hwcap at each syscall, then you significantly increase code size of every syscall point, likely turning a bunch of trivial functions that didn't need stack frames into ones that do. You also potentially make them need a TOC pointer. Making them all just do an indirect call unconditionally (with pointer in TLS like i386?) is a lot more efficient in code size and at least as good for performance. Rich From libc-dev at lists.llvm.org Wed Apr 15 19:24:16 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Thu, 16 Apr 2020 12:24:16 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416004843.GO11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> Message-ID: <1587002854.f0slo0111r.astroid@bobo.none> Excerpts from Rich Felker's message of April 16, 2020 10:48 am: > On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 16, 2020 8:55 am: >> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> >> I would like to enable Linux support for the powerpc 'scv' instruction, >> >> as a faster system call instruction. >> >> >> >> This requires two things to be defined: Firstly a way to advertise to >> >> userspace that kernel supports scv, and a way to allocate and advertise >> >> support for individual scv vectors. Secondly, a calling convention ABI >> >> for this new instruction. >> >> >> >> Thanks to those who commented last time, since then I have removed my >> >> answered questions and unpopular alternatives but you can find them >> >> here >> >> >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> >> >> Let me try one more with a wider cc list, and then we'll get something >> >> merged. Any questions or counter-opinions are welcome. >> >> >> >> System Call Vectored (scv) ABI >> >> ============================== >> >> >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> >> rfscv counter-part. The benefit of these instructions is performance >> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> >> updates. The scv instruction has 128 interrupt entry points (not enough >> >> to cover the Linux system call space). >> >> >> >> The proposal is to assign scv numbers very conservatively and allocate >> >> them as individual HWCAP features as we add support for more. The zero >> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> >> >> Advertisement >> >> >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> >> >> When scv instruction support and the scv 0 vector for system calls are >> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> >> should not be used without future HWCAP bits indicating support, which is >> >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> >> return -ENOSYS in r3?) 
>> >> >> >> Calling convention >> >> >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> >> with the following differences from sc convention[1]: >> >> >> >> - LR is to be volatile across scv calls. This is necessary because the >> >> scv instruction clobbers LR. From previous discussion, this should be >> >> possible to deal with in GCC clobbers and CFI. >> >> >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> >> kernel system call exit to avoid restoring the CR register (although >> >> we probably still would anyway to avoid information leak). >> >> >> >> - Error handling: I think the consensus has been to move to using negative >> >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> >> most other architectures and is closer to a function call. >> >> >> >> The number of scratch registers (r9-r12) at kernel entry seems >> >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> >> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html >> > >> > My preference would be that it work just like the i386 AT_SYSINFO >> > where you just replace "int $128" with "call *%%gs:16" and the kernel >> > provides a stub in the vdso that performs either scv or the old >> > mechanism with the same calling convention. Then if the kernel doesn't >> > provide it (because the kernel is too old) libc would have to provide >> > its own stub that uses the legacy method and matches the calling >> > convention of the one the kernel is expected to provide. >> >> I'm not sure if that's necessary. That's done on x86-32 because they >> select different sequences to use based on the CPU running and if the host >> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP >> bits and select the right sequence in libc as well I suppose. > > It's not just a HWCAP. It's a contract between the kernel and > userspace to support a particular calling convention that's not > exposed except as the public entry point the kernel exports via > AT_SYSINFO. Right. >> > Note that any libc that actually makes use of the new functionality is >> > not going to be able to make clobbers conditional on support for it; >> > branching around different clobbers is going to defeat any gains vs >> > always just treating anything clobbered by either method as clobbered. >> >> Well it would have to test HWCAP and patch in or branch to two >> completely different sequences including register save/restores yes. >> You could have the same asm and matching clobbers to put the sequence >> inline and then you could patch the one sc/scv instruction I suppose. >> >> A bit of logic to select between them doesn't defeat gains though, >> it's about 90 cycle improvement which is a handful of branch mispredicts >> so it really is an improvement. Eventually userspace will stop >> supporting the old variant too. > > Oh, I didn't mean it would neutralize the benefit of svc. Rather, I > meant it would be worse to do: > > if (hwcap & X) { > __asm__(... with some clobbers); > } else { > __asm__(... with different clobbers); > } > > instead of just > > __asm__("indirect call" ... with common clobbers); Ah okay. Well that's debatable but if you didn't have an indirect call, rather a runtime-patched sequence, then yes saving the LR clobber or whatever wouldn't be worth a branch. 
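For reference, the indirect-call shape being debated here looks roughly like this (hypothetical names, not musl or glibc code): the pointer is filled in once at startup, from the vdso entry point or from HWCAP2, and each syscall site makes one unconditional call, so the clobber set is simply the C calling convention and covers either kernel entry sequence:

    #include <errno.h>

    static long fallback_entry(long nr, long a, long b, long c,
                               long d, long e, long f)
    {
        /* Placeholder for the legacy 'sc'-based path. */
        (void)nr; (void)a; (void)b; (void)c; (void)d; (void)e; (void)f;
        return -ENOSYS;   /* only to keep the sketch self-contained */
    }

    /* Set once at startup to the vdso-provided entry point (if any) or to a
       libc-internal stub chosen from HWCAP2. */
    static long (*syscall_dispatch)(long, long, long, long, long, long, long) =
        fallback_entry;

    static inline long my_syscall3(long nr, long a, long b, long c)
    {
        /* One indirect call per syscall site; the compiler already assumes a
           call clobbers every C-ABI volatile register, so no per-site hwcap
           branch or clobber-list variation is needed. */
        return syscall_dispatch(nr, a, b, c, 0, 0, 0);
    }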
> where the indirect call is to an address ideally provided like on > i386, or otherwise initialized to one of two or more code addresses in > libc based on hwcap bits. Right, I'm just skeptical we need the indirect call or need to provide it in the vdso. The "clever" reason to add it on x86-32 was because of the bugs and different combinations needed, that doesn't really apply to scv 0 and was not necessarily a great choice. > >> > Likewise, it's not useful to have different error return mechanisms >> > because the caller just has to branch to support both (or the >> > kernel-provided stub just has to emulate one for it; that could work >> > if you really want to change the bad existing convention). >> > >> > Thoughts? >> >> The existing convention has to change somewhat because of the clobbers, >> so I thought we could change the error return at the same time. I'm >> open to not changing it and using CR0[SO], but others liked the idea. >> Pro: it matches sc and vsyscall. Con: it's different from other common >> archs. Performnce-wise it would really be a wash -- cost of conditional >> branch is not the cmp but the mispredict. > > If you do the branch on hwcap at each syscall, then you significantly > increase code size of every syscall point, likely turning a bunch of > trivial functions that didn't need stack frames into ones that do. You > also potentially make them need a TOC pointer. Making them all just do > an indirect call unconditionally (with pointer in TLS like i386?) is a > lot more efficient in code size and at least as good for performance. I disagree. Doing the long vdso indirect call *necessarily* requires touching a new icache line, and even a new TLB entry. Indirect branches also tend to be more costly and/or less accurate to predict than direct even without spectre (generally fewer indirect predictor entries than direct, far branches in particular require a lot of bits for target). And with spectre we're flushing the indirect predictors on context switch or even disabling indirect prediction or flushing across privilege domains in the same context. And finally, the HWCAP test can eventually go away in future. A vdso call can not. If you really want to select with an indirect branch rather than direct conditional, you can do that all within the library. Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 15 19:35:42 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Wed, 15 Apr 2020 22:35:42 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587002854.f0slo0111r.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> <1587002854.f0slo0111r.astroid@bobo.none> Message-ID: <20200416023542.GP11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: > >> > Likewise, it's not useful to have different error return mechanisms > >> > because the caller just has to branch to support both (or the > >> > kernel-provided stub just has to emulate one for it; that could work > >> > if you really want to change the bad existing convention). > >> > > >> > Thoughts? > >> > >> The existing convention has to change somewhat because of the clobbers, > >> so I thought we could change the error return at the same time. I'm > >> open to not changing it and using CR0[SO], but others liked the idea. > >> Pro: it matches sc and vsyscall. 
Con: it's different from other common > >> archs. Performnce-wise it would really be a wash -- cost of conditional > >> branch is not the cmp but the mispredict. > > > > If you do the branch on hwcap at each syscall, then you significantly > > increase code size of every syscall point, likely turning a bunch of > > trivial functions that didn't need stack frames into ones that do. You > > also potentially make them need a TOC pointer. Making them all just do > > an indirect call unconditionally (with pointer in TLS like i386?) is a > > lot more efficient in code size and at least as good for performance. > > I disagree. Doing the long vdso indirect call *necessarily* requires > touching a new icache line, and even a new TLB entry. Indirect branches The increase in number of icache lines from the branch at every syscall point is far greater than the use of a single extra icache line shared by all syscalls. Not to mention the dcache line to access __hwcap or whatever, and the icache lines to setup access TOC-relative access to it. (Of course you could put a copy of its value in TLS at a fixed offset, which would somewhat mitigate both.) > And finally, the HWCAP test can eventually go away in future. A vdso > call can not. We support nearly arbitrarily old kernels (with limited functionality) and hardware (with full functionality) and don't intend for that to change, ever. But indeed glibc might want too eventually drop the check. > If you really want to select with an indirect branch rather than > direct conditional, you can do that all within the library. OK. It's a little bit more work if that's not the interface the kernel will give us, but it's no big deal. Rich From libc-dev at lists.llvm.org Wed Apr 15 19:53:31 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Thu, 16 Apr 2020 12:53:31 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416023542.GP11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> <1587002854.f0slo0111r.astroid@bobo.none> <20200416023542.GP11469@brightrain.aerifal.cx> Message-ID: <1587004907.ioxh0bxsln.astroid@bobo.none> Excerpts from Rich Felker's message of April 16, 2020 12:35 pm: > On Thu, Apr 16, 2020 at 12:24:16PM +1000, Nicholas Piggin wrote: >> >> > Likewise, it's not useful to have different error return mechanisms >> >> > because the caller just has to branch to support both (or the >> >> > kernel-provided stub just has to emulate one for it; that could work >> >> > if you really want to change the bad existing convention). >> >> > >> >> > Thoughts? >> >> >> >> The existing convention has to change somewhat because of the clobbers, >> >> so I thought we could change the error return at the same time. I'm >> >> open to not changing it and using CR0[SO], but others liked the idea. >> >> Pro: it matches sc and vsyscall. Con: it's different from other common >> >> archs. Performnce-wise it would really be a wash -- cost of conditional >> >> branch is not the cmp but the mispredict. >> > >> > If you do the branch on hwcap at each syscall, then you significantly >> > increase code size of every syscall point, likely turning a bunch of >> > trivial functions that didn't need stack frames into ones that do. You >> > also potentially make them need a TOC pointer. 
Making them all just do >> > an indirect call unconditionally (with pointer in TLS like i386?) is a >> > lot more efficient in code size and at least as good for performance. >> >> I disagree. Doing the long vdso indirect call *necessarily* requires >> touching a new icache line, and even a new TLB entry. Indirect branches > > The increase in number of icache lines from the branch at every > syscall point is far greater than the use of a single extra icache > line shared by all syscalls. That's true, I was thinking of a single function that does the test and calls syscalls, which might be the fair comparison. > Not to mention the dcache line to access > __hwcap or whatever, and the icache lines to setup access TOC-relative > access to it. (Of course you could put a copy of its value in TLS at a > fixed offset, which would somewhat mitigate both.) > >> And finally, the HWCAP test can eventually go away in future. A vdso >> call can not. > > We support nearly arbitrarily old kernels (with limited functionality) > and hardware (with full functionality) and don't intend for that to > change, ever. But indeed glibc might want too eventually drop the > check. Ah, cool. Any build-time flexibility there? We may or may not be getting a new ABI that will use instructions not supported by old processors. https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html Current ABI continues to work of course and be the default for some time, but building for new one would give some opportunity to drop such support for old procs, at least for glibc. > >> If you really want to select with an indirect branch rather than >> direct conditional, you can do that all within the library. > > OK. It's a little bit more work if that's not the interface the kernel > will give us, but it's no big deal. Okay. Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 15 20:03:04 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Wed, 15 Apr 2020 23:03:04 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587004907.ioxh0bxsln.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> <1587002854.f0slo0111r.astroid@bobo.none> <20200416023542.GP11469@brightrain.aerifal.cx> <1587004907.ioxh0bxsln.astroid@bobo.none> Message-ID: <20200416030304.GQ11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: > > Not to mention the dcache line to access > > __hwcap or whatever, and the icache lines to setup access TOC-relative > > access to it. (Of course you could put a copy of its value in TLS at a > > fixed offset, which would somewhat mitigate both.) > > > >> And finally, the HWCAP test can eventually go away in future. A vdso > >> call can not. > > > > We support nearly arbitrarily old kernels (with limited functionality) > > and hardware (with full functionality) and don't intend for that to > > change, ever. But indeed glibc might want too eventually drop the > > check. > > Ah, cool. Any build-time flexibility there? > > We may or may not be getting a new ABI that will use instructions not > supported by old processors. > > https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html > > Current ABI continues to work of course and be the default for some > time, but building for new one would give some opportunity to drop > such support for old procs, at least for glibc. 
What does "new ABI" entail to you? In the terminology I use with musl, "new ABI" and "new ISA level" are different things. You can compile (explicit -march or compiler default) binaries that won't run on older cpus due to use of new insns etc., but we consider it the same ABI if you can link code for an older/baseline ISA level with the newer-ISA-level object files, i.e. if the interface surface for linkage remains compatible. We also try to avoid gratuitous proliferation of different ABIs unless there's a strong underlying need (like addition of softfloat ABIs for archs that usually have FPU, or vice versa). In principle the same could be done for kernels except it's a bigger silent gotcha (possible ENOSYS in places where it shouldn't be able to happen rather than a trapping SIGILL or similar) and there's rarely any serious performance or size benefit to dropping support for older kernels. Rich From libc-dev at lists.llvm.org Wed Apr 15 20:41:01 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Thu, 16 Apr 2020 13:41:01 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416030304.GQ11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416004843.GO11469@brightrain.aerifal.cx> <1587002854.f0slo0111r.astroid@bobo.none> <20200416023542.GP11469@brightrain.aerifal.cx> <1587004907.ioxh0bxsln.astroid@bobo.none> <20200416030304.GQ11469@brightrain.aerifal.cx> Message-ID: <1587007359.3k5vvojlfu.astroid@bobo.none> Excerpts from Rich Felker's message of April 16, 2020 1:03 pm: > On Thu, Apr 16, 2020 at 12:53:31PM +1000, Nicholas Piggin wrote: >> > Not to mention the dcache line to access >> > __hwcap or whatever, and the icache lines to setup access TOC-relative >> > access to it. (Of course you could put a copy of its value in TLS at a >> > fixed offset, which would somewhat mitigate both.) >> > >> >> And finally, the HWCAP test can eventually go away in future. A vdso >> >> call can not. >> > >> > We support nearly arbitrarily old kernels (with limited functionality) >> > and hardware (with full functionality) and don't intend for that to >> > change, ever. But indeed glibc might want too eventually drop the >> > check. >> >> Ah, cool. Any build-time flexibility there? >> >> We may or may not be getting a new ABI that will use instructions not >> supported by old processors. >> >> https://sourceware.org/legacy-ml/binutils/2019-05/msg00331.html >> >> Current ABI continues to work of course and be the default for some >> time, but building for new one would give some opportunity to drop >> such support for old procs, at least for glibc. > > What does "new ABI" entail to you? In the terminology I use with musl, > "new ABI" and "new ISA level" are different things. You can compile > (explicit -march or compiler default) binaries that won't run on older > cpus due to use of new insns etc., but we consider it the same ABI if > you can link code for an older/baseline ISA level with the > newer-ISA-level object files, i.e. if the interface surface for > linkage remains compatible. We also try to avoid gratuitous > proliferation of different ABIs unless there's a strong underlying > need (like addition of softfloat ABIs for archs that usually have FPU, > or vice versa). Yeah it will be a new ABI type that also requires a new ISA level. 
As far as I know (and I'm not on the toolchain side) there will be some call compatibility between the two, so it may be fine to continue with existing ABI for libc. But it just something that comes to mind as a build-time cutover where we might be able to assume particular features. > In principle the same could be done for kernels except it's a bigger > silent gotcha (possible ENOSYS in places where it shouldn't be able to > happen rather than a trapping SIGILL or similar) and there's rarely > any serious performance or size benefit to dropping support for older > kernels. Right, I don't think it'd be a huge problem whatever way we go, compared with the cost of the system call. Thanks, Nick From libc-dev at lists.llvm.org Thu Apr 16 02:58:00 2020 From: libc-dev at lists.llvm.org (Szabolcs Nagy via libc-dev) Date: Thu, 16 Apr 2020 11:58:00 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1586994952.nnxigedbu2.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> Message-ID: <20200416095800.GC23945@port70.net> * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]: > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. how would that 'patch' work? there are many reasons why you don't want libc to write its .text From libc-dev at lists.llvm.org Thu Apr 16 07:16:04 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 16 Apr 2020 11:16:04 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200415225539.GL11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> Message-ID: On 15/04/2020 19:55, Rich Felker wrote: > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: >> I would like to enable Linux support for the powerpc 'scv' instruction, >> as a faster system call instruction. >> >> This requires two things to be defined: Firstly a way to advertise to >> userspace that kernel supports scv, and a way to allocate and advertise >> support for individual scv vectors. Secondly, a calling convention ABI >> for this new instruction. >> >> Thanks to those who commented last time, since then I have removed my >> answered questions and unpopular alternatives but you can find them >> here >> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html >> >> Let me try one more with a wider cc list, and then we'll get something >> merged. Any questions or counter-opinions are welcome. >> >> System Call Vectored (scv) ABI >> ============================== >> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an >> rfscv counter-part. The benefit of these instructions is performance >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR >> updates. The scv instruction has 128 interrupt entry points (not enough >> to cover the Linux system call space). >> >> The proposal is to assign scv numbers very conservatively and allocate >> them as individual HWCAP features as we add support for more. 
The zero >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'. >> >> Advertisement >> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a >> SIGILL in current environments. Linux has defined a HWCAP2 bit >> PPC_FEATURE2_SCV for SCV support, but does not set it. >> >> When scv instruction support and the scv 0 vector for system calls are >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors >> should not be used without future HWCAP bits indicating support, which is >> how we will allocate them. (Should unallocated ones generate SIGILL, or >> return -ENOSYS in r3?) >> >> Calling convention >> >> The proposal is for scv 0 to provide the standard Linux system call ABI >> with the following differences from sc convention[1]: >> >> - LR is to be volatile across scv calls. This is necessary because the >> scv instruction clobbers LR. From previous discussion, this should be >> possible to deal with in GCC clobbers and CFI. >> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the >> kernel system call exit to avoid restoring the CR register (although >> we probably still would anyway to avoid information leak). >> >> - Error handling: I think the consensus has been to move to using negative >> return value in r3 rather than CR0[SO]=1 to indicate error, which matches >> most other architectures and is closer to a function call. >> >> The number of scratch registers (r9-r12) at kernel entry seems >> sufficient that we don't have any costly spilling, patch is here[2]. >> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840.html > > My preference would be that it work just like the i386 AT_SYSINFO > where you just replace "int $128" with "call *%%gs:16" and the kernel > provides a stub in the vdso that performs either scv or the old > mechanism with the same calling convention. Then if the kernel doesn't > provide it (because the kernel is too old) libc would have to provide > its own stub that uses the legacy method and matches the calling > convention of the one the kernel is expected to provide. What about pthread cancellation and the requirement of checking the cancellable syscall anchors in asynchronous cancellation? My plan is still to use musl strategy on glibc (BZ#12683) and for i686 it requires to always use old int$128 for program that uses cancellation (static case) or just threads (dynamic mode, which should be more common on glibc). Using the i686 strategy of a vDSO bridge symbol would require to always fallback to 'sc' to still use the same cancellation strategy (and thus defeating this optimization in such cases). > Note that any libc that actually makes use of the new functionality is > not going to be able to make clobbers conditional on support for it; > branching around different clobbers is going to defeat any gains vs > always just treating anything clobbered by either method as clobbered. > Likewise, it's not useful to have different error return mechanisms > because the caller just has to branch to support both (or the > kernel-provided stub just has to emulate one for it; that could work > if you really want to change the bad existing convention). > > Thoughts? 
> > Rich > From libc-dev at lists.llvm.org Thu Apr 16 08:21:56 2020 From: libc-dev at lists.llvm.org (Jeffrey Walton via libc-dev) Date: Thu, 16 Apr 2020 11:21:56 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1586994952.nnxigedbu2.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> Message-ID: On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin wrote: > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > >> I would like to enable Linux support for the powerpc 'scv' instruction, > >> as a faster system call instruction. > >> > >> This requires two things to be defined: Firstly a way to advertise to > >> userspace that kernel supports scv, and a way to allocate and advertise > >> support for individual scv vectors. Secondly, a calling convention ABI > >> for this new instruction. > >> ... > > Note that any libc that actually makes use of the new functionality is > > not going to be able to make clobbers conditional on support for it; > > branching around different clobbers is going to defeat any gains vs > > always just treating anything clobbered by either method as clobbered. > > Well it would have to test HWCAP and patch in or branch to two > completely different sequences including register save/restores yes. > You could have the same asm and matching clobbers to put the sequence > inline and then you could patch the one sc/scv instruction I suppose. Could GCC function multiversioning work here? https://gcc.gnu.org/wiki/FunctionMultiVersioning It seems like selecting a runtime version of a function is the sort of thing you are trying to do. Jeff From libc-dev at lists.llvm.org Thu Apr 16 08:35:09 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 11:35:09 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <87k12gf32r.fsf@mid.deneb.enyo.de> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> Message-ID: <20200416153509.GT11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > * Rich Felker: > > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. > > The i386 mechanism has received some criticism because it provides an > effective means to redirect execution flow to anyone who can write to > the TCB. I am not sure if it makes sense to copy it. Indeed that's a good point. Do you have ideas for making it equally efficient without use of a function pointer in the TCB? 
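Aside, for illustration (not part of the original message): the i386 AT_SYSINFO mechanism referred to above boils down to an indirect call through a pointer that the runtime stores at a fixed offset in the thread control block. A rough sketch of a three-argument wrapper, with the offset 16 taken from the "call *%%gs:16" form quoted above and with errno decoding omitted, might look like this (a sketch under those assumptions, not how any particular libc spells it):

    static inline long
    vdso_syscall3 (long nr, long a, long b, long c)
    {
      long ret;
      /* %gs points at the TCB; the word at offset 16 holds the address of
         the kernel-provided entry stub (__kernel_vsyscall on i386).  */
      __asm__ volatile ("call *%%gs:16"
                        : "=a" (ret)
                        : "a" (nr), "b" (a), "c" (b), "d" (c)
                        : "memory", "cc");
      return ret;
    }

Because that slot lives in ordinary writable thread memory, anything that can write the TCB can redirect the call, which is the objection raised here; branching on a flag instead, as suggested later in the thread, degrades the worst case to issuing the wrong syscall instruction.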
Rich From libc-dev at lists.llvm.org Thu Apr 16 08:37:56 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 11:37:56 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> Message-ID: <20200416153756.GU11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > > My preference would be that it work just like the i386 AT_SYSINFO > > where you just replace "int $128" with "call *%%gs:16" and the kernel > > provides a stub in the vdso that performs either scv or the old > > mechanism with the same calling convention. Then if the kernel doesn't > > provide it (because the kernel is too old) libc would have to provide > > its own stub that uses the legacy method and matches the calling > > convention of the one the kernel is expected to provide. > > What about pthread cancellation and the requirement of checking the > cancellable syscall anchors in asynchronous cancellation? My plan is > still to use musl strategy on glibc (BZ#12683) and for i686 it > requires to always use old int$128 for program that uses cancellation > (static case) or just threads (dynamic mode, which should be more > common on glibc). > > Using the i686 strategy of a vDSO bridge symbol would require to always > fallback to 'sc' to still use the same cancellation strategy (and > thus defeating this optimization in such cases). Yes, I assumed it would be the same, ignoring the new syscall mechanism for cancellable syscalls. While there are some exceptions, cancellable syscalls are generally not hot paths but things that are expected to block and to have significant amounts of work to do in kernelspace, so saving a few tens of cycles is rather pointless. It's possible to do a branch/multiple versions of the syscall asm for cancellation but would require extending the cancellation handler to support checking against multiple independent address ranges or using some alternate markup of them. Rich From libc-dev at lists.llvm.org Thu Apr 16 08:40:12 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 11:40:12 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> Message-ID: <20200416154012.GV11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 11:21:56AM -0400, Jeffrey Walton wrote: > On Wed, Apr 15, 2020 at 8:17 PM Nicholas Piggin wrote: > > > > Excerpts from Rich Felker's message of April 16, 2020 8:55 am: > > > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote: > > >> I would like to enable Linux support for the powerpc 'scv' instruction, > > >> as a faster system call instruction. > > >> > > >> This requires two things to be defined: Firstly a way to advertise to > > >> userspace that kernel supports scv, and a way to allocate and advertise > > >> support for individual scv vectors. Secondly, a calling convention ABI > > >> for this new instruction. > > >> ... > > > Note that any libc that actually makes use of the new functionality is > > > not going to be able to make clobbers conditional on support for it; > > > branching around different clobbers is going to defeat any gains vs > > > always just treating anything clobbered by either method as clobbered. 
> > > > Well it would have to test HWCAP and patch in or branch to two > > completely different sequences including register save/restores yes. > > You could have the same asm and matching clobbers to put the sequence > > inline and then you could patch the one sc/scv instruction I suppose. > > Could GCC function multiversioning work here? > https://gcc.gnu.org/wiki/FunctionMultiVersioning > > It seems like selecting a runtime version of a function is the sort of > thing you are trying to do. On glibc it potentially could. This is ifunc-based functionality though and musl explicitly does not (and will not) support ifunc because of lots of fundamental problems it entails. But even on glibc the underlying mechanisms for ifunc are just the same as a normal indirect call and there's no real reason to prefer implementing it with ifunc/multiversioning vs directly. Rich From libc-dev at lists.llvm.org Thu Apr 16 09:52:57 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 12:52:57 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <87sgh3e613.fsf@mid.deneb.enyo.de> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> Message-ID: <20200416165257.GY11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 06:42:32PM +0200, Florian Weimer wrote: > * Rich Felker: > > > On Thu, Apr 16, 2020 at 06:48:44AM +0200, Florian Weimer wrote: > >> * Rich Felker: > >> > >> > My preference would be that it work just like the i386 AT_SYSINFO > >> > where you just replace "int $128" with "call *%%gs:16" and the kernel > >> > provides a stub in the vdso that performs either scv or the old > >> > mechanism with the same calling convention. > >> > >> The i386 mechanism has received some criticism because it provides an > >> effective means to redirect execution flow to anyone who can write to > >> the TCB. I am not sure if it makes sense to copy it. > > > > Indeed that's a good point. Do you have ideas for making it equally > > efficient without use of a function pointer in the TCB? > > We could add a shared non-writable mapping at a 64K offset from the > thread pointer and store the function pointer or the code there. Then > it would be safe. > > However, since this is apparently tied to POWER9 and we already have a > POWER9 multilib, and assuming that we are going to backport the kernel > change, I would tweak the selection criterion for that multilib to > include the new HWCAP2 flag. If a user runs this glibc on a kernel > which does not have support, they will get set baseline (POWER8) > multilib, which still works. This way, outside the dynamic loader, no > run-time dispatch is needed at all. I guess this is not at all the > answer you were looking for. 8-) How does this work with -static? :-) > If a single binary is needed, I would perhaps follow what Arm did for > -moutline-atomics: lay out the code so that its easy to execute for > the non-POWER9 case, assuming that POWER9 machines will be better at > predicting things than their predecessors. > > Or you could also put the function pointer into a RELRO segment. Then > there's overlap with the __libc_single_threaded discussion, where > people objected to this kind of optimization (although I did not > propose to change the TCB ABI, that would be required for > __libc_single_threaded because it's an external interface). 
Of course you can use a normal global, but now every call point needs to setup a TOC pointer (= two entry points and more icache lines for otherwise trivial functions). I think my choice would be just making the inline syscall be a single call insn to an asm source file that out-of-lines the loading of TOC pointer and call through it or branch based on hwcap so that it's not repeated all over the place. Alternatively, it would perhaps work to just put hwcap in the TCB and branch on it rather than making an indirect call to a function pointer in the TCB, so that the worst you could do by clobbering it is execute the wrong syscall insn and thereby get SIGILL. Rich From libc-dev at lists.llvm.org Thu Apr 16 10:50:18 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 16 Apr 2020 14:50:18 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416153756.GU11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> Message-ID: <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> On 16/04/2020 12:37, Rich Felker wrote: > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>> My preference would be that it work just like the i386 AT_SYSINFO >>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>> provides a stub in the vdso that performs either scv or the old >>> mechanism with the same calling convention. Then if the kernel doesn't >>> provide it (because the kernel is too old) libc would have to provide >>> its own stub that uses the legacy method and matches the calling >>> convention of the one the kernel is expected to provide. >> >> What about pthread cancellation and the requirement of checking the >> cancellable syscall anchors in asynchronous cancellation? My plan is >> still to use musl strategy on glibc (BZ#12683) and for i686 it >> requires to always use old int$128 for program that uses cancellation >> (static case) or just threads (dynamic mode, which should be more >> common on glibc). >> >> Using the i686 strategy of a vDSO bridge symbol would require to always >> fallback to 'sc' to still use the same cancellation strategy (and >> thus defeating this optimization in such cases). > > Yes, I assumed it would be the same, ignoring the new syscall > mechanism for cancellable syscalls. While there are some exceptions, > cancellable syscalls are generally not hot paths but things that are > expected to block and to have significant amounts of work to do in > kernelspace, so saving a few tens of cycles is rather pointless. > > It's possible to do a branch/multiple versions of the syscall asm for > cancellation but would require extending the cancellation handler to > support checking against multiple independent address ranges or using > some alternate markup of them. The main issue is at least for glibc dynamic linking is way more common than static linking and once the program become multithread the fallback will be always used. And besides the cancellation performance issue, a new bridge vDSO mechanism will still require to setup some extra bridge for the case of the older kernel. In the scheme you suggested: __asm__("indirect call" ... with common clobbers); The indirect call will be either the vDSO bridge or an libc provided that fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain against: if (hwcap & PPC_FEATURE2_SCV) { __asm__(... 
with some clobbers); } else { __asm__(... with different clobbers); } Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a TCB member (as we do on glibc) and if we could make the asm clever enough to not require different clobbers (although not sure if it would be possible). From libc-dev at lists.llvm.org Thu Apr 16 10:59:32 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 13:59:32 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> Message-ID: <20200416175932.GZ11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 12:37, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > >>> My preference would be that it work just like the i386 AT_SYSINFO > >>> where you just replace "int $128" with "call *%%gs:16" and the kernel > >>> provides a stub in the vdso that performs either scv or the old > >>> mechanism with the same calling convention. Then if the kernel doesn't > >>> provide it (because the kernel is too old) libc would have to provide > >>> its own stub that uses the legacy method and matches the calling > >>> convention of the one the kernel is expected to provide. > >> > >> What about pthread cancellation and the requirement of checking the > >> cancellable syscall anchors in asynchronous cancellation? My plan is > >> still to use musl strategy on glibc (BZ#12683) and for i686 it > >> requires to always use old int$128 for program that uses cancellation > >> (static case) or just threads (dynamic mode, which should be more > >> common on glibc). > >> > >> Using the i686 strategy of a vDSO bridge symbol would require to always > >> fallback to 'sc' to still use the same cancellation strategy (and > >> thus defeating this optimization in such cases). > > > > Yes, I assumed it would be the same, ignoring the new syscall > > mechanism for cancellable syscalls. While there are some exceptions, > > cancellable syscalls are generally not hot paths but things that are > > expected to block and to have significant amounts of work to do in > > kernelspace, so saving a few tens of cycles is rather pointless. > > > > It's possible to do a branch/multiple versions of the syscall asm for > > cancellation but would require extending the cancellation handler to > > support checking against multiple independent address ranges or using > > some alternate markup of them. > > The main issue is at least for glibc dynamic linking is way more common > than static linking and once the program become multithread the fallback > will be always used. I'm not relying on static linking optimizing out the cancellable version. I'm talking about how cancellable syscalls are pretty much all "heavy" operations to begin with where a few tens of cycles are in the realm of "measurement noise" relative to the dominating time costs. > And besides the cancellation performance issue, a new bridge vDSO mechanism > will still require to setup some extra bridge for the case of the older > kernel. In the scheme you suggested: > > __asm__("indirect call" ... 
with common clobbers); > > The indirect call will be either the vDSO bridge or an libc provided that > fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > against: > > if (hwcap & PPC_FEATURE2_SCV) { > __asm__(... with some clobbers); > } else { > __asm__(... with different clobbers); > } If the indirect call can be made roughly as efficiently as the sc sequence now (which already have some cost due to handling the nasty error return convention, making the indirect call likely just as small or smaller), it's O(1) additional code size (and thus icache usage) rather than O(n) where n is number of syscall points. Of course it would work just as well (for avoiding O(n) growth) to have a direct call to out-of-line branch like you suggested. > Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a > TCB member (as we do on glibc) and if we could make the asm clever > enough to not require different clobbers (although not sure if > it would be possible). The easy way not to require different clobbers is just using the union of the clobbers, no? Does the proposed new method clobber any call-saved registers that would make it painful (requiring new call frames to save them in)? Rich From libc-dev at lists.llvm.org Thu Apr 16 11:18:42 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 16 Apr 2020 15:18:42 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416175932.GZ11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> Message-ID: <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> On 16/04/2020 14:59, Rich Felker wrote: > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>> provides a stub in the vdso that performs either scv or the old >>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>> provide it (because the kernel is too old) libc would have to provide >>>>> its own stub that uses the legacy method and matches the calling >>>>> convention of the one the kernel is expected to provide. >>>> >>>> What about pthread cancellation and the requirement of checking the >>>> cancellable syscall anchors in asynchronous cancellation? My plan is >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>> requires to always use old int$128 for program that uses cancellation >>>> (static case) or just threads (dynamic mode, which should be more >>>> common on glibc). >>>> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>> fallback to 'sc' to still use the same cancellation strategy (and >>>> thus defeating this optimization in such cases). >>> >>> Yes, I assumed it would be the same, ignoring the new syscall >>> mechanism for cancellable syscalls. While there are some exceptions, >>> cancellable syscalls are generally not hot paths but things that are >>> expected to block and to have significant amounts of work to do in >>> kernelspace, so saving a few tens of cycles is rather pointless. 
>>> >>> It's possible to do a branch/multiple versions of the syscall asm for >>> cancellation but would require extending the cancellation handler to >>> support checking against multiple independent address ranges or using >>> some alternate markup of them. >> >> The main issue is at least for glibc dynamic linking is way more common >> than static linking and once the program become multithread the fallback >> will be always used. > > I'm not relying on static linking optimizing out the cancellable > version. I'm talking about how cancellable syscalls are pretty much > all "heavy" operations to begin with where a few tens of cycles are in > the realm of "measurement noise" relative to the dominating time > costs. Yes I am aware, but at same time I am not sure how it plays on real world. For instance, some workloads might issue kernel query syscalls, such as recv, where buffer copying might not be dominant factor. So I see that if the idea is optimizing syscall mechanism, we should try to leverage it as whole in libc. > >> And besides the cancellation performance issue, a new bridge vDSO mechanism >> will still require to setup some extra bridge for the case of the older >> kernel. In the scheme you suggested: >> >> __asm__("indirect call" ... with common clobbers); >> >> The indirect call will be either the vDSO bridge or an libc provided that >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >> against: >> >> if (hwcap & PPC_FEATURE2_SCV) { >> __asm__(... with some clobbers); >> } else { >> __asm__(... with different clobbers); >> } > > If the indirect call can be made roughly as efficiently as the sc > sequence now (which already have some cost due to handling the nasty > error return convention, making the indirect call likely just as small > or smaller), it's O(1) additional code size (and thus icache usage) > rather than O(n) where n is number of syscall points. > > Of course it would work just as well (for avoiding O(n) growth) to > have a direct call to out-of-line branch like you suggested. Yes, but does it really matter to optimize this specific usage case for size? glibc, for instance, tries to leverage the syscall mechanism by adding some complex pre-processor asm directives. It optimizes the syscall code size in most cases. For instance, kill in static case generates on x86_64: 0000000000000000 <__kill>: 0: b8 3e 00 00 00 mov $0x3e,%eax 5: 0f 05 syscall 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> 13: c3 retq While on musl: 0000000000000000 : 0: 48 83 ec 08 sub $0x8,%rsp 4: 48 63 ff movslq %edi,%rdi 7: 48 63 f6 movslq %esi,%rsi a: b8 3e 00 00 00 mov $0x3e,%eax f: 0f 05 syscall 11: 48 89 c7 mov %rax,%rdi 14: e8 00 00 00 00 callq 19 19: 5a pop %rdx 1a: c3 retq But I hardly think it pays off the required code complexity. Some for providing a O(1) bridge: this will require additional complexity to write it and setup correctly. > >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a >> TCB member (as we do on glibc) and if we could make the asm clever >> enough to not require different clobbers (although not sure if >> it would be possible). > > The easy way not to require different clobbers is just using the union > of the clobbers, no? Does the proposed new method clobber any > call-saved registers that would make it painful (requiring new call > frames to save them in)? As far I can tell, it should be ok. 
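Aside, for illustration (not part of the original message): one way to avoid two asm statements with different clobbers is a single asm that branches on a hwcap copy and declares the union of the 'sc' and 'scv 0' clobber sets. The register conventions below follow the proposal quoted earlier (scv 0: LR clobbered, cr1/cr5-cr7 volatile, error as a negative value in r3) and deliberately over-approximate the rest; the sc error result is normalized to the scv convention. This is a sketch under those assumptions, not a reference implementation, and the 'scv' mnemonic needs an ISA 3.0-aware assembler:

    static inline long
    syscall1_either (long nr, long arg, long has_scv)
    {
      register long r0 __asm__ ("r0") = nr;    /* syscall number */
      register long r3 __asm__ ("r3") = arg;   /* first argument / return value */
      __asm__ volatile ("cmpldi %2,0\n\t"
                        "beq    0f\n\t"
                        "scv    0\n\t"         /* error: r3 = -errno */
                        "b      1f\n"
                        "0:\tsc\n\t"           /* error: cr0.SO set, r3 = errno */
                        "bns    1f\n\t"
                        "neg    %1,%1\n"       /* normalize to the scv convention */
                        "1:"
                        : "+r" (r0), "+r" (r3)
                        : "r" (has_scv)
                        : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
                          "lr", "ctr", "xer", "cr0", "cr1", "cr5", "cr6", "cr7",
                          "memory");
      return r3;   /* negative errno on failure, either way */
    }

A real implementation would presumably load has_scv from a TCB field or a cached hwcap word rather than passing it as an argument, as discussed elsewhere in the thread.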
From libc-dev at lists.llvm.org Thu Apr 16 11:31:51 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 14:31:51 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> Message-ID: <20200416183151.GA11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: > > > On 16/04/2020 14:59, Rich Felker wrote: > > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 16/04/2020 12:37, Rich Felker wrote: > >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: > >>>>> My preference would be that it work just like the i386 AT_SYSINFO > >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel > >>>>> provides a stub in the vdso that performs either scv or the old > >>>>> mechanism with the same calling convention. Then if the kernel doesn't > >>>>> provide it (because the kernel is too old) libc would have to provide > >>>>> its own stub that uses the legacy method and matches the calling > >>>>> convention of the one the kernel is expected to provide. > >>>> > >>>> What about pthread cancellation and the requirement of checking the > >>>> cancellable syscall anchors in asynchronous cancellation? My plan is > >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it > >>>> requires to always use old int$128 for program that uses cancellation > >>>> (static case) or just threads (dynamic mode, which should be more > >>>> common on glibc). > >>>> > >>>> Using the i686 strategy of a vDSO bridge symbol would require to always > >>>> fallback to 'sc' to still use the same cancellation strategy (and > >>>> thus defeating this optimization in such cases). > >>> > >>> Yes, I assumed it would be the same, ignoring the new syscall > >>> mechanism for cancellable syscalls. While there are some exceptions, > >>> cancellable syscalls are generally not hot paths but things that are > >>> expected to block and to have significant amounts of work to do in > >>> kernelspace, so saving a few tens of cycles is rather pointless. > >>> > >>> It's possible to do a branch/multiple versions of the syscall asm for > >>> cancellation but would require extending the cancellation handler to > >>> support checking against multiple independent address ranges or using > >>> some alternate markup of them. > >> > >> The main issue is at least for glibc dynamic linking is way more common > >> than static linking and once the program become multithread the fallback > >> will be always used. > > > > I'm not relying on static linking optimizing out the cancellable > > version. I'm talking about how cancellable syscalls are pretty much > > all "heavy" operations to begin with where a few tens of cycles are in > > the realm of "measurement noise" relative to the dominating time > > costs. > > Yes I am aware, but at same time I am not sure how it plays on real world. > For instance, some workloads might issue kernel query syscalls, such as > recv, where buffer copying might not be dominant factor. So I see that if > the idea is optimizing syscall mechanism, we should try to leverage it > as whole in libc. 
Have you timed a minimal recv? I'm not assuming buffer copying is the dominant factor. I'm assuming the overhead of all the kernel layers involved is dominant. > >> And besides the cancellation performance issue, a new bridge vDSO mechanism > >> will still require to setup some extra bridge for the case of the older > >> kernel. In the scheme you suggested: > >> > >> __asm__("indirect call" ... with common clobbers); > >> > >> The indirect call will be either the vDSO bridge or an libc provided that > >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain > >> against: > >> > >> if (hwcap & PPC_FEATURE2_SCV) { > >> __asm__(... with some clobbers); > >> } else { > >> __asm__(... with different clobbers); > >> } > > > > If the indirect call can be made roughly as efficiently as the sc > > sequence now (which already have some cost due to handling the nasty > > error return convention, making the indirect call likely just as small > > or smaller), it's O(1) additional code size (and thus icache usage) > > rather than O(n) where n is number of syscall points. > > > > Of course it would work just as well (for avoiding O(n) growth) to > > have a direct call to out-of-line branch like you suggested. > > Yes, but does it really matter to optimize this specific usage case > for size? glibc, for instance, tries to leverage the syscall mechanism > by adding some complex pre-processor asm directives. It optimizes > the syscall code size in most cases. For instance, kill in static case > generates on x86_64: > > 0000000000000000 <__kill>: > 0: b8 3e 00 00 00 mov $0x3e,%eax > 5: 0f 05 syscall > 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > 13: c3 retq > > While on musl: > > 0000000000000000 : > 0: 48 83 ec 08 sub $0x8,%rsp > 4: 48 63 ff movslq %edi,%rdi > 7: 48 63 f6 movslq %esi,%rsi > a: b8 3e 00 00 00 mov $0x3e,%eax > f: 0f 05 syscall > 11: 48 89 c7 mov %rax,%rdi > 14: e8 00 00 00 00 callq 19 > 19: 5a pop %rdx > 1a: c3 retq Wow that's some extraordinarily bad codegen going on by gcc... The sign-extension is semantically needed and I don't see a good way around it (glibc's asm is kinda a hack taking advantage of kernel not looking at high bits, I think), but the gratuitous stack adjustment and refusal to generate a tail call isn't. I'll see if we can track down what's going on and get it fixed. > But I hardly think it pays off the required code complexity. Some > for providing a O(1) bridge: this will require additional complexity > to write it and setup correctly. In some sense I agree, but inline instructions are a lot more expensive on ppc (being 32-bit each), and it might take out-of-lining anyway to get rid of stack frame setups if that ends up being a problem. > >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a > >> TCB member (as we do on glibc) and if we could make the asm clever > >> enough to not require different clobbers (although not sure if > >> it would be possible). > > > > The easy way not to require different clobbers is just using the union > > of the clobbers, no? Does the proposed new method clobber any > > call-saved registers that would make it painful (requiring new call > > frames to save them in)? > > As far I can tell, it should be ok. Note that because lr is clobbered we need at least once normally call-clobbered register that's not syscall clobbered to save lr in. Otherwise stack frame setup is required to spill it. 
(And I'm not even sure if gcc does things right to avoid it by using a register -- we should check that I guess...) Rich From libc-dev at lists.llvm.org Thu Apr 16 11:44:51 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 14:44:51 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416183151.GA11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> Message-ID: <20200416184451.GB11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 02:31:51PM -0400, Rich Felker wrote: > > While on musl: > > > > 0000000000000000 : > > 0: 48 83 ec 08 sub $0x8,%rsp > > 4: 48 63 ff movslq %edi,%rdi > > 7: 48 63 f6 movslq %esi,%rsi > > a: b8 3e 00 00 00 mov $0x3e,%eax > > f: 0f 05 syscall > > 11: 48 89 c7 mov %rax,%rdi > > 14: e8 00 00 00 00 callq 19 > > 19: 5a pop %rdx > > 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. It seems to be https://gcc.gnu.org/bugzilla/show_bug.cgi?id=14441 which I've updated with a comment about the above. Rich From libc-dev at lists.llvm.org Thu Apr 16 11:52:47 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 16 Apr 2020 15:52:47 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416183151.GA11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> Message-ID: <65f70b10-bfc1-e9f6-d48a-4b063ad6b669@linaro.org> On 16/04/2020 15:31, Rich Felker wrote: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >>>> >>>> >>>> On 16/04/2020 12:37, Rich Felker wrote: >>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>>>> provides a stub in the vdso that performs either scv or the old >>>>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>>>> provide it (because the kernel is too old) libc would have to provide >>>>>>> its own stub that uses the legacy method and matches the calling >>>>>>> convention of the one the kernel is expected to provide. >>>>>> >>>>>> What about pthread cancellation and the requirement of checking the >>>>>> cancellable syscall anchors in asynchronous cancellation? 
My plan is >>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>>>> requires to always use old int$128 for program that uses cancellation >>>>>> (static case) or just threads (dynamic mode, which should be more >>>>>> common on glibc). >>>>>> >>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>>>> fallback to 'sc' to still use the same cancellation strategy (and >>>>>> thus defeating this optimization in such cases). >>>>> >>>>> Yes, I assumed it would be the same, ignoring the new syscall >>>>> mechanism for cancellable syscalls. While there are some exceptions, >>>>> cancellable syscalls are generally not hot paths but things that are >>>>> expected to block and to have significant amounts of work to do in >>>>> kernelspace, so saving a few tens of cycles is rather pointless. >>>>> >>>>> It's possible to do a branch/multiple versions of the syscall asm for >>>>> cancellation but would require extending the cancellation handler to >>>>> support checking against multiple independent address ranges or using >>>>> some alternate markup of them. >>>> >>>> The main issue is at least for glibc dynamic linking is way more common >>>> than static linking and once the program become multithread the fallback >>>> will be always used. >>> >>> I'm not relying on static linking optimizing out the cancellable >>> version. I'm talking about how cancellable syscalls are pretty much >>> all "heavy" operations to begin with where a few tens of cycles are in >>> the realm of "measurement noise" relative to the dominating time >>> costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. Not really, but the description of the advantages of using 'scv' over 'sc' also does not outline the real expected gain. Taking into consideration that this should be a micro-optimization (focused on the syscall entry patch), I think we should use it where possible. > >>>> And besides the cancellation performance issue, a new bridge vDSO mechanism >>>> will still require to setup some extra bridge for the case of the older >>>> kernel. In the scheme you suggested: >>>> >>>> __asm__("indirect call" ... with common clobbers); >>>> >>>> The indirect call will be either the vDSO bridge or an libc provided that >>>> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >>>> against: >>>> >>>> if (hwcap & PPC_FEATURE2_SCV) { >>>> __asm__(... with some clobbers); >>>> } else { >>>> __asm__(... with different clobbers); >>>> } >>> >>> If the indirect call can be made roughly as efficiently as the sc >>> sequence now (which already have some cost due to handling the nasty >>> error return convention, making the indirect call likely just as small >>> or smaller), it's O(1) additional code size (and thus icache usage) >>> rather than O(n) where n is number of syscall points. >>> >>> Of course it would work just as well (for avoiding O(n) growth) to >>> have a direct call to out-of-line branch like you suggested. Yes, but does it really matter to optimize this specific usage case for size?
glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. For instance, kill in static case >> generates on x86_64: >> >> 0000000000000000 <__kill>: >> 0: b8 3e 00 00 00 mov $0x3e,%eax >> 5: 0f 05 syscall >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> 13: c3 retq >> >> While on musl: >> >> 0000000000000000 : >> 0: 48 83 ec 08 sub $0x8,%rsp >> 4: 48 63 ff movslq %edi,%rdi >> 7: 48 63 f6 movslq %esi,%rsi >> a: b8 3e 00 00 00 mov $0x3e,%eax >> f: 0f 05 syscall >> 11: 48 89 c7 mov %rax,%rdi >> 14: e8 00 00 00 00 callq 19 >> 19: 5a pop %rdx >> 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. Wrt glibc, it is most likely and it has bitten us on x32 port recently (where some types were being passed correctly). In any case, my long term plan to also get rid of this nasty assembly pre-processor on syscall passing. > >> But I hardly think it pays off the required code complexity. Some >> for providing a O(1) bridge: this will require additional complexity >> to write it and setup correctly. > > In some sense I agree, but inline instructions are a lot more > expensive on ppc (being 32-bit each), and it might take out-of-lining > anyway to get rid of stack frame setups if that ends up being a > problem. Indeed, I didn't started to prototype what would be required to make this change on glibc. Maybe an out-of-line helper might make sense. > >>>> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a >>>> TCB member (as we do on glibc) and if we could make the asm clever >>>> enough to not require different clobbers (although not sure if >>>> it would be possible). >>> >>> The easy way not to require different clobbers is just using the union >>> of the clobbers, no? Does the proposed new method clobber any >>> call-saved registers that would make it painful (requiring new call >>> frames to save them in)? >> >> As far I can tell, it should be ok. > > Note that because lr is clobbered we need at least once normally > call-clobbered register that's not syscall clobbered to save lr in. > Otherwise stack frame setup is required to spill it. (And I'm not even > sure if gcc does things right to avoid it by using a register -- we > should check that I guess...) If I recall correctly Florian has found some issue in lr clobbering. 
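Aside, for illustration (not part of the original message): the 'hwcap & PPC_FEATURE2_SCV' test mentioned above can be hoisted to a one-time check whose result is cached somewhere cheap to reach (glibc keeps a hwcap copy reachable from the TCB; a plain TLS variable is used below only to keep the sketch simple). getauxval, AT_HWCAP2 and PPC_FEATURE2_SCV are existing interfaces; the variable and function names are invented for this sketch:

    #include <stdbool.h>
    #include <sys/auxv.h>
    #include <asm/cputable.h>      /* PPC_FEATURE2_SCV; powerpc-only header */

    static __thread bool use_scv;  /* illustrative stand-in for a TCB field */

    static void
    init_syscall_mechanism (void)  /* imagine this running at startup */
    {
      use_scv = (getauxval (AT_HWCAP2) & PPC_FEATURE2_SCV) != 0;
    }

    /* At each syscall site the selection is then just a load and a branch,
       e.g. with hypothetical helpers:
         long ret = use_scv ? do_syscall_scv (nr, a0) : do_syscall_sc (nr, a0);
       Clobbering the cached flag can at worst make libc issue the wrong
       syscall instruction (SIGILL), unlike clobbering a function pointer.  */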
From libc-dev at lists.llvm.org Fri Apr 17 01:34:57 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Fri, 17 Apr 2020 10:34:57 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200417014831.GL26902@gate.crashing.org> (Segher Boessenkool's message of "Thu, 16 Apr 2020 20:48:31 -0500") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> <20200416165257.GY11469@brightrain.aerifal.cx> <87ftd3e1vg.fsf@mid.deneb.enyo.de> <20200416230235.GG26902@gate.crashing.org> <20200417003442.GD11469@brightrain.aerifal.cx> <20200417014831.GL26902@gate.crashing.org> Message-ID: <87d086cxxq.fsf@mid.deneb.enyo.de> * Segher Boessenkool: > On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: >> > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: >> > > > I think my choice would be just making the inline syscall be a single >> > > > call insn to an asm source file that out-of-lines the loading of TOC >> > > > pointer and call through it or branch based on hwcap so that it's not >> > > > repeated all over the place. >> > > >> > > I don't know how problematic control flow out of an inline asm is on >> > > POWER. But this is basically the -moutline-atomics approach. >> > >> > Control flow out of inline asm (other than with "asm goto") is not >> > allowed at all, just like on any other target (and will not work in >> > practice, either -- just like on any other target). But the suggestion >> > was to use actual assembler code, not inline asm? >> >> Calling it control flow out of inline asm is something of a misnomer. >> The enclosing state is not discarded or altered; the asm statement >> exits normally, reaching the next instruction in the enclosing >> block/function as soon as the call from the asm statement returns, >> with all register/clobber constraints satisfied. > > Ah. That should always Just Work, then -- our ABIs guarantee you can. After thinking about it, I agree: GCC will handle spilling of the link register. Branch-and-link instructions do not clobber the protected zone, so no stack adjustment is needed (which would be problematic to reflect in the unwind information). Of course, the target function has to be written in assembler because it must not use a regular stack frame. From libc-dev at lists.llvm.org Thu Apr 16 16:02:35 2020 From: libc-dev at lists.llvm.org (Segher Boessenkool via libc-dev) Date: Thu, 16 Apr 2020 18:02:35 -0500 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <87ftd3e1vg.fsf@mid.deneb.enyo.de> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> <20200416165257.GY11469@brightrain.aerifal.cx> <87ftd3e1vg.fsf@mid.deneb.enyo.de> Message-ID: <20200416230235.GG26902@gate.crashing.org> On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > I think my choice would be just making the inline syscall be a single > > call insn to an asm source file that out-of-lines the loading of TOC > > pointer and call through it or branch based on hwcap so that it's not > > repeated all over the place. 
> > I don't know how problematic control flow out of an inline asm is on > POWER. But this is basically the -moutline-atomics approach. Control flow out of inline asm (other than with "asm goto") is not allowed at all, just like on any other target (and will not work in practice, either -- just like on any other target). But the suggestion was to use actual assembler code, not inline asm? Segher From libc-dev at lists.llvm.org Thu Apr 16 17:34:42 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 16 Apr 2020 20:34:42 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416230235.GG26902@gate.crashing.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> <20200416165257.GY11469@brightrain.aerifal.cx> <87ftd3e1vg.fsf@mid.deneb.enyo.de> <20200416230235.GG26902@gate.crashing.org> Message-ID: <20200417003442.GD11469@brightrain.aerifal.cx> On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > I think my choice would be just making the inline syscall be a single > > > call insn to an asm source file that out-of-lines the loading of TOC > > > pointer and call through it or branch based on hwcap so that it's not > > > repeated all over the place. > > > > I don't know how problematic control flow out of an inline asm is on > > POWER. But this is basically the -moutline-atomics approach. > > Control flow out of inline asm (other than with "asm goto") is not > allowed at all, just like on any other target (and will not work in > practice, either -- just like on any other target). But the suggestion > was to use actual assembler code, not inline asm? Calling it control flow out of inline asm is something of a misnomer. The enclosing state is not discarded or altered; the asm statement exits normally, reaching the next instruction in the enclosing block/function as soon as the call from the asm statement returns, with all register/clobber constraints satisfied. Control flow out of inline asm would be more like longjmp, and it can be valid -- for instance, you can implement coroutines this way (assuming you switch stack correctly) or do longjmp this way (jumping to the location saved by setjmp). But it's not what'd be happening here. 
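Aside, for illustration (not part of the original message): the call-site shape being described, a single call instruction inside the asm with everything else out-of-lined, might look like the following. __ppc_syscall_select is an invented name for a hand-written assembly stub inside libc that tests the cached hwcap bit and issues scv 0 or sc, preserving everything except the declared clobbers; it is assumed to be a local, same-TOC symbol so no TOC save/restore is needed around the bl, and the clobber list is again the over-approximated union:

    static inline long
    syscall1_outline (long nr, long arg)
    {
      register long r0 __asm__ ("r0") = nr;
      register long r3 __asm__ ("r3") = arg;
      /* The asm exits normally once the stub returns, so this is not control
         flow out of inline asm; gcc spills LR because it is listed as
         clobbered.  */
      __asm__ volatile ("bl __ppc_syscall_select"
                        : "+r" (r0), "+r" (r3)
                        :
                        : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
                          "lr", "ctr", "xer", "cr0", "cr1", "cr5", "cr6", "cr7",
                          "memory");
      return r3;
    }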
Rich From libc-dev at lists.llvm.org Thu Apr 16 18:48:31 2020 From: libc-dev at lists.llvm.org (Segher Boessenkool via libc-dev) Date: Thu, 16 Apr 2020 20:48:31 -0500 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200417003442.GD11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <87k12gf32r.fsf@mid.deneb.enyo.de> <20200416153509.GT11469@brightrain.aerifal.cx> <87sgh3e613.fsf@mid.deneb.enyo.de> <20200416165257.GY11469@brightrain.aerifal.cx> <87ftd3e1vg.fsf@mid.deneb.enyo.de> <20200416230235.GG26902@gate.crashing.org> <20200417003442.GD11469@brightrain.aerifal.cx> Message-ID: <20200417014831.GL26902@gate.crashing.org> On Thu, Apr 16, 2020 at 08:34:42PM -0400, Rich Felker wrote: > On Thu, Apr 16, 2020 at 06:02:35PM -0500, Segher Boessenkool wrote: > > On Thu, Apr 16, 2020 at 08:12:19PM +0200, Florian Weimer wrote: > > > > I think my choice would be just making the inline syscall be a single > > > > call insn to an asm source file that out-of-lines the loading of TOC > > > > pointer and call through it or branch based on hwcap so that it's not > > > > repeated all over the place. > > > > > > I don't know how problematic control flow out of an inline asm is on > > > POWER. But this is basically the -moutline-atomics approach. > > > > Control flow out of inline asm (other than with "asm goto") is not > > allowed at all, just like on any other target (and will not work in > > practice, either -- just like on any other target). But the suggestion > > was to use actual assembler code, not inline asm? > > Calling it control flow out of inline asm is something of a misnomer. > The enclosing state is not discarded or altered; the asm statement > exits normally, reaching the next instruction in the enclosing > block/function as soon as the call from the asm statement returns, > with all register/clobber constraints satisfied. Ah. That should always Just Work, then -- our ABIs guarantee you can. > Control flow out of inline asm would be more like longjmp, and it can > be valid -- for instance, you can implement coroutines this way > (assuming you switch stack correctly) or do longjmp this way (jumping > to the location saved by setjmp). But it's not what'd be happening > here. Yeah, you cannot do that in C, not without making assumptions about what machine code the compiler generates. GCC explicitly disallows it, too: 'asm' statements may not perform jumps into other 'asm' statements, only to the listed GOTOLABELS. GCC's optimizers do not know about other jumps; therefore they cannot take account of them when deciding how to optimize. Segher From libc-dev at lists.llvm.org Sun Apr 19 17:27:58 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 10:27:58 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416095800.GC23945@port70.net> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416095800.GC23945@port70.net> Message-ID: <1587341904.1r83vbudyf.astroid@bobo.none> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: > * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]: >> Well it would have to test HWCAP and patch in or branch to two >> completely different sequences including register save/restores yes. 
>> You could have the same asm and matching clobbers to put the sequence >> inline and then you could patch the one sc/scv instruction I suppose. > > how would that 'patch' work? > > there are many reasons why you don't > want libc to write its .text I guess I don't know what I'm talking about when it comes to libraries. Shame if there is no good way to load-time patch libc. It's orthogonal to the scv selection though -- if you don't patch you have to conditional or indirect branch however you implement it. Thanks, Nick From libc-dev at lists.llvm.org Sun Apr 19 17:46:45 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 10:46:45 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <65f70b10-bfc1-e9f6-d48a-4b063ad6b669@linaro.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <65f70b10-bfc1-e9f6-d48a-4b063ad6b669@linaro.org> Message-ID: <1587342668.1krc7b5v5v.astroid@bobo.none> Excerpts from Adhemerval Zanella's message of April 17, 2020 4:52 am: > > > On 16/04/2020 15:31, Rich Felker wrote: >> On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >>> >>> >>> On 16/04/2020 14:59, Rich Felker wrote: >>>> On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >>>>> >>>>> >>>>> On 16/04/2020 12:37, Rich Felker wrote: >>>>>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >>>>>>>> My preference would be that it work just like the i386 AT_SYSINFO >>>>>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >>>>>>>> provides a stub in the vdso that performs either scv or the old >>>>>>>> mechanism with the same calling convention. Then if the kernel doesn't >>>>>>>> provide it (because the kernel is too old) libc would have to provide >>>>>>>> its own stub that uses the legacy method and matches the calling >>>>>>>> convention of the one the kernel is expected to provide. >>>>>>> >>>>>>> What about pthread cancellation and the requirement of checking the >>>>>>> cancellable syscall anchors in asynchronous cancellation? My plan is >>>>>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >>>>>>> requires to always use old int$128 for program that uses cancellation >>>>>>> (static case) or just threads (dynamic mode, which should be more >>>>>>> common on glibc). >>>>>>> >>>>>>> Using the i686 strategy of a vDSO bridge symbol would require to always >>>>>>> fallback to 'sc' to still use the same cancellation strategy (and >>>>>>> thus defeating this optimization in such cases). >>>>>> >>>>>> Yes, I assumed it would be the same, ignoring the new syscall >>>>>> mechanism for cancellable syscalls. While there are some exceptions, >>>>>> cancellable syscalls are generally not hot paths but things that are >>>>>> expected to block and to have significant amounts of work to do in >>>>>> kernelspace, so saving a few tens of cycles is rather pointless. >>>>>> >>>>>> It's possible to do a branch/multiple versions of the syscall asm for >>>>>> cancellation but would require extending the cancellation handler to >>>>>> support checking against multiple independent address ranges or using >>>>>> some alternate markup of them. 
>>>>> >>>>> The main issue is at least for glibc dynamic linking is way more common >>>>> than static linking and once the program become multithread the fallback >>>>> will be always used. >>>> >>>> I'm not relying on static linking optimizing out the cancellable >>>> version. I'm talking about how cancellable syscalls are pretty much >>>> all "heavy" operations to begin with where a few tens of cycles are in >>>> the realm of "measurement noise" relative to the dominating time >>>> costs. >>> >>> Yes I am aware, but at same time I am not sure how it plays on real world. >>> For instance, some workloads might issue kernel query syscalls, such as >>> recv, where buffer copying might not be dominant factor. So I see that if >>> the idea is optimizing syscall mechanism, we should try to leverage it >>> as whole in libc. >> >> Have you timed a minimal recv? I'm not assuming buffer copying is the >> dominant factor. I'm assuming the overhead of all the kernel layers >> involved is dominant. > > Not really, but reading the advantages of using 'scv' over 'sc' also does > not outline the real expect gain. Taking in consideration this should > be a micro-optimization (focused on entry syscall patch), I think we should > use where it possible. It's around 90 cycles improvement, depending on config options and speculative mitigations in place, this may be roughly 5-20% of a gettid syscall, which itself probably bears little relationship to what a recv syscall doing real work would do, it's easy to swamp it with other work. But it's a pretty big win in terms of how much we try to optimise this path. Thanks, Nick From libc-dev at lists.llvm.org Sun Apr 19 18:10:25 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 11:10:25 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200416183151.GA11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> Message-ID: <1587344003.daumxvs1kh.astroid@bobo.none> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > On Thu, Apr 16, 2020 at 03:18:42PM -0300, Adhemerval Zanella wrote: >> >> >> On 16/04/2020 14:59, Rich Felker wrote: >> > On Thu, Apr 16, 2020 at 02:50:18PM -0300, Adhemerval Zanella wrote: >> >> >> >> >> >> On 16/04/2020 12:37, Rich Felker wrote: >> >>> On Thu, Apr 16, 2020 at 11:16:04AM -0300, Adhemerval Zanella wrote: >> >>>>> My preference would be that it work just like the i386 AT_SYSINFO >> >>>>> where you just replace "int $128" with "call *%%gs:16" and the kernel >> >>>>> provides a stub in the vdso that performs either scv or the old >> >>>>> mechanism with the same calling convention. Then if the kernel doesn't >> >>>>> provide it (because the kernel is too old) libc would have to provide >> >>>>> its own stub that uses the legacy method and matches the calling >> >>>>> convention of the one the kernel is expected to provide. >> >>>> >> >>>> What about pthread cancellation and the requirement of checking the >> >>>> cancellable syscall anchors in asynchronous cancellation? 
My plan is >> >>>> still to use musl strategy on glibc (BZ#12683) and for i686 it >> >>>> requires to always use old int$128 for program that uses cancellation >> >>>> (static case) or just threads (dynamic mode, which should be more >> >>>> common on glibc). >> >>>> >> >>>> Using the i686 strategy of a vDSO bridge symbol would require to always >> >>>> fallback to 'sc' to still use the same cancellation strategy (and >> >>>> thus defeating this optimization in such cases). >> >>> >> >>> Yes, I assumed it would be the same, ignoring the new syscall >> >>> mechanism for cancellable syscalls. While there are some exceptions, >> >>> cancellable syscalls are generally not hot paths but things that are >> >>> expected to block and to have significant amounts of work to do in >> >>> kernelspace, so saving a few tens of cycles is rather pointless. >> >>> >> >>> It's possible to do a branch/multiple versions of the syscall asm for >> >>> cancellation but would require extending the cancellation handler to >> >>> support checking against multiple independent address ranges or using >> >>> some alternate markup of them. >> >> >> >> The main issue is at least for glibc dynamic linking is way more common >> >> than static linking and once the program become multithread the fallback >> >> will be always used. >> > >> > I'm not relying on static linking optimizing out the cancellable >> > version. I'm talking about how cancellable syscalls are pretty much >> > all "heavy" operations to begin with where a few tens of cycles are in >> > the realm of "measurement noise" relative to the dominating time >> > costs. >> >> Yes I am aware, but at same time I am not sure how it plays on real world. >> For instance, some workloads might issue kernel query syscalls, such as >> recv, where buffer copying might not be dominant factor. So I see that if >> the idea is optimizing syscall mechanism, we should try to leverage it >> as whole in libc. > > Have you timed a minimal recv? I'm not assuming buffer copying is the > dominant factor. I'm assuming the overhead of all the kernel layers > involved is dominant. > >> >> And besides the cancellation performance issue, a new bridge vDSO mechanism >> >> will still require to setup some extra bridge for the case of the older >> >> kernel. In the scheme you suggested: >> >> >> >> __asm__("indirect call" ... with common clobbers); >> >> >> >> The indirect call will be either the vDSO bridge or an libc provided that >> >> fallback to 'sc' for !PPC_FEATURE2_SCV. I am not this is really a gain >> >> against: >> >> >> >> if (hwcap & PPC_FEATURE2_SCV) { >> >> __asm__(... with some clobbers); >> >> } else { >> >> __asm__(... with different clobbers); >> >> } >> > >> > If the indirect call can be made roughly as efficiently as the sc >> > sequence now (which already have some cost due to handling the nasty >> > error return convention, making the indirect call likely just as small >> > or smaller), it's O(1) additional code size (and thus icache usage) >> > rather than O(n) where n is number of syscall points. >> > >> > Of course it would work just as well (for avoiding O(n) growth) to >> > have a direct call to out-of-line branch like you suggested. >> >> Yes, but does it really matter to optimize this specific usage case >> for size? glibc, for instance, tries to leverage the syscall mechanism >> by adding some complex pre-processor asm directives. It optimizes >> the syscall code size in most cases. 
For instance, kill in static case >> generates on x86_64: >> >> 0000000000000000 <__kill>: >> 0: b8 3e 00 00 00 mov $0x3e,%eax >> 5: 0f 05 syscall >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> 13: c3 retq >> >> While on musl: >> >> 0000000000000000 : >> 0: 48 83 ec 08 sub $0x8,%rsp >> 4: 48 63 ff movslq %edi,%rdi >> 7: 48 63 f6 movslq %esi,%rsi >> a: b8 3e 00 00 00 mov $0x3e,%eax >> f: 0f 05 syscall >> 11: 48 89 c7 mov %rax,%rdi >> 14: e8 00 00 00 00 callq 19 >> 19: 5a pop %rdx >> 1a: c3 retq > > Wow that's some extraordinarily bad codegen going on by gcc... The > sign-extension is semantically needed and I don't see a good way > around it (glibc's asm is kinda a hack taking advantage of kernel not > looking at high bits, I think), but the gratuitous stack adjustment > and refusal to generate a tail call isn't. I'll see if we can track > down what's going on and get it fixed. > >> But I hardly think it pays off the required code complexity. Some >> for providing a O(1) bridge: this will require additional complexity >> to write it and setup correctly. > > In some sense I agree, but inline instructions are a lot more > expensive on ppc (being 32-bit each), and it might take out-of-lining > anyway to get rid of stack frame setups if that ends up being a > problem. > >> >> Specially if 'hwcap & PPC_FEATURE2_SCV' could be optimized with a >> >> TCB member (as we do on glibc) and if we could make the asm clever >> >> enough to not require different clobbers (although not sure if >> >> it would be possible). >> > >> > The easy way not to require different clobbers is just using the union >> > of the clobbers, no? Does the proposed new method clobber any >> > call-saved registers that would make it painful (requiring new call >> > frames to save them in)? >> >> As far I can tell, it should be ok. > > Note that because lr is clobbered we need at least once normally > call-clobbered register that's not syscall clobbered to save lr in. > Otherwise stack frame setup is required to spill it. The kernel would like to use r9-r12 for itself. We could do with fewer registers, but we have some delay establishing the stack (depends on a load which depends on a mfspr), and entry code tends to be quite store heavy whereas on the caller side you have r1 set up (modulo stack updates), and the system call is a long delay during which time the store queue has significant time to drain. My feeling is it would be better for kernel to have these scratch registers. Thanks, Nick From libc-dev at lists.llvm.org Sun Apr 19 18:29:04 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sun, 19 Apr 2020 21:29:04 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587341904.1r83vbudyf.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416095800.GC23945@port70.net> <1587341904.1r83vbudyf.astroid@bobo.none> Message-ID: <20200420012904.GY11469@brightrain.aerifal.cx> On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote: > Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: > > * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]: > >> Well it would have to test HWCAP and patch in or branch to two > >> completely different sequences including register save/restores yes. 
> >> You could have the same asm and matching clobbers to put the sequence > >> inline and then you could patch the one sc/scv instruction I suppose. > > > > how would that 'patch' work? > > > > there are many reasons why you don't > > want libc to write its .text > > I guess I don't know what I'm talking about when it comes to libraries. > Shame if there is no good way to load-time patch libc. It's orthogonal > to the scv selection though -- if you don't patch you have to > conditional or indirect branch however you implement it. Patched pages cannot be shared. The whole design of PIC and shared libraries is that the code("text")/rodata is immutable and shared and that only a minimal amount of data, packed tightly together (the GOT) has to exist per-instance. Also, allowing patching of executable pages is generally frowned upon these days because W^X is a desirable hardening property. Rich From libc-dev at lists.llvm.org Sun Apr 19 18:34:12 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sun, 19 Apr 2020 21:34:12 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587344003.daumxvs1kh.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> Message-ID: <20200420013412.GZ11469@brightrain.aerifal.cx> On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > > Note that because lr is clobbered we need at least once normally > > call-clobbered register that's not syscall clobbered to save lr in. > > Otherwise stack frame setup is required to spill it. > > The kernel would like to use r9-r12 for itself. We could do with fewer > registers, but we have some delay establishing the stack (depends on a > load which depends on a mfspr), and entry code tends to be quite store > heavy whereas on the caller side you have r1 set up (modulo stack > updates), and the system call is a long delay during which time the > store queue has significant time to drain. > > My feeling is it would be better for kernel to have these scratch > registers. If your new kernel syscall mechanism requires the caller to make a whole stack frame it otherwise doesn't need and spill registers to it, it becomes a lot less attractive. Some of those 90 cycles saved are immediately lost on the userspace side, plus you either waste icache at the call point or require the syscall to go through a userspace-side helper function that performs the spill and restore. The right way to do this is to have the kernel preserve enough registers that userspace can avoid having any spills. It doesn't have to preserve everything, probably just enough to save lr. (BTW are syscall arg registers still preserved? If not, this is a major cost on the userspace side, since any call point that has to loop-and-retry (e.g. futex) now needs to make its own place to store the original values.) 
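To make the loop-and-retry point concrete, here is a small self-contained sketch in plain C, using the portable syscall(2) wrapper as a stand-in for the inline asm under discussion; futex_wait is an illustrative helper, not a libc interface. The structural point is that addr and val are needed again after the call returns, so if the syscall ABI clobbers the argument registers the compiler has to keep copies in call-saved registers or on the stack across every iteration.

#include <errno.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

/* Wait until woken, or until *addr no longer equals val. */
static int futex_wait(volatile int *addr, int val)
{
    for (;;) {
        long r = syscall(SYS_futex, addr, FUTEX_WAIT, val, 0, 0, 0);
        if (r == 0)
            return 0;            /* woken */
        if (errno == EAGAIN)
            return 0;            /* *addr != val already */
        if (errno != EINTR)
            return -1;           /* real error */
        /* EINTR: go around again -- addr and val must still be live here */
    }
}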
Rich From libc-dev at lists.llvm.org Sun Apr 19 19:08:36 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 12:08:36 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200420012904.GY11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416095800.GC23945@port70.net> <1587341904.1r83vbudyf.astroid@bobo.none> <20200420012904.GY11469@brightrain.aerifal.cx> Message-ID: <1587348046.pwnfbo52iq.astroid@bobo.none> Excerpts from Rich Felker's message of April 20, 2020 11:29 am: > On Mon, Apr 20, 2020 at 10:27:58AM +1000, Nicholas Piggin wrote: >> Excerpts from Szabolcs Nagy's message of April 16, 2020 7:58 pm: >> > * Nicholas Piggin via Libc-alpha [2020-04-16 10:16:54 +1000]: >> >> Well it would have to test HWCAP and patch in or branch to two >> >> completely different sequences including register save/restores yes. >> >> You could have the same asm and matching clobbers to put the sequence >> >> inline and then you could patch the one sc/scv instruction I suppose. >> > >> > how would that 'patch' work? >> > >> > there are many reasons why you don't >> > want libc to write its .text >> >> I guess I don't know what I'm talking about when it comes to libraries. >> Shame if there is no good way to load-time patch libc. It's orthogonal >> to the scv selection though -- if you don't patch you have to >> conditional or indirect branch however you implement it. > > Patched pages cannot be shared. The whole design of PIC and shared > libraries is that the code("text")/rodata is immutable and shared and > that only a minimal amount of data, packed tightly together (the GOT) > has to exist per-instance. Yeah the pages which were patched couldn't be shared across exec, which is a significant downside, unless you could group all patch sites into their own section and similarly pack it together (which has issues of being out of line). > > Also, allowing patching of executable pages is generally frowned upon > these days because W^X is a desirable hardening property. Right, it would want be write-protected after being patched. Thanks, Nick From libc-dev at lists.llvm.org Sun Apr 19 19:32:21 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 12:32:21 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200420013412.GZ11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> Message-ID: <1587348538.l1ioqml73m.astroid@bobo.none> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> > Note that because lr is clobbered we need at least once normally >> > call-clobbered register that's not syscall clobbered to save lr in. >> > Otherwise stack frame setup is required to spill it. >> >> The kernel would like to use r9-r12 for itself. 
We could do with fewer >> registers, but we have some delay establishing the stack (depends on a >> load which depends on a mfspr), and entry code tends to be quite store >> heavy whereas on the caller side you have r1 set up (modulo stack >> updates), and the system call is a long delay during which time the >> store queue has significant time to drain. >> >> My feeling is it would be better for kernel to have these scratch >> registers. > > If your new kernel syscall mechanism requires the caller to make a > whole stack frame it otherwise doesn't need and spill registers to it, > it becomes a lot less attractive. Some of those 90 cycles saved are > immediately lost on the userspace side, plus you either waste icache > at the call point or require the syscall to go through a > userspace-side helper function that performs the spill and restore. You would be surprised how few cycles that takes on a high end CPU. Some might be a couple of %. I am one for counting cycles mind you, I'm not being flippant about it. If we can come up with something faster I'd be up for it. > > The right way to do this is to have the kernel preserve enough > registers that userspace can avoid having any spills. It doesn't have > to preserve everything, probably just enough to save lr. (BTW are Again, the problem is the kernel doesn't have its dependencies immediately ready to spill, and spilling (may be) more costly immediately after the call because we're doing a lot of stores. I could try measure this. Unfortunately our pipeline simulator tool doesn't model system calls properly so it's hard to see what's happening across the user/kernel horizon, I might check if that can be improved or I can hack it by putting some isync in there or something. > syscall arg registers still preserved? If not, this is a major cost on > the userspace side, since any call point that has to loop-and-retry > (e.g. futex) now needs to make its own place to store the original > values.) Powerpc system calls never did. We could have scv preserve them, but you'd still need to restore r3. We could make an ABI which does not clobber r3 but puts the return value in r9, say. I'd like to see what the user side code looks like to take advantage of such a thing though. Thanks, Nick From libc-dev at lists.llvm.org Sun Apr 19 21:09:26 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Mon, 20 Apr 2020 00:09:26 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587348538.l1ioqml73m.astroid@bobo.none> References: <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> Message-ID: <20200420040926.GA11469@brightrain.aerifal.cx> On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > >> > Note that because lr is clobbered we need at least once normally > >> > call-clobbered register that's not syscall clobbered to save lr in. > >> > Otherwise stack frame setup is required to spill it. 
> >> > >> The kernel would like to use r9-r12 for itself. We could do with fewer > >> registers, but we have some delay establishing the stack (depends on a > >> load which depends on a mfspr), and entry code tends to be quite store > >> heavy whereas on the caller side you have r1 set up (modulo stack > >> updates), and the system call is a long delay during which time the > >> store queue has significant time to drain. > >> > >> My feeling is it would be better for kernel to have these scratch > >> registers. > > > > If your new kernel syscall mechanism requires the caller to make a > > whole stack frame it otherwise doesn't need and spill registers to it, > > it becomes a lot less attractive. Some of those 90 cycles saved are > > immediately lost on the userspace side, plus you either waste icache > > at the call point or require the syscall to go through a > > userspace-side helper function that performs the spill and restore. > > You would be surprised how few cycles that takes on a high end CPU. Some > might be a couple of %. I am one for counting cycles mind you, I'm not > being flippant about it. If we can come up with something faster I'd be > up for it. If the cycle count is trivial then just do it on the kernel side. > > The right way to do this is to have the kernel preserve enough > > registers that userspace can avoid having any spills. It doesn't have > > to preserve everything, probably just enough to save lr. (BTW are > > Again, the problem is the kernel doesn't have its dependencies > immediately ready to spill, and spilling (may be) more costly > immediately after the call because we're doing a lot of stores. > > I could try measure this. Unfortunately our pipeline simulator tool > doesn't model system calls properly so it's hard to see what's happening > across the user/kernel horizon, I might check if that can be improved > or I can hack it by putting some isync in there or something. I think it's unlikely to make any real difference to the total number of cycles spent which side it happens on, but putting it on the kernel side makes it easier to avoid wasting size/icache at each syscall site. > > syscall arg registers still preserved? If not, this is a major cost on > > the userspace side, since any call point that has to loop-and-retry > > (e.g. futex) now needs to make its own place to store the original > > values.) > > Powerpc system calls never did. We could have scv preserve them, but > you'd still need to restore r3. We could make an ABI which does not > clobber r3 but puts the return value in r9, say. I'd like to see what > the user side code looks like to take advantage of such a thing though. Oh wow, I hadn't realized that, but indeed the code we have now is allowing for the kernel to clobber them all. So at least this isn't getting any worse I guess. I think it was a very poor choice of behavior though and a disadvantage vs what other archs do (some of them preserve all registers; others preserve only normally call-saved ones plus the syscall arg ones and possibly a few other specials). 
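Since the quoted question above asks what the user side would look like for a variant that preserves the argument registers and returns the result somewhere else, here is a purely hypothetical sketch -- no such ABI exists, and the register split below (result in r9, r3-r8 preserved, r10-r12 scratch) is only meant to show the shape of the call site. Because r3-r5 are plain inputs rather than clobbered outputs, a retry loop could reissue the syscall without reloading anything:

/* Hypothetical only: an scv flavour that leaves r3-r8 untouched and
 * returns the result (negative errno on failure) in r9.  Not a real ABI. */
static inline long scv_result_in_r9(long nr, long a, long b, long c)
{
    register long r0 __asm__("r0") = nr;
    register long r3 __asm__("r3") = a;
    register long r4 __asm__("r4") = b;
    register long r5 __asm__("r5") = c;
    register long r9 __asm__("r9");
    __asm__ __volatile__(
        "scv 0"
        : "=r"(r9), "+r"(r0)
        : "r"(r3), "r"(r4), "r"(r5)
        : "r10", "r11", "r12", "lr", "ctr", "cr0", "cr1", "memory");
    return r9;
}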
Rich From libc-dev at lists.llvm.org Sun Apr 19 21:31:58 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Mon, 20 Apr 2020 14:31:58 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200420040926.GA11469@brightrain.aerifal.cx> References: <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> Message-ID: <1587356128.aslvdnmtbw.astroid@bobo.none> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> >> > Note that because lr is clobbered we need at least once normally >> >> > call-clobbered register that's not syscall clobbered to save lr in. >> >> > Otherwise stack frame setup is required to spill it. >> >> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer >> >> registers, but we have some delay establishing the stack (depends on a >> >> load which depends on a mfspr), and entry code tends to be quite store >> >> heavy whereas on the caller side you have r1 set up (modulo stack >> >> updates), and the system call is a long delay during which time the >> >> store queue has significant time to drain. >> >> >> >> My feeling is it would be better for kernel to have these scratch >> >> registers. >> > >> > If your new kernel syscall mechanism requires the caller to make a >> > whole stack frame it otherwise doesn't need and spill registers to it, >> > it becomes a lot less attractive. Some of those 90 cycles saved are >> > immediately lost on the userspace side, plus you either waste icache >> > at the call point or require the syscall to go through a >> > userspace-side helper function that performs the spill and restore. >> >> You would be surprised how few cycles that takes on a high end CPU. Some >> might be a couple of %. I am one for counting cycles mind you, I'm not >> being flippant about it. If we can come up with something faster I'd be >> up for it. > > If the cycle count is trivial then just do it on the kernel side. The cycle count for user is, because you have r1 ready. Kernel does not have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to save into. Which is also wasted work for a userspace. Now that I think about it, no stack frame is even required! lr is saved into the caller's stack when its clobbered with an asm, just as when it's used for a function call. >> > The right way to do this is to have the kernel preserve enough >> > registers that userspace can avoid having any spills. It doesn't have >> > to preserve everything, probably just enough to save lr. (BTW are >> >> Again, the problem is the kernel doesn't have its dependencies >> immediately ready to spill, and spilling (may be) more costly >> immediately after the call because we're doing a lot of stores. >> >> I could try measure this. 
Unfortunately our pipeline simulator tool >> doesn't model system calls properly so it's hard to see what's happening >> across the user/kernel horizon, I might check if that can be improved >> or I can hack it by putting some isync in there or something. > > I think it's unlikely to make any real difference to the total number > of cycles spent which side it happens on, but putting it on the kernel > side makes it easier to avoid wasting size/icache at each syscall > site. > >> > syscall arg registers still preserved? If not, this is a major cost on >> > the userspace side, since any call point that has to loop-and-retry >> > (e.g. futex) now needs to make its own place to store the original >> > values.) >> >> Powerpc system calls never did. We could have scv preserve them, but >> you'd still need to restore r3. We could make an ABI which does not >> clobber r3 but puts the return value in r9, say. I'd like to see what >> the user side code looks like to take advantage of such a thing though. > > Oh wow, I hadn't realized that, but indeed the code we have now is > allowing for the kernel to clobber them all. So at least this isn't > getting any worse I guess. I think it was a very poor choice of > behavior though and a disadvantage vs what other archs do (some of > them preserve all registers; others preserve only normally call-saved > ones plus the syscall arg ones and possibly a few other specials). Well, we could change it. Does the generated code improve significantly we take those clobbers away? Thanks, Nick From libc-dev at lists.llvm.org Mon Apr 20 10:27:15 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Mon, 20 Apr 2020 13:27:15 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587356128.aslvdnmtbw.astroid@bobo.none> References: <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> Message-ID: <20200420172715.GC11469@brightrain.aerifal.cx> On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: > > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: > >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: > >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: > >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: > >> >> > Note that because lr is clobbered we need at least once normally > >> >> > call-clobbered register that's not syscall clobbered to save lr in. > >> >> > Otherwise stack frame setup is required to spill it. > >> >> > >> >> The kernel would like to use r9-r12 for itself. We could do with fewer > >> >> registers, but we have some delay establishing the stack (depends on a > >> >> load which depends on a mfspr), and entry code tends to be quite store > >> >> heavy whereas on the caller side you have r1 set up (modulo stack > >> >> updates), and the system call is a long delay during which time the > >> >> store queue has significant time to drain. > >> >> > >> >> My feeling is it would be better for kernel to have these scratch > >> >> registers. 
> >> > > >> > If your new kernel syscall mechanism requires the caller to make a > >> > whole stack frame it otherwise doesn't need and spill registers to it, > >> > it becomes a lot less attractive. Some of those 90 cycles saved are > >> > immediately lost on the userspace side, plus you either waste icache > >> > at the call point or require the syscall to go through a > >> > userspace-side helper function that performs the spill and restore. > >> > >> You would be surprised how few cycles that takes on a high end CPU. Some > >> might be a couple of %. I am one for counting cycles mind you, I'm not > >> being flippant about it. If we can come up with something faster I'd be > >> up for it. > > > > If the cycle count is trivial then just do it on the kernel side. > > The cycle count for user is, because you have r1 ready. Kernel does not > have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to > save into. > > Which is also wasted work for a userspace. > > Now that I think about it, no stack frame is even required! lr is saved > into the caller's stack when its clobbered with an asm, just as when > it's used for a function call. No. If there is a non-clobbered register, lr can be moved to the non-clobbered register rather than saved to the stack. However it looks like (1) gcc doesn't take advantage of that possibility, but (2) the caller already arranged for there to be space on the stack to save lr, so the cost is only one store and one load, not any stack adjustment or other frame setup. So it's probably not a really big deal. However, just adding "lr" clobber to existing syscall in musl increased the size of a simple syscall function (getuid) from 20 bytes to 36 bytes. > >> > syscall arg registers still preserved? If not, this is a major cost on > >> > the userspace side, since any call point that has to loop-and-retry > >> > (e.g. futex) now needs to make its own place to store the original > >> > values.) > >> > >> Powerpc system calls never did. We could have scv preserve them, but > >> you'd still need to restore r3. We could make an ABI which does not > >> clobber r3 but puts the return value in r9, say. I'd like to see what > >> the user side code looks like to take advantage of such a thing though. > > > > Oh wow, I hadn't realized that, but indeed the code we have now is > > allowing for the kernel to clobber them all. So at least this isn't > > getting any worse I guess. I think it was a very poor choice of > > behavior though and a disadvantage vs what other archs do (some of > > them preserve all registers; others preserve only normally call-saved > > ones plus the syscall arg ones and possibly a few other specials). > > Well, we could change it. Does the generated code improve significantly > we take those clobbers away? I'd have to experiment a bit more to see. It's not going to help at all in functions which are pure syscall wrappers that just do the syscall and return, since the arg regs are dead after the syscall anyway (the caller must assume they were clobbered). But where syscalls are inlined and used in a loop, like a futex wait, it might make a nontrivial difference. Unfortunately even if you did change it for the new scv mechanism, it would be hard to take advantage of the change while also supporting sc, unless we used a helper function that just did scv directly, but saved/restored all the arg regs when using the legacy sc mechanism. 
Just inlining the hwcap conditional and clobbering more regs in one code path than in the other likely would not help; gcc won't shrink-wrap the clobbered/non-clobbered paths separately, and even if it did, when this were inlined somewhere like a futex loop, it'd end up having to lift the conditional out of the loop to be very advantageous, then making the code much larger by producing two copies of the loop. So I think just behaving similarly to the old sc method is probably the best option we have... Rich From libc-dev at lists.llvm.org Tue Apr 21 02:57:00 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Tue, 21 Apr 2020 11:57:00 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200420211751.GF23945@port70.net> (Szabolcs Nagy's message of "Mon, 20 Apr 2020 23:17:51 +0200") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416095800.GC23945@port70.net> <1587341904.1r83vbudyf.astroid@bobo.none> <20200420012904.GY11469@brightrain.aerifal.cx> <1587348046.pwnfbo52iq.astroid@bobo.none> <20200420211751.GF23945@port70.net> Message-ID: <87eeshupoz.fsf@mid.deneb.enyo.de> * Szabolcs Nagy: > * Nicholas Piggin [2020-04-20 12:08:36 +1000]: >> Excerpts from Rich Felker's message of April 20, 2020 11:29 am: >> > Also, allowing patching of executable pages is generally frowned upon >> > these days because W^X is a desirable hardening property. >> >> Right, it would want be write-protected after being patched. > > "frowned upon" means that users may have to update > their security policy setting in pax, selinux, apparmor, > seccomp bpf filters and who knows what else that may > monitor and flag W&X mprotect. > > libc update can break systems if the new libc does W&X. It's possible to map over pre-compiled alternative implementations, though. Basically, we would do the patching and build time and store the results in the file. It works best if the variance is concentrated on a few pages, and there are very few alternatives. For example, having two syscall APIs and supporting threading and no-threading versions would need four code versions in total, which is likely excessive. From libc-dev at lists.llvm.org Mon Apr 20 14:17:51 2020 From: libc-dev at lists.llvm.org (Szabolcs Nagy via libc-dev) Date: Mon, 20 Apr 2020 23:17:51 +0200 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587348046.pwnfbo52iq.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <1586994952.nnxigedbu2.astroid@bobo.none> <20200416095800.GC23945@port70.net> <1587341904.1r83vbudyf.astroid@bobo.none> <20200420012904.GY11469@brightrain.aerifal.cx> <1587348046.pwnfbo52iq.astroid@bobo.none> Message-ID: <20200420211751.GF23945@port70.net> * Nicholas Piggin [2020-04-20 12:08:36 +1000]: > Excerpts from Rich Felker's message of April 20, 2020 11:29 am: > > Also, allowing patching of executable pages is generally frowned upon > > these days because W^X is a desirable hardening property. > > Right, it would want be write-protected after being patched. "frowned upon" means that users may have to update their security policy setting in pax, selinux, apparmor, seccomp bpf filters and who knows what else that may monitor and flag W&X mprotect. libc update can break systems if the new libc does W&X. 
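For completeness, the non-patching alternative discussed above is just a cached hwcap test. A minimal sketch follows; the PPC_FEATURE2_SCV value is the HWCAP2 bit proposed for the kernel and should be treated as an assumption until it appears in your uapi headers.

#include <stdbool.h>
#include <sys/auxv.h>

#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000   /* assumed value of the proposed bit */
#endif

/* Query AT_HWCAP2 once and cache the result; glibc would keep an
 * equivalent flag in the TCB as mentioned earlier in the thread.  The
 * race on first use is benign: every thread computes the same value. */
static bool scv_supported(void)
{
    static int cached = -1;
    if (cached < 0)
        cached = (getauxval(AT_HWCAP2) & PPC_FEATURE2_SCV) != 0;
    return cached;
}

Each inline syscall site then takes the "if (scv_supported()) { scv asm } else { sc asm }" shape quoted earlier, trading a well-predicted branch per call for keeping the text strictly read-only.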
From libc-dev at lists.llvm.org Tue Apr 21 05:28:25 2020 From: libc-dev at lists.llvm.org (David Laight via libc-dev) Date: Tue, 21 Apr 2020 12:28:25 +0000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587344003.daumxvs1kh.astroid@bobo.none> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> Message-ID: From: Nicholas Piggin > Sent: 20 April 2020 02:10 ... > >> Yes, but does it really matter to optimize this specific usage case > >> for size? glibc, for instance, tries to leverage the syscall mechanism > >> by adding some complex pre-processor asm directives. It optimizes > >> the syscall code size in most cases. For instance, kill in static case > >> generates on x86_64: > >> > >> 0000000000000000 <__kill>: > >> 0: b8 3e 00 00 00 mov $0x3e,%eax > >> 5: 0f 05 syscall > >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> Hmmm... that cmp + jae is unnecessary here. It is also a 32bit offset jump. I also suspect it gets predicted very badly. > >> 13: c3 retq > >> > >> While on musl: > >> > >> 0000000000000000 : > >> 0: 48 83 ec 08 sub $0x8,%rsp > >> 4: 48 63 ff movslq %edi,%rdi > >> 7: 48 63 f6 movslq %esi,%rsi > >> a: b8 3e 00 00 00 mov $0x3e,%eax > >> f: 0f 05 syscall > >> 11: 48 89 c7 mov %rax,%rdi > >> 14: e8 00 00 00 00 callq 19 > >> 19: 5a pop %rdx > >> 1a: c3 retq > > > > Wow that's some extraordinarily bad codegen going on by gcc... The > > sign-extension is semantically needed and I don't see a good way > > around it (glibc's asm is kinda a hack taking advantage of kernel not > > looking at high bits, I think), but the gratuitous stack adjustment > > and refusal to generate a tail call isn't. I'll see if we can track > > down what's going on and get it fixed. A suitable cast might get rid of the sign extension. Possibly just (unsigned int). David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) From libc-dev at lists.llvm.org Tue Apr 21 07:39:41 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Tue, 21 Apr 2020 10:39:41 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> Message-ID: <20200421143941.GJ11469@brightrain.aerifal.cx> On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: > From: Nicholas Piggin > > Sent: 20 April 2020 02:10 > ... > > >> Yes, but does it really matter to optimize this specific usage case > > >> for size? glibc, for instance, tries to leverage the syscall mechanism > > >> by adding some complex pre-processor asm directives. It optimizes > > >> the syscall code size in most cases. 
For instance, kill in static case > > >> generates on x86_64: > > >> > > >> 0000000000000000 <__kill>: > > >> 0: b8 3e 00 00 00 mov $0x3e,%eax > > >> 5: 0f 05 syscall > > >> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > > >> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > > Hmmm... that cmp + jae is unnecessary here. It's not.. Rather the objdump was just mistakenly done without -r so it looks like a nop jump rather than a conditional tail call to the function that sets errno. > It is also a 32bit offset jump. > I also suspect it gets predicted very badly. I doubt that. This is a very standard idiom and the size of the offset (which is necessarily 32-bit because it has a relocation on it) is orthogonal to the condition on the jump. FWIW a syscall like kill takes global kernel-side locks to be able to address a target process by pid, and the rate of meaningful calls you can make to it is very low (since it's bounded by time for target process to act on the signal). Trying to optimize it for speed is pointless, and even size isn't important locally (although in aggregate, lots of wasted small size can add up to more pages = more TLB entries = ...). > > >> 13: c3 retq > > >> > > >> While on musl: > > >> > > >> 0000000000000000 : > > >> 0: 48 83 ec 08 sub $0x8,%rsp > > >> 4: 48 63 ff movslq %edi,%rdi > > >> 7: 48 63 f6 movslq %esi,%rsi > > >> a: b8 3e 00 00 00 mov $0x3e,%eax > > >> f: 0f 05 syscall > > >> 11: 48 89 c7 mov %rax,%rdi > > >> 14: e8 00 00 00 00 callq 19 > > >> 19: 5a pop %rdx > > >> 1a: c3 retq > > > > > > Wow that's some extraordinarily bad codegen going on by gcc... The > > > sign-extension is semantically needed and I don't see a good way > > > around it (glibc's asm is kinda a hack taking advantage of kernel not > > > looking at high bits, I think), but the gratuitous stack adjustment > > > and refusal to generate a tail call isn't. I'll see if we can track > > > down what's going on and get it fixed. > > A suitable cast might get rid of the sign extension. > Possibly just (unsigned int). No, it won't. The problem is that there is no representation of the fact that the kernel is only going to inspect the low 32 bits (by declaring the kernel-side function as taking an int argument). The external kill function receives arguments by the ABI, where the upper bits of int args can contain junk, and the asm register constraints for syscalls use longs (or rather an abstract syscall-arg type). It wouldn't even work to have macro magic detect that the expressions passed are ints and use hacks to avoid that, since it's perfectly valid to pass an int to a syscall that expects a long argument (e.g. offset to mmap), in which case it needs to be sign-extended. The only way to avoid this is encoding somewhere the syscall-specific knowledge of what arg size the kernel function expects. That's way too much redundant effort and too error-prone for the incredibly miniscule size benefit you'd get out of it. 
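A two-line demonstration of why the cast cannot be applied blindly by a generic syscall macro: the same long-typed argument slot sometimes carries a genuine long (an lseek or mmap offset, say), and a negative int pushed through unsigned int comes out as a huge positive value on an LP64 target.

#include <stdio.h>

int main(void)
{
    int off = -4096;
    long extended = (long)off;                   /* what the movslq does */
    long cast_away = (long)(unsigned int)off;    /* the suggested cast   */
    printf("%ld vs %ld\n", extended, cast_away); /* -4096 vs 4294963200  */
    return 0;
}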
Rich From libc-dev at lists.llvm.org Tue Apr 21 08:00:31 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Tue, 21 Apr 2020 12:00:31 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200421143941.GJ11469@brightrain.aerifal.cx> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> Message-ID: <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> On 21/04/2020 11:39, Rich Felker wrote: > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: >> From: Nicholas Piggin >>> Sent: 20 April 2020 02:10 >> ... >>>>> Yes, but does it really matter to optimize this specific usage case >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism >>>>> by adding some complex pre-processor asm directives. It optimizes >>>>> the syscall code size in most cases. For instance, kill in static case >>>>> generates on x86_64: >>>>> >>>>> 0000000000000000 <__kill>: >>>>> 0: b8 3e 00 00 00 mov $0x3e,%eax >>>>> 5: 0f 05 syscall >>>>> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax >>>>> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> >> >> Hmmm... that cmp + jae is unnecessary here. > > It's not.. Rather the objdump was just mistakenly done without -r so > it looks like a nop jump rather than a conditional tail call to the > function that sets errno. > Indeed, the output with -r is: 0000000000000000 <__kill>: 0: b8 3e 00 00 00 mov $0x3e,%eax 5: 0f 05 syscall 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> f: R_X86_64_PLT32 __syscall_error-0x4 13: c3 retq And for x86_64 __syscall_error is defined as: 0000000000000000 <__syscall_error>: 0: 48 f7 d8 neg %rax 0000000000000003 <__syscall_error_1>: 3: 64 89 04 25 00 00 00 mov %eax,%fs:0x0 a: 00 7: R_X86_64_TPOFF32 errno b: 48 83 c8 ff or $0xffffffffffffffff,%rax f: c3 retq Different than musl, each architecture defines its own error handling mechanism (some embedded errno setting in syscall itself, other branches to a __syscall_error like function as x86_64). This is due most likely from the glibc long history. One of my long term plan is to just simplify, get rid of the assembly pre-processor, implement all syscall in C code, and set error handling mechanism in a platform neutral way using a tail call (most likely you do on musl). >> It is also a 32bit offset jump. >> I also suspect it gets predicted very badly. > > I doubt that. This is a very standard idiom and the size of the offset > (which is necessarily 32-bit because it has a relocation on it) is > orthogonal to the condition on the jump. > > FWIW a syscall like kill takes global kernel-side locks to be able to > address a target process by pid, and the rate of meaningful calls you > can make to it is very low (since it's bounded by time for target > process to act on the signal). Trying to optimize it for speed is > pointless, and even size isn't important locally (although in > aggregate, lots of wasted small size can add up to more pages = more > TLB entries = ...). 
I agree and I would prefer to focus on code simplicity to have a platform neutral way to handle error and let the compiler optimize it than messy with assembly macros to squeeze this kind of micro-optimizations. > >>>>> 13: c3 retq >>>>> >>>>> While on musl: >>>>> >>>>> 0000000000000000 : >>>>> 0: 48 83 ec 08 sub $0x8,%rsp >>>>> 4: 48 63 ff movslq %edi,%rdi >>>>> 7: 48 63 f6 movslq %esi,%rsi >>>>> a: b8 3e 00 00 00 mov $0x3e,%eax >>>>> f: 0f 05 syscall >>>>> 11: 48 89 c7 mov %rax,%rdi >>>>> 14: e8 00 00 00 00 callq 19 >>>>> 19: 5a pop %rdx >>>>> 1a: c3 retq >>>> >>>> Wow that's some extraordinarily bad codegen going on by gcc... The >>>> sign-extension is semantically needed and I don't see a good way >>>> around it (glibc's asm is kinda a hack taking advantage of kernel not >>>> looking at high bits, I think), but the gratuitous stack adjustment >>>> and refusal to generate a tail call isn't. I'll see if we can track >>>> down what's going on and get it fixed. >> >> A suitable cast might get rid of the sign extension. >> Possibly just (unsigned int). > > No, it won't. The problem is that there is no representation of the > fact that the kernel is only going to inspect the low 32 bits (by > declaring the kernel-side function as taking an int argument). The > external kill function receives arguments by the ABI, where the upper > bits of int args can contain junk, and the asm register constraints > for syscalls use longs (or rather an abstract syscall-arg type). It > wouldn't even work to have macro magic detect that the expressions > passed are ints and use hacks to avoid that, since it's perfectly > valid to pass an int to a syscall that expects a long argument (e.g. > offset to mmap), in which case it needs to be sign-extended. > > The only way to avoid this is encoding somewhere the syscall-specific > knowledge of what arg size the kernel function expects. That's way too > much redundant effort and too error-prone for the incredibly miniscule > size benefit you'd get out of it. > > Rich > From libc-dev at lists.llvm.org Tue Apr 21 08:31:08 2020 From: libc-dev at lists.llvm.org (David Laight via libc-dev) Date: Tue, 21 Apr 2020 15:31:08 +0000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> Message-ID: From: Adhemerval Zanella > Sent: 21 April 2020 16:01 > > On 21/04/2020 11:39, Rich Felker wrote: > > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote: > >> From: Nicholas Piggin > >>> Sent: 20 April 2020 02:10 > >> ... > >>>>> Yes, but does it really matter to optimize this specific usage case > >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism > >>>>> by adding some complex pre-processor asm directives. It optimizes > >>>>> the syscall code size in most cases. 
For instance, kill in static case > >>>>> generates on x86_64: > >>>>> > >>>>> 0000000000000000 <__kill>: > >>>>> 0: b8 3e 00 00 00 mov $0x3e,%eax > >>>>> 5: 0f 05 syscall > >>>>> 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > >>>>> d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > >> > >> Hmmm... that cmp + jae is unnecessary here. > > > > It's not.. Rather the objdump was just mistakenly done without -r so > > it looks like a nop jump rather than a conditional tail call to the > > function that sets errno. > > > > Indeed, the output with -r is: > > 0000000000000000 <__kill>: > 0: b8 3e 00 00 00 mov $0x3e,%eax > 5: 0f 05 syscall > 7: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax > d: 0f 83 00 00 00 00 jae 13 <__kill+0x13> > f: R_X86_64_PLT32 __syscall_error-0x4 > 13: c3 retq Yes, I probably should have remembered it looked like that :-) ... > >> I also suspect it gets predicted very badly. > > > > I doubt that. This is a very standard idiom and the size of the offset > > (which is necessarily 32-bit because it has a relocation on it) is > > orthogonal to the condition on the jump. Yes, it only gets mispredicted as badly as any other conditional jump. I believe modern intel x86 will randomly predict it taken (regardless of the direction) and then hit a TLB fault on text.unlikely :-) > > FWIW a syscall like kill takes global kernel-side locks to be able to > > address a target process by pid, and the rate of meaningful calls you > > can make to it is very low (since it's bounded by time for target > > process to act on the signal). Trying to optimize it for speed is > > pointless, and even size isn't important locally (although in > > aggregate, lots of wasted small size can add up to more pages = more > > TLB entries = ...). > > I agree and I would prefer to focus on code simplicity to have a > platform neutral way to handle error and let the compiler optimize > it than messy with assembly macros to squeeze this kind of > micro-optimizations. syscall entry does get micro-optimised. Real speed-ups can probably be found by optimising other places. I've a patch i need to resumbit that should improve the reading of iov[] from user space. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) From libc-dev at lists.llvm.org Wed Apr 22 00:15:08 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Wed, 22 Apr 2020 09:15:08 +0200 Subject: [libc-dev] [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587536988.ivnp421w2w.astroid@bobo.none> (Nicholas Piggin's message of "Wed, 22 Apr 2020 16:54:18 +1000") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> <1587536988.ivnp421w2w.astroid@bobo.none> Message-ID: <874ktcng8z.fsf@mid.deneb.enyo.de> * Nicholas Piggin: > Another option would be to use a different signal. I don't see that any > are more suitable. SIGSYS comes to my mind. But I don't know how exclusively it is associated with seccomp these days. 
From libc-dev at lists.llvm.org Wed Apr 22 01:11:49 2020 From: libc-dev at lists.llvm.org (Florian Weimer via libc-dev) Date: Wed, 22 Apr 2020 10:11:49 +0200 Subject: [libc-dev] [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587540390.vde84z8edw.astroid@bobo.none> (Nicholas Piggin's message of "Wed, 22 Apr 2020 17:31:07 +1000") References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> <1587536988.ivnp421w2w.astroid@bobo.none> <874ktcng8z.fsf@mid.deneb.enyo.de> <1587540390.vde84z8edw.astroid@bobo.none> Message-ID: <87imhslz22.fsf@mid.deneb.enyo.de> * Nicholas Piggin: > So I would be disinclined to use SIGSYS unless there are no other better > signal types, and we don't want to use SIGILL for some good reason -- is > there a good reason to add complexity for userspace by differentiating > these two situations? No, SIGILL seems fine to me. scv 0 and scv 1 could well be considered different instructions eventually (with different mnemonics). From libc-dev at lists.llvm.org Tue Apr 21 23:18:36 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Wed, 22 Apr 2020 16:18:36 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200420172715.GC11469@brightrain.aerifal.cx> References: <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> Message-ID: <1587531042.1qvc287tsc.astroid@bobo.none> Excerpts from Rich Felker's message of April 21, 2020 3:27 am: > On Mon, Apr 20, 2020 at 02:31:58PM +1000, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 20, 2020 2:09 pm: >> > On Mon, Apr 20, 2020 at 12:32:21PM +1000, Nicholas Piggin wrote: >> >> Excerpts from Rich Felker's message of April 20, 2020 11:34 am: >> >> > On Mon, Apr 20, 2020 at 11:10:25AM +1000, Nicholas Piggin wrote: >> >> >> Excerpts from Rich Felker's message of April 17, 2020 4:31 am: >> >> >> > Note that because lr is clobbered we need at least once normally >> >> >> > call-clobbered register that's not syscall clobbered to save lr in. >> >> >> > Otherwise stack frame setup is required to spill it. >> >> >> >> >> >> The kernel would like to use r9-r12 for itself. We could do with fewer >> >> >> registers, but we have some delay establishing the stack (depends on a >> >> >> load which depends on a mfspr), and entry code tends to be quite store >> >> >> heavy whereas on the caller side you have r1 set up (modulo stack >> >> >> updates), and the system call is a long delay during which time the >> >> >> store queue has significant time to drain. >> >> >> >> >> >> My feeling is it would be better for kernel to have these scratch >> >> >> registers. 
>> >> > >> >> > If your new kernel syscall mechanism requires the caller to make a >> >> > whole stack frame it otherwise doesn't need and spill registers to it, >> >> > it becomes a lot less attractive. Some of those 90 cycles saved are >> >> > immediately lost on the userspace side, plus you either waste icache >> >> > at the call point or require the syscall to go through a >> >> > userspace-side helper function that performs the spill and restore. >> >> >> >> You would be surprised how few cycles that takes on a high end CPU. Some >> >> might be a couple of %. I am one for counting cycles mind you, I'm not >> >> being flippant about it. If we can come up with something faster I'd be >> >> up for it. >> > >> > If the cycle count is trivial then just do it on the kernel side. >> >> The cycle count for user is, because you have r1 ready. Kernel does not >> have its stack ready, it has to mfspr rX ; ld rY,N(rX); to get stack to >> save into. >> >> Which is also wasted work for a userspace. >> >> Now that I think about it, no stack frame is even required! lr is saved >> into the caller's stack when its clobbered with an asm, just as when >> it's used for a function call. > > No. If there is a non-clobbered register, lr can be moved to the > non-clobbered register rather than saved to the stack. However it > looks like (1) gcc doesn't take advantage of that possibility, but (2) > the caller already arranged for there to be space on the stack to save > lr, so the cost is only one store and one load, not any stack > adjustment or other frame setup. So it's probably not a really big > deal. However, just adding "lr" clobber to existing syscall in musl > increased the size of a simple syscall function (getuid) from 20 bytes > to 36 bytes. Yeah I had a bit of a play around with musl (which is very nice code I must say). The powerpc64 syscall asm is missing ctr clobber by the way. Fortunately adding it doesn't change code generation for me, but it should be fixed. glibc had the same bug at one point I think (probably due to syscall ABI documentation not existing -- something now lives in linux/Documentation/powerpc/syscall64-abi.rst). Yes lr needs to be saved, I didn't see any new requirement for stack frames, and it was often already saved, but it does hurt the small wrapper functions. I did look at entirely replacing sc with scv though, just as an experiment. One day you might make sc optional! Text size impoves by about 3kB with the proposed ABI. Mostly seems to be the bns+ ; neg sequence. __syscall1/2/3 get out-of-lined by the compiler in a lot of cases. Linux's bloat-o-meter says: add/remove: 0/5 grow/shrink: 24/260 up/down: 220/-3428 (-3208) Function old new delta fcntl 400 424 +24 popen 600 620 +20 times 32 40 +8 [...] alloc_rev 816 784 -32 alloc_fwd 812 780 -32 __syscall1.constprop 32 - -32 __fdopen 504 472 -32 __expand_heap 628 592 -36 __syscall2 40 - -40 __syscall3 44 - -44 fchmodat 372 324 -48 __wake.constprop 408 360 -48 child 1116 1064 -52 checker 220 156 -64 __bin_chunk 1576 1512 -64 malloc 1940 1860 -80 __syscall3.constprop 96 - -96 __syscall1 108 - -108 Total: Before=613379, After=610171, chg -0.52% Now if we go a step further we could preserve r0,r4-r8. That gives the kernel r9-r12 as scratch while leaving userspace with some spare volatile GPRs except in the uncommon syscall6 case. 
static inline long __syscall0(long n) { register long r0 __asm__("r0") = n; register long r3 __asm__("r3"); __asm__ __volatile__("scv 0" : "=r"(r3) : "r"(r0) : "memory", "cr0", "cr1", "cr5", "cr6", "cr7", "lr", "ctr", "r9", "r10", "r11", "r12"); return r3; } That saves another ~400 bytes, reducing some of the register shuffling for futex loops etc: [...] __pthread_cond_timedwait 964 944 -20 __expand_heap 592 572 -20 socketpair 292 268 -24 __wake.constprop 360 336 -24 malloc 1860 1828 -32 __bin_chunk 1512 1472 -40 fcntl 424 376 -48 Total: Before=610171, After=609723, chg -0.07% As you say, the compiler doesn't do a good job of saving lr in a spare GPR unfortunately. Saving it ourselves to eliminate the lr clobber is no good because it's almost always already saved. At least having non-clobbered volatile GPRs could let a future smarter compiler take advantage. If we go further and try to preserve r3 as well by putting the return value in r9 or r0, we go backwards about 300 bytes. It's good for the lock loops and complex functions, but hurts a lot of simpler functions that have to add 'mr r3,r9' etc. Most of the time there are saved non-volatile GPRs around anyway though, so not sure which way to go on this. Text size savings can't be ignored and it's pretty easy for the kernel to do (we already save r3-r8 and zero them on exit, so we could load them instead from cache line that's should be hot). So I may be inclined to go this way, even if we won't see benefit now. Thanks, Nick From libc-dev at lists.llvm.org Tue Apr 21 23:29:19 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Wed, 22 Apr 2020 16:29:19 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587531042.1qvc287tsc.astroid@bobo.none> References: <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> Message-ID: <1587536847.k87ypbo53k.astroid@bobo.none> Excerpts from Nicholas Piggin's message of April 22, 2020 4:18 pm: > If we go further and try to preserve r3 as well by putting the return > value in r9 or r0, we go backwards about 300 bytes. It's good for the > lock loops and complex functions, but hurts a lot of simpler functions > that have to add 'mr r3,r9' etc. > > Most of the time there are saved non-volatile GPRs around anyway though, > so not sure which way to go on this. Text size savings can't be ignored > and it's pretty easy for the kernel to do (we already save r3-r8 and > zero them on exit, so we could load them instead from cache line that's > should be hot). > > So I may be inclined to go this way, even if we won't see benefit now. By, "this way" I don't mean r9 or r0 return value (which is larger code), but r3 return value with r0,r4-r8 preserved.
Thanks, Nick From libc-dev at lists.llvm.org Tue Apr 21 23:54:18 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Wed, 22 Apr 2020 16:54:18 +1000 Subject: [libc-dev] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> Message-ID: <1587536988.ivnp421w2w.astroid@bobo.none> Let me try to summarise what we have. - vdso style call is ruled out as unnecessary with possible security concerns. Caller can internally use indirect branch to select variant if it wants to use that mechanism to select. - LR clobber seems to handled okay by gcc. It can increase size of small leaf wrapper functions, but they can use the caller stack frame for this (and even red zone for saving other things if necessary), but not a huge amount. - -ve error return seems to be favoured by everyone. Experimentally, it's better for musl (but musl could probably improve cr0[SO] error handling a bit 'asm goto'). - Preserving syscall args and volatiles up to r8 is a small but noticable help for cases that inline the call rather than always call wrappers. This is unlikely to be helpful unless 'sc' support is compiled out but I'll consider doing it for the long term. Next step is to trace and test on real hardware. - One thing that nobody has really asked about is error handling for unsupported scv vectors, so I would like to just go over it: Today, the scv facility is disabled by the kernel (FSCR[SCV] is cleared), which makes any `scv` instruction take a facility unavailable, which ends up printing a kernel message about SCV facility unavilable, and SIGILL's the process with ILL_ILLOPC. Enabling 'scv 0' will enable 1-127 as well, so the kernel has to handle those somehow. What we are saying is that we will allocate HWCAP bits in future if we implement more scv vectors, so userspace is not *supposed* to rely on this, but kernel has to choose some behaviour for invalid vectors. My proposal was to do the same SIGILL (with no kernel facility message), so it appears to behave the same way to userspace as it does now. There is also the ILL_ILLOPN code that could be used as invalid operand, but powerpc does not use this much, and e.g., the static instruction coded operands e.g., invalid mfspr generate ILL_ILLOPC so we could consider the entire instruction as the opcode, and input register values as operands. Now I don't know why a process would want to distinguish between FSCR[SCV]=0 and the case where it is enabled but kernel doesn't implement the vector, but maybe it does? Another option would be to use a different signal. I don't see that any are more suitable. Or return without a signal but -ENOSYS or something in r3. This doesn't seem so good because an invalid scv vector is not a system call, and a failure ABI would constrain any future implementation just a little bit. Any objections to SIGILL ILL_ILLOPC? 
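As a concrete illustration of the HWCAP-based discoverability mentioned in the summary above, a libc could gate its use of scv roughly as follows; the bit name and value are assumptions for this sketch, not something fixed by the thread:

#include <sys/auxv.h>

#ifndef PPC_FEATURE2_SCV
#define PPC_FEATURE2_SCV 0x00100000	/* assumed HWCAP2 bit advertising "scv 0" support */
#endif

static int have_scv;

static void detect_scv(void)
{
	/* If the kernel does not advertise scv, use sc unconditionally;
	   issuing scv anyway would raise the SIGILL described above. */
	have_scv = (getauxval(AT_HWCAP2) & PPC_FEATURE2_SCV) != 0;
}

The wrappers, or a startup-time patching scheme as discussed earlier in the thread, would then select the scv 0 or sc entry sequence based on have_scv.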
Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 22 00:31:07 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Wed, 22 Apr 2020 17:31:07 +1000 Subject: [libc-dev] [musl] Re: Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <874ktcng8z.fsf@mid.deneb.enyo.de> References: <1586931450.ub4c8cq8dj.astroid@bobo.none> <20200415225539.GL11469@brightrain.aerifal.cx> <20200416153756.GU11469@brightrain.aerifal.cx> <4b2a7a56-dd2b-1863-50e5-2f4cdbeef47c@linaro.org> <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200421143941.GJ11469@brightrain.aerifal.cx> <960127e0-57a0-55b4-f309-ae0a675c7756@linaro.org> <1587536988.ivnp421w2w.astroid@bobo.none> <874ktcng8z.fsf@mid.deneb.enyo.de> Message-ID: <1587540390.vde84z8edw.astroid@bobo.none> Excerpts from Florian Weimer's message of April 22, 2020 5:15 pm: > * Nicholas Piggin: > >> Another option would be to use a different signal. I don't see that any >> are more suitable. > > SIGSYS comes to my mind. But I don't know how exclusively it is > associated with seccomp these days. SIGSYS is entirely seccomp now. There looks like a single obscure MIPS user of it in Linux that's not seccomp, but it would be entirely new for powerpc (or any of the common platforms, arm, x86 etc). So I would be disinclined to use SIGSYS unless there are no other better signal types, and we don't want to use SIGILL for some good reason -- is there a good reason to add complexity for userspace by differentiating these two situations? Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 22 19:36:42 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Wed, 22 Apr 2020 22:36:42 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587531042.1qvc287tsc.astroid@bobo.none> References: <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> Message-ID: <20200423023642.GP11469@brightrain.aerifal.cx> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > Yeah I had a bit of a play around with musl (which is very nice code I > must say). The powerpc64 syscall asm is missing ctr clobber by the way. > Fortunately adding it doesn't change code generation for me, but it > should be fixed. glibc had the same bug at one point I think (probably > due to syscall ABI documentation not existing -- something now lives in > linux/Documentation/powerpc/syscall64-abi.rst). Do you know anywhere I can read about the ctr issue, possibly the relevant glibc bug report? I'm not particularly familiar with ppc register file (at least I have to refamiliarize myself every time I work on this stuff) so it'd be nice to understand what's potentially-wrong now. 
Rich From libc-dev at lists.llvm.org Thu Apr 23 05:13:57 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 23 Apr 2020 09:13:57 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200423023642.GP11469@brightrain.aerifal.cx> References: <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> Message-ID: On 22/04/2020 23:36, Rich Felker wrote: > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> Yeah I had a bit of a play around with musl (which is very nice code I >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> Fortunately adding it doesn't change code generation for me, but it >> should be fixed. glibc had the same bug at one point I think (probably >> due to syscall ABI documentation not existing -- something now lives in >> linux/Documentation/powerpc/syscall64-abi.rst). > > Do you know anywhere I can read about the ctr issue, possibly the > relevant glibc bug report? I'm not particularly familiar with ppc > register file (at least I have to refamiliarize myself every time I > work on this stuff) so it'd be nice to understand what's > potentially-wrong now. My understanding is the ctr issue only happens for vDSO calls where it fallback to a syscall in case an error (invalid argument, etc. and assuming if vDSO does not fallback to a syscall it always succeed). This makes the vDSO call on powerpc to have same same ABI constraint as a syscall, where it clobbers CR0. On glibc we handle by simulating a function call and analysing the CR0 result: __asm__ __volatile__ ("mtctr %0\n\t" "bctrl\n\t" "mfcr %0\n\t" "0:" : "+r" (r0), "+r" (r3), "+r" (r4), "+r" (r5), "+r" (r6), "+r" (r7), "+r" (r8) : : "r9", "r10", "r11", "r12", "cr0", "ctr", "lr", "memory"); __asm__ __volatile__ ("" : "=r" (rval) : "r" (r3)); On musl you don't have this issue because it does not enable vDSO support on powerpc. And if it eventually does it with the VDSO_* macros the only issue I see is on when vDSO fallbacks to the syscall and it also fails (the return code won't be negated since on musl it uses a default C function pointer issue which does not model the CR0 kernel abi). So I think the extra ctr constraint on glibc powerpc syscall code is not really required. I think I have some patches to optimize this a bit based on previous discussions. 
From libc-dev at lists.llvm.org Thu Apr 23 09:18:41 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 23 Apr 2020 12:18:41 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: References: <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> Message-ID: <20200423161841.GU11469@brightrain.aerifal.cx> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > > > On 22/04/2020 23:36, Rich Felker wrote: > > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >> Yeah I had a bit of a play around with musl (which is very nice code I > >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >> Fortunately adding it doesn't change code generation for me, but it > >> should be fixed. glibc had the same bug at one point I think (probably > >> due to syscall ABI documentation not existing -- something now lives in > >> linux/Documentation/powerpc/syscall64-abi.rst). > > > > Do you know anywhere I can read about the ctr issue, possibly the > > relevant glibc bug report? I'm not particularly familiar with ppc > > register file (at least I have to refamiliarize myself every time I > > work on this stuff) so it'd be nice to understand what's > > potentially-wrong now. > > My understanding is the ctr issue only happens for vDSO calls where it > fallback to a syscall in case an error (invalid argument, etc. and > assuming if vDSO does not fallback to a syscall it always succeed). > This makes the vDSO call on powerpc to have same same ABI constraint > as a syscall, where it clobbers CR0. I think you mean "vsyscall", the old thing glibc used where there are in-userspace implementations of some syscalls with call interfaces roughly equivalent to a syscall. musl has never used this. It only uses the actual exported functions from the vdso which have normal external function call ABI. Rich From libc-dev at lists.llvm.org Thu Apr 23 09:35:01 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 23 Apr 2020 13:35:01 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200423161841.GU11469@brightrain.aerifal.cx> References: <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> Message-ID: <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> On 23/04/2020 13:18, Rich Felker wrote: > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >> >> >> On 22/04/2020 23:36, Rich Felker wrote: >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >>>> Yeah I had a bit of a play around with musl (which is very nice code I >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. 
>>>> Fortunately adding it doesn't change code generation for me, but it >>>> should be fixed. glibc had the same bug at one point I think (probably >>>> due to syscall ABI documentation not existing -- something now lives in >>>> linux/Documentation/powerpc/syscall64-abi.rst). >>> >>> Do you know anywhere I can read about the ctr issue, possibly the >>> relevant glibc bug report? I'm not particularly familiar with ppc >>> register file (at least I have to refamiliarize myself every time I >>> work on this stuff) so it'd be nice to understand what's >>> potentially-wrong now. >> >> My understanding is the ctr issue only happens for vDSO calls where it >> fallback to a syscall in case an error (invalid argument, etc. and >> assuming if vDSO does not fallback to a syscall it always succeed). >> This makes the vDSO call on powerpc to have same same ABI constraint >> as a syscall, where it clobbers CR0. > > I think you mean "vsyscall", the old thing glibc used where there are > in-userspace implementations of some syscalls with call interfaces > roughly equivalent to a syscall. musl has never used this. It only > uses the actual exported functions from the vdso which have normal > external function call ABI. I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. The issue is indeed when calling the powerpc provided functions in vDSO, which musl might want to do eventually. From libc-dev at lists.llvm.org Thu Apr 23 09:43:14 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 23 Apr 2020 12:43:14 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> References: <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> Message-ID: <20200423164314.GX11469@brightrain.aerifal.cx> On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > > > On 23/04/2020 13:18, Rich Felker wrote: > > On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 22/04/2020 23:36, Rich Felker wrote: > >>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >>>> Yeah I had a bit of a play around with musl (which is very nice code I > >>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >>>> Fortunately adding it doesn't change code generation for me, but it > >>>> should be fixed. glibc had the same bug at one point I think (probably > >>>> due to syscall ABI documentation not existing -- something now lives in > >>>> linux/Documentation/powerpc/syscall64-abi.rst). > >>> > >>> Do you know anywhere I can read about the ctr issue, possibly the > >>> relevant glibc bug report? I'm not particularly familiar with ppc > >>> register file (at least I have to refamiliarize myself every time I > >>> work on this stuff) so it'd be nice to understand what's > >>> potentially-wrong now. > >> > >> My understanding is the ctr issue only happens for vDSO calls where it > >> fallback to a syscall in case an error (invalid argument, etc. and > >> assuming if vDSO does not fallback to a syscall it always succeed). 
> >> This makes the vDSO call on powerpc to have same same ABI constraint > >> as a syscall, where it clobbers CR0. > > > > I think you mean "vsyscall", the old thing glibc used where there are > > in-userspace implementations of some syscalls with call interfaces > > roughly equivalent to a syscall. musl has never used this. It only > > uses the actual exported functions from the vdso which have normal > > external function call ABI. > > I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > The issue is indeed when calling the powerpc provided functions in > vDSO, which musl might want to do eventually. AIUI (at least this is true for all other archs) the functions have normal external function call ABI and calling them has nothing to do with syscall mechanisms. It looks like we're not using them right now and I'm not sure why. It could be that there are ABI mismatch issues (are 32-bit ones compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or just that nobody proposed adding them. Also as of 5.4 32-bit ppc lacked time64 versions of them; not sure if this is fixed yet. Rich From libc-dev at lists.llvm.org Thu Apr 23 10:15:58 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Thu, 23 Apr 2020 14:15:58 -0300 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200423164314.GX11469@brightrain.aerifal.cx> References: <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> <20200423164314.GX11469@brightrain.aerifal.cx> Message-ID: <64d82a23-1f6e-2e6a-b7a9-0eeab8a53435@linaro.org> On 23/04/2020 13:43, Rich Felker wrote: > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: >> >> >> On 23/04/2020 13:18, Rich Felker wrote: >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >>>> >>>> >>>> On 22/04/2020 23:36, Rich Felker wrote: >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >>>>>> Fortunately adding it doesn't change code generation for me, but it >>>>>> should be fixed. glibc had the same bug at one point I think (probably >>>>>> due to syscall ABI documentation not existing -- something now lives in >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). >>>>> >>>>> Do you know anywhere I can read about the ctr issue, possibly the >>>>> relevant glibc bug report? I'm not particularly familiar with ppc >>>>> register file (at least I have to refamiliarize myself every time I >>>>> work on this stuff) so it'd be nice to understand what's >>>>> potentially-wrong now. >>>> >>>> My understanding is the ctr issue only happens for vDSO calls where it >>>> fallback to a syscall in case an error (invalid argument, etc. and >>>> assuming if vDSO does not fallback to a syscall it always succeed). >>>> This makes the vDSO call on powerpc to have same same ABI constraint >>>> as a syscall, where it clobbers CR0. 
>>> >>> I think you mean "vsyscall", the old thing glibc used where there are >>> in-userspace implementations of some syscalls with call interfaces >>> roughly equivalent to a syscall. musl has never used this. It only >>> uses the actual exported functions from the vdso which have normal >>> external function call ABI. >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. >> The issue is indeed when calling the powerpc provided functions in >> vDSO, which musl might want to do eventually. > > AIUI (at least this is true for all other archs) the functions have > normal external function call ABI and calling them has nothing to do > with syscall mechanisms. My point is powerpc specifically does not follow it, since it issues a syscall in fallback and its semantic follow kernel syscalls (error signalled in cr0, r3 being always a positive value): -- V_FUNCTION_BEGIN(__kernel_clock_gettime) .cfi_startproc [...] /* * syscall fallback */ 99: li r0,__NR_clock_gettime .cfi_restore lr sc blr .cfi_endproc V_FUNCTION_END(__kernel_clock_gettime) > > It looks like we're not using them right now and I'm not sure why. It > could be that there are ABI mismatch issues (are 32-bit ones > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or > just that nobody proposed adding them. Also as of 5.4 32-bit ppc > lacked time64 versions of them; not sure if this is fixed yet. For 64-bit it also have an issue where vDSO does not provide an OPD for ELFv1, which has bitten glibc while trying to implement an ifunc optimization. I don't recall any issue for ELFv2. For 32-bit I am not sure secure-plt will change anything, at least not on powerpc where we use the same strategy for 64-bit and use a mtctr/bctr directly. From libc-dev at lists.llvm.org Thu Apr 23 10:42:14 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Thu, 23 Apr 2020 13:42:14 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <64d82a23-1f6e-2e6a-b7a9-0eeab8a53435@linaro.org> References: <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> <20200423164314.GX11469@brightrain.aerifal.cx> <64d82a23-1f6e-2e6a-b7a9-0eeab8a53435@linaro.org> Message-ID: <20200423174214.GZ11469@brightrain.aerifal.cx> On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: > > > On 23/04/2020 13:43, Rich Felker wrote: > > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 23/04/2020 13:18, Rich Felker wrote: > >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >>>> > >>>> > >>>> On 22/04/2020 23:36, Rich Felker wrote: > >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I > >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >>>>>> Fortunately adding it doesn't change code generation for me, but it > >>>>>> should be fixed. glibc had the same bug at one point I think (probably > >>>>>> due to syscall ABI documentation not existing -- something now lives in > >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). 
> >>>>> > >>>>> Do you know anywhere I can read about the ctr issue, possibly the > >>>>> relevant glibc bug report? I'm not particularly familiar with ppc > >>>>> register file (at least I have to refamiliarize myself every time I > >>>>> work on this stuff) so it'd be nice to understand what's > >>>>> potentially-wrong now. > >>>> > >>>> My understanding is the ctr issue only happens for vDSO calls where it > >>>> fallback to a syscall in case an error (invalid argument, etc. and > >>>> assuming if vDSO does not fallback to a syscall it always succeed). > >>>> This makes the vDSO call on powerpc to have same same ABI constraint > >>>> as a syscall, where it clobbers CR0. > >>> > >>> I think you mean "vsyscall", the old thing glibc used where there are > >>> in-userspace implementations of some syscalls with call interfaces > >>> roughly equivalent to a syscall. musl has never used this. It only > >>> uses the actual exported functions from the vdso which have normal > >>> external function call ABI. > >> > >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > >> The issue is indeed when calling the powerpc provided functions in > >> vDSO, which musl might want to do eventually. > > > > AIUI (at least this is true for all other archs) the functions have > > normal external function call ABI and calling them has nothing to do > > with syscall mechanisms. > > My point is powerpc specifically does not follow it, since it issues a > syscall in fallback and its semantic follow kernel syscalls (error > signalled in cr0, r3 being always a positive value): Oh, then I think we'll just ignore these unless the kernel can make ones with a reasonable ABI. It's not worth having ppc-specific code for this... It would be really nice if ones that actually behave like functions could be added though. > -- > V_FUNCTION_BEGIN(__kernel_clock_gettime) > .cfi_startproc > [...] > /* > * syscall fallback > */ > 99: > li r0,__NR_clock_gettime > .cfi_restore lr > sc > blr > .cfi_endproc > V_FUNCTION_END(__kernel_clock_gettime) > > > > > > It looks like we're not using them right now and I'm not sure why. It > > could be that there are ABI mismatch issues (are 32-bit ones > > compatible with secure-plt? are 64-bit ones compatible with ELFv2?) or > > just that nobody proposed adding them. Also as of 5.4 32-bit ppc > > lacked time64 versions of them; not sure if this is fixed yet. > > For 64-bit it also have an issue where vDSO does not provide an OPD > for ELFv1, which has bitten glibc while trying to implement an ifunc > optimization. I don't recall any issue for ELFv2. > > For 32-bit I am not sure secure-plt will change anything, at least not > on powerpc where we use the same strategy for 64-bit and use a > mtctr/bctr directly. Indeed, I don't think there's a secure-plt distinction unless you're making outgoing calls to possibly-cross-DSO functions. 
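For reference, the OPD entries being discussed are just triples of doublewords; the layout sketched below follows the ELFv1 ABI, with field names mirroring the Elf64_FuncDesc used by the glibc hack quoted later in the thread:

#include <elf.h>

/* ELFv1 "official procedure descriptor" (OPD) entry -- sketch for illustration. */
typedef struct {
	Elf64_Addr fd_func;	/* entry-point address of the function */
	Elf64_Addr fd_toc;	/* TOC base the function expects in r2  */
	Elf64_Addr fd_aux;	/* environment pointer, normally unused */
} Elf64_FuncDesc;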
Rich From libc-dev at lists.llvm.org Fri Apr 24 20:30:01 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sat, 25 Apr 2020 13:30:01 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200423023642.GP11469@brightrain.aerifal.cx> References: <20200416175932.GZ11469@brightrain.aerifal.cx> <4f824a37-e660-8912-25aa-fde88d4b79f3@linaro.org> <20200416183151.GA11469@brightrain.aerifal.cx> <1587344003.daumxvs1kh.astroid@bobo.none> <20200420013412.GZ11469@brightrain.aerifal.cx> <1587348538.l1ioqml73m.astroid@bobo.none> <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> Message-ID: <1587784441.81hgf5xa06.astroid@bobo.none> Excerpts from Rich Felker's message of April 23, 2020 12:36 pm: > On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> Yeah I had a bit of a play around with musl (which is very nice code I >> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> Fortunately adding it doesn't change code generation for me, but it >> should be fixed. glibc had the same bug at one point I think (probably >> due to syscall ABI documentation not existing -- something now lives in >> linux/Documentation/powerpc/syscall64-abi.rst). > > Do you know anywhere I can read about the ctr issue, possibly the > relevant glibc bug report? I'm not particularly familiar with ppc > register file (at least I have to refamiliarize myself every time I > work on this stuff) so it'd be nice to understand what's > potentially-wrong now. Ah I was misremembering, glibc was (and still is) actually missing cr clobbers from its "vsyscall", probably because it copied syscall which only clobbers cr0, but vsyscall clobbers cr0-1,5-7 like a normal function call. musl is missing the ctr register clobber from syscalls. powerpc has gpr0-31 GPRs, cr0-7 condition regs, and lr and ctr branch registers (lr is generally used for function returns, ctr for other indirect branches). ctr is volatile (caller saved) across C function calls, and sc system calls on Linux. 
Thanks, Nick From libc-dev at lists.llvm.org Fri Apr 24 20:40:24 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sat, 25 Apr 2020 13:40:24 +1000 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <20200423174214.GZ11469@brightrain.aerifal.cx> References: <20200420040926.GA11469@brightrain.aerifal.cx> <1587356128.aslvdnmtbw.astroid@bobo.none> <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> <20200423164314.GX11469@brightrain.aerifal.cx> <64d82a23-1f6e-2e6a-b7a9-0eeab8a53435@linaro.org> <20200423174214.GZ11469@brightrain.aerifal.cx> Message-ID: <1587785455.59207xhucl.astroid@bobo.none> Excerpts from Rich Felker's message of April 24, 2020 3:42 am: > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: >> >> >> On 23/04/2020 13:43, Rich Felker wrote: >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: >> >> >> >> >> >> On 23/04/2020 13:18, Rich Felker wrote: >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: >> >>>> >> >>>> >> >>>> On 22/04/2020 23:36, Rich Felker wrote: >> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: >> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I >> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. >> >>>>>> Fortunately adding it doesn't change code generation for me, but it >> >>>>>> should be fixed. glibc had the same bug at one point I think (probably >> >>>>>> due to syscall ABI documentation not existing -- something now lives in >> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). >> >>>>> >> >>>>> Do you know anywhere I can read about the ctr issue, possibly the >> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc >> >>>>> register file (at least I have to refamiliarize myself every time I >> >>>>> work on this stuff) so it'd be nice to understand what's >> >>>>> potentially-wrong now. >> >>>> >> >>>> My understanding is the ctr issue only happens for vDSO calls where it >> >>>> fallback to a syscall in case an error (invalid argument, etc. and >> >>>> assuming if vDSO does not fallback to a syscall it always succeed). >> >>>> This makes the vDSO call on powerpc to have same same ABI constraint >> >>>> as a syscall, where it clobbers CR0. >> >>> >> >>> I think you mean "vsyscall", the old thing glibc used where there are >> >>> in-userspace implementations of some syscalls with call interfaces >> >>> roughly equivalent to a syscall. musl has never used this. It only >> >>> uses the actual exported functions from the vdso which have normal >> >>> external function call ABI. >> >> >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. >> >> The issue is indeed when calling the powerpc provided functions in >> >> vDSO, which musl might want to do eventually. >> > >> > AIUI (at least this is true for all other archs) the functions have >> > normal external function call ABI and calling them has nothing to do >> > with syscall mechanisms. 
>> >> My point is powerpc specifically does not follow it, since it issues a >> syscall in fallback and its semantic follow kernel syscalls (error >> signalled in cr0, r3 being always a positive value): > > Oh, then I think we'll just ignore these unless the kernel can make > ones with a reasonable ABI. It's not worth having ppc-specific code > for this... It would be really nice if ones that actually behave like > functions could be added though. Yeah this is an annoyance for me after making the scv ABI return -ve in r3 for error and other things that more closely follow function calls, we still have the vdso functions using the old style. Maybe we should add function call style vdso too. Thanks, Nick From libc-dev at lists.llvm.org Fri Apr 24 21:52:48 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sat, 25 Apr 2020 00:52:48 -0400 Subject: [libc-dev] [musl] Powerpc Linux 'scv' system call ABI proposal take 2 In-Reply-To: <1587785455.59207xhucl.astroid@bobo.none> References: <20200420172715.GC11469@brightrain.aerifal.cx> <1587531042.1qvc287tsc.astroid@bobo.none> <20200423023642.GP11469@brightrain.aerifal.cx> <20200423161841.GU11469@brightrain.aerifal.cx> <3fe73604-7c92-e073-cbe7-abb4a8ae7c1a@linaro.org> <20200423164314.GX11469@brightrain.aerifal.cx> <64d82a23-1f6e-2e6a-b7a9-0eeab8a53435@linaro.org> <20200423174214.GZ11469@brightrain.aerifal.cx> <1587785455.59207xhucl.astroid@bobo.none> Message-ID: <20200425045248.GG11469@brightrain.aerifal.cx> On Sat, Apr 25, 2020 at 01:40:24PM +1000, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 24, 2020 3:42 am: > > On Thu, Apr 23, 2020 at 02:15:58PM -0300, Adhemerval Zanella wrote: > >> > >> > >> On 23/04/2020 13:43, Rich Felker wrote: > >> > On Thu, Apr 23, 2020 at 01:35:01PM -0300, Adhemerval Zanella wrote: > >> >> > >> >> > >> >> On 23/04/2020 13:18, Rich Felker wrote: > >> >>> On Thu, Apr 23, 2020 at 09:13:57AM -0300, Adhemerval Zanella wrote: > >> >>>> > >> >>>> > >> >>>> On 22/04/2020 23:36, Rich Felker wrote: > >> >>>>> On Wed, Apr 22, 2020 at 04:18:36PM +1000, Nicholas Piggin wrote: > >> >>>>>> Yeah I had a bit of a play around with musl (which is very nice code I > >> >>>>>> must say). The powerpc64 syscall asm is missing ctr clobber by the way. > >> >>>>>> Fortunately adding it doesn't change code generation for me, but it > >> >>>>>> should be fixed. glibc had the same bug at one point I think (probably > >> >>>>>> due to syscall ABI documentation not existing -- something now lives in > >> >>>>>> linux/Documentation/powerpc/syscall64-abi.rst). > >> >>>>> > >> >>>>> Do you know anywhere I can read about the ctr issue, possibly the > >> >>>>> relevant glibc bug report? I'm not particularly familiar with ppc > >> >>>>> register file (at least I have to refamiliarize myself every time I > >> >>>>> work on this stuff) so it'd be nice to understand what's > >> >>>>> potentially-wrong now. > >> >>>> > >> >>>> My understanding is the ctr issue only happens for vDSO calls where it > >> >>>> fallback to a syscall in case an error (invalid argument, etc. and > >> >>>> assuming if vDSO does not fallback to a syscall it always succeed). > >> >>>> This makes the vDSO call on powerpc to have same same ABI constraint > >> >>>> as a syscall, where it clobbers CR0. > >> >>> > >> >>> I think you mean "vsyscall", the old thing glibc used where there are > >> >>> in-userspace implementations of some syscalls with call interfaces > >> >>> roughly equivalent to a syscall. musl has never used this. 
It only > >> >>> uses the actual exported functions from the vdso which have normal > >> >>> external function call ABI. > >> >> > >> >> I wasn't thinking in vsyscall in fact, which afaik it is a x86 thing. > >> >> The issue is indeed when calling the powerpc provided functions in > >> >> vDSO, which musl might want to do eventually. > >> > > >> > AIUI (at least this is true for all other archs) the functions have > >> > normal external function call ABI and calling them has nothing to do > >> > with syscall mechanisms. > >> > >> My point is powerpc specifically does not follow it, since it issues a > >> syscall in fallback and its semantic follow kernel syscalls (error > >> signalled in cr0, r3 being always a positive value): > > > > Oh, then I think we'll just ignore these unless the kernel can make > > ones with a reasonable ABI. It's not worth having ppc-specific code > > for this... It would be really nice if ones that actually behave like > > functions could be added though. > > Yeah this is an annoyance for me after making the scv ABI return -ve in > r3 for error and other things that more closely follow function calls, > we still have the vdso functions using the old style. > > Maybe we should add function call style vdso too. Please do. Rich From libc-dev at lists.llvm.org Fri Apr 24 22:22:27 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sat, 25 Apr 2020 15:22:27 +1000 Subject: [libc-dev] New powerpc vdso calling convention Message-ID: <1587790194.w180xsw5be.astroid@bobo.none> As noted in the 'scv' thread, powerpc's vdso calling convention does not match the C ELF ABI calling convention (or the proposed scv convention). I think we could implement a new ABI by basically duplicating function entry points with different names. The ELF v2 ABI convention would suit it well, because the caller already requires the function address for ctr, so having it in r12 will eliminate the need for address calculation, which suits the vdso data page access. Is there a need for ELF v1 specific calls as well, or could those just be deprecated and remain on existing functions or required to use the ELF v2 calls using asm wrappers? Is there a good reason for the system call fallback to go in the vdso function rather than have the caller handle it? Thanks, Nick From libc-dev at lists.llvm.org Fri Apr 24 22:40:19 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sat, 25 Apr 2020 01:40:19 -0400 Subject: [libc-dev] [musl] New powerpc vdso calling convention In-Reply-To: <1587790194.w180xsw5be.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> Message-ID: <20200425054019.GI11469@brightrain.aerifal.cx> On Sat, Apr 25, 2020 at 03:22:27PM +1000, Nicholas Piggin wrote: > As noted in the 'scv' thread, powerpc's vdso calling convention does not > match the C ELF ABI calling convention (or the proposed scv convention). > I think we could implement a new ABI by basically duplicating function > entry points with different names. > > The ELF v2 ABI convention would suit it well, because the caller already > requires the function address for ctr, so having it in r12 will > eliminate the need for address calculation, which suits the vdso data > page access. > > Is there a need for ELF v1 specific calls as well, or could those just be > deprecated and remain on existing functions or required to use the ELF > v2 calls using asm wrappers? musl doesn't use ELFv1, but my expectation would be for the kernel to provide an ELFv1 VDSO to an ELFv1 process. 
(I'm pretty sure the kernel has to be aware of this property of the process-image (executable file) since it affects how signals work.) > Is there a good reason for the system call fallback to go in the vdso > function rather than have the caller handle it? Originally it was deemed the vdso's responsibility to do fallback, but MIPS broke this contract so musl always makes a syscall itself if the vdso function returns -ENOSYS. I believe it honors other errors. We could change it to fallback on all errors if needed. I'm not sure what glibc does here. Rich From libc-dev at lists.llvm.org Sat Apr 25 00:47:08 2020 From: libc-dev at lists.llvm.org (Christophe Leroy via libc-dev) Date: Sat, 25 Apr 2020 09:47:08 +0200 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <1587790194.w180xsw5be.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> Message-ID: <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : > As noted in the 'scv' thread, powerpc's vdso calling convention does not > match the C ELF ABI calling convention (or the proposed scv convention). > I think we could implement a new ABI by basically duplicating function > entry points with different names. I think doing this is a real good idea. I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the main pitfall has been that our vdso calling convention is not compatible with C calling convention, so we have go through an ASM entry/exit. See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 We should kill this error flag return through CR[SO] and get it the "modern" way like other architectectures implementing the C VDSO: return 0 when successfull, return -err when failed. > > The ELF v2 ABI convention would suit it well, because the caller already > requires the function address for ctr, so having it in r12 will > eliminate the need for address calculation, which suits the vdso data > page access. > > Is there a need for ELF v1 specific calls as well, or could those just be > deprecated and remain on existing functions or required to use the ELF > v2 calls using asm wrappers? What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say yes, it would be good to have it to avoid going through ASM in the middle. > > Is there a good reason for the system call fallback to go in the vdso > function rather than have the caller handle it? I've seen at least one while porting powerpc to the C VDSO: arguments toward VDSO functions are in volatile registers. If the caller has to call the fallback by itself, it has to save them before calling the VDSO, allthought in 99% of cases it won't use them again. With the fallback called by the VDSO itself, the arguments are still hot in volatile registers and ready for calling the fallback. 
That make it very easy to call them, see patch 5 in the series (https://patchwork.ozlabs.org/project/linuxppc-dev/patch/59bea35725ab4cefc67a678577da8b3ab7771af5.1587401492.git.christophe.leroy at c-s.fr/) > > Thanks, > Nick > Christophe From libc-dev at lists.llvm.org Sat Apr 25 03:56:54 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sat, 25 Apr 2020 20:56:54 +1000 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> Message-ID: <1587810370.tg8ym9yjpc.astroid@bobo.none> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: > > > Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >> As noted in the 'scv' thread, powerpc's vdso calling convention does not >> match the C ELF ABI calling convention (or the proposed scv convention). >> I think we could implement a new ABI by basically duplicating function >> entry points with different names. > > I think doing this is a real good idea. > > I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the > main pitfall has been that our vdso calling convention is not compatible > with C calling convention, so we have go through an ASM entry/exit. > > See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 > > We should kill this error flag return through CR[SO] and get it the > "modern" way like other architectectures implementing the C VDSO: return > 0 when successfull, return -err when failed. Agreed. >> The ELF v2 ABI convention would suit it well, because the caller already >> requires the function address for ctr, so having it in r12 will >> eliminate the need for address calculation, which suits the vdso data >> page access. >> >> Is there a need for ELF v1 specific calls as well, or could those just be >> deprecated and remain on existing functions or required to use the ELF >> v2 calls using asm wrappers? > > What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say > yes, it would be good to have it to avoid going through ASM in the middle. I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with their address in r12 if called at their global entry point. ELFv1 have a function descriptor with call address and TOC in it, caller has to load the TOC if it's global. The vdso doesn't have TOC, it has one global address (the vdso data page) which it loads by calculating its own address. The kernel doesn't change the vdso based on whether it's called by a v1 or v2 userspace (it doesn't really know itself and would have to export different functions). glibc has a hack to create something: # define VDSO_IFUNC_RET(value) \ ({ \ static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ vdso_opd.fd_func = (Elf64_Addr)value; \ &vdso_opd; \ }) If we could make something which links more like any other dso with ELFv1, that would be good. Otherwise I think v2 is preferable so it doesn't have to calculate its own address. >> Is there a good reason for the system call fallback to go in the vdso >> function rather than have the caller handle it? > > I've seen at least one while porting powerpc to the C VDSO: arguments > toward VDSO functions are in volatile registers. If the caller has to > call the fallback by itself, it has to save them before calling the > VDSO, allthought in 99% of cases it won't use them again. 
With the > fallback called by the VDSO itself, the arguments are still hot in > volatile registers and ready for calling the fallback. That make it very > easy to call them, see patch 5 in the series > (https://patchwork.ozlabs.org/project/linuxppc-dev/patch/59bea35725ab4cefc67a678577da8b3ab7771af5.1587401492.git.christophe.leroy at c-s.fr/) I see. Well the kernel can probably patch in sc or scv depending on which is supported, so we could keep the automatic fallback. Thanks, Nick From libc-dev at lists.llvm.org Sat Apr 25 05:20:45 2020 From: libc-dev at lists.llvm.org (Christophe Leroy via libc-dev) Date: Sat, 25 Apr 2020 14:20:45 +0200 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <1587810370.tg8ym9yjpc.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> Message-ID: <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : > Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >> >> >> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >>> match the C ELF ABI calling convention (or the proposed scv convention). >>> I think we could implement a new ABI by basically duplicating function >>> entry points with different names. >> >> I think doing this is a real good idea. >> >> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >> main pitfall has been that our vdso calling convention is not compatible >> with C calling convention, so we have go through an ASM entry/exit. >> >> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >> >> We should kill this error flag return through CR[SO] and get it the >> "modern" way like other architectectures implementing the C VDSO: return >> 0 when successfull, return -err when failed. > > Agreed. > >>> The ELF v2 ABI convention would suit it well, because the caller already >>> requires the function address for ctr, so having it in r12 will >>> eliminate the need for address calculation, which suits the vdso data >>> page access. >>> >>> Is there a need for ELF v1 specific calls as well, or could those just be >>> deprecated and remain on existing functions or required to use the ELF >>> v2 calls using asm wrappers? >> >> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >> yes, it would be good to have it to avoid going through ASM in the middle. > > I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with > their address in r12 if called at their global entry point. ELFv1 have a > function descriptor with call address and TOC in it, caller has to load > the TOC if it's global. > > The vdso doesn't have TOC, it has one global address (the vdso data > page) which it loads by calculating its own address. > > The kernel doesn't change the vdso based on whether it's called by a v1 > or v2 userspace (it doesn't really know itself and would have to export > different functions). glibc has a hack to create something: > > # define VDSO_IFUNC_RET(value) \ > ({ \ > static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ > vdso_opd.fd_func = (Elf64_Addr)value; \ > &vdso_opd; \ > }) > > If we could make something which links more like any other dso with > ELFv1, that would be good. Otherwise I think v2 is preferable so it > doesn't have to calculate its own address. I see the following in glibc. 
So looks like PPC32 is like PPC64 elfv1. By the way, they are talking about something not completely finished in the kernel. Can we finish it ? #if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2 /* The correct solution is for _dl_vdso_vsym to return the address of the OPD for the kernel VDSO function. That address would then be stored in the __vdso_* variables and returned as the result of the IFUNC resolver function. Yet, the kernel does not contain any OPD entries for the VDSO functions (incomplete implementation). However, PLT relocations for IFUNCs still expect the address of an OPD to be returned from the IFUNC resolver function (since PLT entries on PPC64 are just copies of OPDs). The solution for now is to create an artificial static OPD for each VDSO function returned by a resolver function. The TOC value is set to a non-zero value to avoid triggering lazy symbol resolution via .glink0/.plt0 for a zero TOC (requires thread-safe PLT sequences) when the dynamic linker isn't prepared for it e.g. RTLD_NOW. None of the kernel VDSO routines use the TOC or AUX values so any non-zero value will work. Note that function pointer comparisons will not use this artificial static OPD since those are resolved via ADDR64 relocations and will point at the non-IFUNC default OPD for the symbol. Lastly, because the IFUNC relocations are processed immediately at startup the resolver functions and this code need not be thread-safe, but if the caller writes to a PLT slot it must do so in a thread-safe manner with all the required barriers. */ #define VDSO_IFUNC_RET(value) \ ({ \ static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ vdso_opd.fd_func = (Elf64_Addr)value; \ &vdso_opd; \ }) #else #define VDSO_IFUNC_RET(value) ((void *) (value)) #endif Christophe From libc-dev at lists.llvm.org Sat Apr 25 09:22:04 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sat, 25 Apr 2020 12:22:04 -0400 Subject: [libc-dev] [musl] Re: New powerpc vdso calling convention In-Reply-To: <1587810370.tg8ym9yjpc.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> Message-ID: <20200425162204.GJ11469@brightrain.aerifal.cx> On Sat, Apr 25, 2020 at 08:56:54PM +1000, Nicholas Piggin wrote: > >> The ELF v2 ABI convention would suit it well, because the caller already > >> requires the function address for ctr, so having it in r12 will > >> eliminate the need for address calculation, which suits the vdso data > >> page access. > >> > >> Is there a need for ELF v1 specific calls as well, or could those just be > >> deprecated and remain on existing functions or required to use the ELF > >> v2 calls using asm wrappers? > > > > What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say > > yes, it would be good to have it to avoid going through ASM in the middle.. > > I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with > their address in r12 if called at their global entry point. ELFv1 have a > function descriptor with call address and TOC in it, caller has to load > the TOC if it's global. > > The vdso doesn't have TOC, it has one global address (the vdso data > page) which it loads by calculating its own address. A function descriptor could be put in the VDSO data page, or as it's done now by glibc the vdso linkage code could create it. 
My leaning is to at least have a version of the code that's callable (with the right descriptor around it) by v1 binaries, but since musl does not use ELFv1 at all we really have no stake in this and I'm fine with whatever outcome users of v1 decide on. > The kernel doesn't change the vdso based on whether it's called by a v1 > or v2 userspace (it doesn't really know itself and would have to export > different functions). glibc has a hack to create something: I'm pretty sure it does know because signal invocation has to know whether the function pointer points to a descriptor or code. At least for FDPIC archs (similar to PPC64 ELFv1 function descriptors) it knows and has to know. > >> Is there a good reason for the system call fallback to go in the vdso > >> function rather than have the caller handle it? > > > > I've seen at least one while porting powerpc to the C VDSO: arguments > > toward VDSO functions are in volatile registers. If the caller has to > > call the fallback by itself, it has to save them before calling the > > VDSO, allthought in 99% of cases it won't use them again. With the > > fallback called by the VDSO itself, the arguments are still hot in > > volatile registers and ready for calling the fallback. That make it very > > easy to call them, see patch 5 in the series > > (https://patchwork.ozlabs.org/project/linuxppc-dev/patch/59bea35725ab4cefc67a678577da8b3ab7771af5.1587401492.git.christophe.leroy at c-s.fr/) This is actually a good reason not to spuriously fail and fallback. At present musl wouldn't take advantage of it because musl uses the fallback path for lazy initialization of the vdso function pointer and doesn't special-case the MIPS badness, but if it made a big difference we probably could shuffle things around to only do the fallback on archs that need it and avoid saving the input arg registers across the vdso call. Rich From libc-dev at lists.llvm.org Sat Apr 25 15:58:19 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sun, 26 Apr 2020 08:58:19 +1000 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> Message-ID: <1587855423.jug0f1n0b8.astroid@bobo.none> Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: > > > Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : >> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >>> >>> >>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >>>> match the C ELF ABI calling convention (or the proposed scv convention). >>>> I think we could implement a new ABI by basically duplicating function >>>> entry points with different names. >>> >>> I think doing this is a real good idea. >>> >>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >>> main pitfall has been that our vdso calling convention is not compatible >>> with C calling convention, so we have go through an ASM entry/exit. >>> >>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >>> >>> We should kill this error flag return through CR[SO] and get it the >>> "modern" way like other architectectures implementing the C VDSO: return >>> 0 when successfull, return -err when failed. >> >> Agreed. 
>> >>>> The ELF v2 ABI convention would suit it well, because the caller already >>>> requires the function address for ctr, so having it in r12 will >>>> eliminate the need for address calculation, which suits the vdso data >>>> page access. >>>> >>>> Is there a need for ELF v1 specific calls as well, or could those just be >>>> deprecated and remain on existing functions or required to use the ELF >>>> v2 calls using asm wrappers? >>> >>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >>> yes, it would be good to have it to avoid going through ASM in the middle. >> >> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >> their address in r12 if called at their global entry point. ELFv1 have a >> function descriptor with call address and TOC in it, caller has to load >> the TOC if it's global. >> >> The vdso doesn't have TOC, it has one global address (the vdso data >> page) which it loads by calculating its own address. >> >> The kernel doesn't change the vdso based on whether it's called by a v1 >> or v2 userspace (it doesn't really know itself and would have to export >> different functions). glibc has a hack to create something: >> >> # define VDSO_IFUNC_RET(value) \ >> ({ \ >> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ >> vdso_opd.fd_func = (Elf64_Addr)value; \ >> &vdso_opd; \ >> }) >> >> If we could make something which links more like any other dso with >> ELFv1, that would be good. Otherwise I think v2 is preferable so it >> doesn't have to calculate its own address. > > I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. > By the way, they are talking about something not completely finished in > the kernel. Can we finish it ? Possibly can. It seems like a good idea to fix all loose ends if we are going to add new versions. Will have to check with the toolchain people to make sure we're doing the right thing. Thanks, Nick From libc-dev at lists.llvm.org Sat Apr 25 16:11:19 2020 From: libc-dev at lists.llvm.org (Rich Felker via libc-dev) Date: Sat, 25 Apr 2020 19:11:19 -0400 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <1587855423.jug0f1n0b8.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> <1587855423.jug0f1n0b8.astroid@bobo.none> Message-ID: <20200425231119.GM11469@brightrain.aerifal.cx> On Sun, Apr 26, 2020 at 08:58:19AM +1000, Nicholas Piggin wrote: > Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: > > > > > > Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : > >> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: > >>> > >>> > >>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : > >>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not > >>>> match the C ELF ABI calling convention (or the proposed scv convention). > >>>> I think we could implement a new ABI by basically duplicating function > >>>> entry points with different names. > >>> > >>> I think doing this is a real good idea. > >>> > >>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the > >>> main pitfall has been that our vdso calling convention is not compatible > >>> with C calling convention, so we have go through an ASM entry/exit. 
> >>> > >>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 > >>> > >>> We should kill this error flag return through CR[SO] and get it the > >>> "modern" way like other architectectures implementing the C VDSO: return > >>> 0 when successfull, return -err when failed. > >> > >> Agreed. > >> > >>>> The ELF v2 ABI convention would suit it well, because the caller already > >>>> requires the function address for ctr, so having it in r12 will > >>>> eliminate the need for address calculation, which suits the vdso data > >>>> page access. > >>>> > >>>> Is there a need for ELF v1 specific calls as well, or could those just be > >>>> deprecated and remain on existing functions or required to use the ELF > >>>> v2 calls using asm wrappers? > >>> > >>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say > >>> yes, it would be good to have it to avoid going through ASM in the middle. > >> > >> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with > >> their address in r12 if called at their global entry point. ELFv1 have a > >> function descriptor with call address and TOC in it, caller has to load > >> the TOC if it's global. > >> > >> The vdso doesn't have TOC, it has one global address (the vdso data > >> page) which it loads by calculating its own address. > >> > >> The kernel doesn't change the vdso based on whether it's called by a v1 > >> or v2 userspace (it doesn't really know itself and would have to export > >> different functions). glibc has a hack to create something: > >> > >> # define VDSO_IFUNC_RET(value) \ > >> ({ \ > >> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ > >> vdso_opd.fd_func = (Elf64_Addr)value; \ > >> &vdso_opd; \ > >> }) > >> > >> If we could make something which links more like any other dso with > >> ELFv1, that would be good. Otherwise I think v2 is preferable so it > >> doesn't have to calculate its own address. > > > > I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. > > By the way, they are talking about something not completely finished in > > the kernel. Can we finish it ? > > Possibly can. It seems like a good idea to fix all loose ends if we are > going to add new versions. Will have to check with the toolchain people > to make sure we're doing the right thing. "ELFv1" and "ELFv2" are PPC64-specific names for the old and new version of the ELF psABI for PPC64. They have nothing at all to do with PPC32 which is a completely different ABI from either. Rich From libc-dev at lists.llvm.org Sat Apr 25 16:07:57 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sun, 26 Apr 2020 09:07:57 +1000 Subject: [libc-dev] [musl] Re: New powerpc vdso calling convention In-Reply-To: <20200425162204.GJ11469@brightrain.aerifal.cx> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <20200425162204.GJ11469@brightrain.aerifal.cx> Message-ID: <1587855503.8grsasuwof.astroid@bobo.none> Excerpts from Rich Felker's message of April 26, 2020 2:22 am: > On Sat, Apr 25, 2020 at 08:56:54PM +1000, Nicholas Piggin wrote: >> >> The ELF v2 ABI convention would suit it well, because the caller already >> >> requires the function address for ctr, so having it in r12 will >> >> eliminate the need for address calculation, which suits the vdso data >> >> page access. 
>> >> >> >> Is there a need for ELF v1 specific calls as well, or could those just be >> >> deprecated and remain on existing functions or required to use the ELF >> >> v2 calls using asm wrappers? >> > >> > What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >> > yes, it would be good to have it to avoid going through ASM in the middle.. >> >> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >> their address in r12 if called at their global entry point. ELFv1 have a >> function descriptor with call address and TOC in it, caller has to load >> the TOC if it's global. >> >> The vdso doesn't have TOC, it has one global address (the vdso data >> page) which it loads by calculating its own address. > > A function descriptor could be put in the VDSO data page, or as it's > done now by glibc the vdso linkage code could create it. My leaning is > to at least have a version of the code that's callable (with the right > descriptor around it) by v1 binaries, but since musl does not use > ELFv1 at all we really have no stake in this and I'm fine with > whatever outcome users of v1 decide on. I agree, I think it would be good to make it look as much like a normal function as possible. >> The kernel doesn't change the vdso based on whether it's called by a v1 >> or v2 userspace (it doesn't really know itself and would have to export >> different functions). glibc has a hack to create something: > > I'm pretty sure it does know because signal invocation has to know > whether the function pointer points to a descriptor or code. At least > for FDPIC archs (similar to PPC64 ELFv1 function descriptors) it knows > and has to know. It knows on a per-executable basis (by looking at the ELF header). It doesn't know per-system though so we can't patch the vdso accordingly. But we could include both sets of entry points and map in the appropriate one at exec time I think. >> >> Is there a good reason for the system call fallback to go in the vdso >> >> function rather than have the caller handle it? >> > >> > I've seen at least one while porting powerpc to the C VDSO: arguments >> > toward VDSO functions are in volatile registers. If the caller has to >> > call the fallback by itself, it has to save them before calling the >> > VDSO, allthought in 99% of cases it won't use them again. With the >> > fallback called by the VDSO itself, the arguments are still hot in >> > volatile registers and ready for calling the fallback. That make it very >> > easy to call them, see patch 5 in the series >> > (https://patchwork.ozlabs.org/project/linuxppc-dev/patch/59bea35725ab4cefc67a678577da8b3ab7771af5.1587401492.git.christophe.leroy at c-s.fr/) > > This is actually a good reason not to spuriously fail and fallback. At > present musl wouldn't take advantage of it because musl uses the > fallback path for lazy initialization of the vdso function pointer and > doesn't special-case the MIPS badness, but if it made a big difference > we probably could shuffle things around to only do the fallback on > archs that need it and avoid saving the input arg registers across the > vdso call. It's a point for it yes. I don't know if any libc or app would want to instrument it or do special accounting or something for system calls. 
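Concretely, the caller-side variant being weighed here looks something like the sketch below (illustrative only; my_clock_gettime, clock_gettime_fallback and the vdso_clock_gettime pointer are invented names, and the pointer is assumed to be filled in by whatever vDSO lookup the libc does at startup). The cost Christophe describes is visible in it: clk and ts have to stay live across the vDSO call so they can be handed to the fallback.

#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Filled in at startup from the vDSO lookup (assumed to happen elsewhere). */
static long (*vdso_clock_gettime) (clockid_t, struct timespec *);

/* Real system call, normalized to the 0-or-negative-errno convention. */
static long clock_gettime_fallback (clockid_t clk, struct timespec *ts)
{
  return syscall (SYS_clock_gettime, clk, ts) == -1 ? -errno : 0;
}

int my_clock_gettime (clockid_t clk, struct timespec *ts)
{
  long ret = -ENOSYS;

  if (vdso_clock_gettime)      /* C-ABI vDSO entry: a plain function call. */
    ret = vdso_clock_gettime (clk, ts);

  if (ret == -ENOSYS)          /* Caller-side fallback: clk and ts must still
                                  be available after the vDSO call returns. */
    ret = clock_gettime_fallback (clk, ts);

  if (ret < 0)
    {
      errno = (int) -ret;
      return -1;
    }
  return 0;
}

If the fallback lives inside the vDSO function instead, the wrapper shrinks to the single indirect call and none of the argument juggling is needed.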
Thanks, Nick From libc-dev at lists.llvm.org Sat Apr 25 20:41:08 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Sun, 26 Apr 2020 13:41:08 +1000 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <20200425231119.GM11469@brightrain.aerifal.cx> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> <1587855423.jug0f1n0b8.astroid@bobo.none> <20200425231119.GM11469@brightrain.aerifal.cx> Message-ID: <1587872025.rtx2ygrmn0.astroid@bobo.none> Excerpts from Rich Felker's message of April 26, 2020 9:11 am: > On Sun, Apr 26, 2020 at 08:58:19AM +1000, Nicholas Piggin wrote: >> Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: >> > >> > >> > Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : >> >> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >> >>> >> >>> >> >>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >> >>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >> >>>> match the C ELF ABI calling convention (or the proposed scv convention). >> >>>> I think we could implement a new ABI by basically duplicating function >> >>>> entry points with different names. >> >>> >> >>> I think doing this is a real good idea. >> >>> >> >>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >> >>> main pitfall has been that our vdso calling convention is not compatible >> >>> with C calling convention, so we have go through an ASM entry/exit. >> >>> >> >>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >> >>> >> >>> We should kill this error flag return through CR[SO] and get it the >> >>> "modern" way like other architectectures implementing the C VDSO: return >> >>> 0 when successfull, return -err when failed. >> >> >> >> Agreed. >> >> >> >>>> The ELF v2 ABI convention would suit it well, because the caller already >> >>>> requires the function address for ctr, so having it in r12 will >> >>>> eliminate the need for address calculation, which suits the vdso data >> >>>> page access. >> >>>> >> >>>> Is there a need for ELF v1 specific calls as well, or could those just be >> >>>> deprecated and remain on existing functions or required to use the ELF >> >>>> v2 calls using asm wrappers? >> >>> >> >>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >> >>> yes, it would be good to have it to avoid going through ASM in the middle. >> >> >> >> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >> >> their address in r12 if called at their global entry point. ELFv1 have a >> >> function descriptor with call address and TOC in it, caller has to load >> >> the TOC if it's global. >> >> >> >> The vdso doesn't have TOC, it has one global address (the vdso data >> >> page) which it loads by calculating its own address. >> >> >> >> The kernel doesn't change the vdso based on whether it's called by a v1 >> >> or v2 userspace (it doesn't really know itself and would have to export >> >> different functions). glibc has a hack to create something: >> >> >> >> # define VDSO_IFUNC_RET(value) \ >> >> ({ \ >> >> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ >> >> vdso_opd.fd_func = (Elf64_Addr)value; \ >> >> &vdso_opd; \ >> >> }) >> >> >> >> If we could make something which links more like any other dso with >> >> ELFv1, that would be good. 
Otherwise I think v2 is preferable so it >> >> doesn't have to calculate its own address. >> > >> > I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. >> > By the way, they are talking about something not completely finished in >> > the kernel. Can we finish it ? >> >> Possibly can. It seems like a good idea to fix all loose ends if we are >> going to add new versions. Will have to check with the toolchain people >> to make sure we're doing the right thing. > > "ELFv1" and "ELFv2" are PPC64-specific names for the old and new > version of the ELF psABI for PPC64. They have nothing at all to do > with PPC32 which is a completely different ABI from either. Right, I'm just talking about those comments -- it seems like the kernel vdso should contain an .opd section with function descriptors in it for elfv1 calls, rather than the hack it has now of creating one in the caller's .data section. But all that function descriptor code is gated by #if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2 So it seems PPC32 does not use function descriptors but a direct pointer to the entry point like PPC64 with ELFv2. Thanks, Nick From libc-dev at lists.llvm.org Mon Apr 27 06:09:20 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Mon, 27 Apr 2020 10:09:20 -0300 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <1587872025.rtx2ygrmn0.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> <1587855423.jug0f1n0b8.astroid@bobo.none> <20200425231119.GM11469@brightrain.aerifal.cx> <1587872025.rtx2ygrmn0.astroid@bobo.none> Message-ID: On 26/04/2020 00:41, Nicholas Piggin wrote: > Excerpts from Rich Felker's message of April 26, 2020 9:11 am: >> On Sun, Apr 26, 2020 at 08:58:19AM +1000, Nicholas Piggin wrote: >>> Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: >>>> >>>> >>>> Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : >>>>> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >>>>>> >>>>>> >>>>>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >>>>>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >>>>>>> match the C ELF ABI calling convention (or the proposed scv convention). >>>>>>> I think we could implement a new ABI by basically duplicating function >>>>>>> entry points with different names. >>>>>> >>>>>> I think doing this is a real good idea. >>>>>> >>>>>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >>>>>> main pitfall has been that our vdso calling convention is not compatible >>>>>> with C calling convention, so we have go through an ASM entry/exit. >>>>>> >>>>>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >>>>>> >>>>>> We should kill this error flag return through CR[SO] and get it the >>>>>> "modern" way like other architectectures implementing the C VDSO: return >>>>>> 0 when successfull, return -err when failed. >>>>> >>>>> Agreed. >>>>> >>>>>>> The ELF v2 ABI convention would suit it well, because the caller already >>>>>>> requires the function address for ctr, so having it in r12 will >>>>>>> eliminate the need for address calculation, which suits the vdso data >>>>>>> page access. 
>>>>>>> >>>>>>> Is there a need for ELF v1 specific calls as well, or could those just be >>>>>>> deprecated and remain on existing functions or required to use the ELF >>>>>>> v2 calls using asm wrappers? >>>>>> >>>>>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >>>>>> yes, it would be good to have it to avoid going through ASM in the middle. >>>>> >>>>> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >>>>> their address in r12 if called at their global entry point. ELFv1 have a >>>>> function descriptor with call address and TOC in it, caller has to load >>>>> the TOC if it's global. >>>>> >>>>> The vdso doesn't have TOC, it has one global address (the vdso data >>>>> page) which it loads by calculating its own address. >>>>> >>>>> The kernel doesn't change the vdso based on whether it's called by a v1 >>>>> or v2 userspace (it doesn't really know itself and would have to export >>>>> different functions). glibc has a hack to create something: >>>>> >>>>> # define VDSO_IFUNC_RET(value) \ >>>>> ({ \ >>>>> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ >>>>> vdso_opd.fd_func = (Elf64_Addr)value; \ >>>>> &vdso_opd; \ >>>>> }) >>>>> >>>>> If we could make something which links more like any other dso with >>>>> ELFv1, that would be good. Otherwise I think v2 is preferable so it >>>>> doesn't have to calculate its own address. >>>> >>>> I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. >>>> By the way, they are talking about something not completely finished in >>>> the kernel. Can we finish it ? >>> >>> Possibly can. It seems like a good idea to fix all loose ends if we are >>> going to add new versions. Will have to check with the toolchain people >>> to make sure we're doing the right thing. >> >> "ELFv1" and "ELFv2" are PPC64-specific names for the old and new >> version of the ELF psABI for PPC64. They have nothing at all to do >> with PPC32 which is a completely different ABI from either. > > Right, I'm just talking about those comments -- it seems like the kernel > vdso should contain an .opd section with function descriptors in it for > elfv1 calls, rather than the hack it has now of creating one in the > caller's .data section. > > But all that function descriptor code is gated by > > #if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2 > > So it seems PPC32 does not use function descriptors but a direct pointer > to the entry point like PPC64 with ELFv2. Yes, this hack is only for ELFv1. The missing ODP has not been an issue or glibc because it has been using the inline assembly to emulate the functions call since initial vDSO support (INTERNAL_VSYSCALL_CALL_TYPE). It just has become an issue when I added a ifunc optimization to gettimeofday so it can bypass the libc.so and make plt branch to vDSO directly. Recently on some y2038 refactoring it was suggested to get rid of this and make gettimeofday call clock_gettime regardless. But some felt that the performance degradation was not worth for a symbol that is still used extensibility, so we stuck with the hack. And I think having this synthetic opd entry is not an issue, since for full relro the program's will be used and correctly set as read-only. The issue is more for glibc itself, and I wouldn't mind to just remove the gettimeofday and time optimizations and use the default vDSO support (which might increase in function latency though). 
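For reference, the shape of that ifunc optimization is roughly the following (a sketch, not the actual glibc code: get_vdso_symbol is a stand-in for the real vDSO lookup, my_gettimeofday stands in for the exported symbol, and Elf64_FuncDesc mirrors glibc's internal descriptor layout). The resolver hands the PLT either the vDSO entry, wrapped in the synthetic OPD on ELFv1, or a syscall-based fallback.

#include <elf.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <unistd.h>

#if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2
typedef struct
{
  Elf64_Addr fd_func;
  Elf64_Addr fd_toc;
  Elf64_Addr fd_aux;
} Elf64_FuncDesc;
# define VDSO_IFUNC_RET(value)                                  \
  ({                                                            \
    static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 };        \
    vdso_opd.fd_func = (Elf64_Addr) (value);                    \
    &vdso_opd;                                                  \
  })
#else
# define VDSO_IFUNC_RET(value) ((void *) (value))
#endif

/* Stand-in for the real lookup, which would walk the ELF image found via
   getauxval (AT_SYSINFO_EHDR).  Returning NULL forces the fallback here.  */
static void *get_vdso_symbol (const char *name)
{
  (void) name;
  return NULL;
}

static int gettimeofday_syscall (struct timeval *tv, struct timezone *tz)
{
  return syscall (SYS_gettimeofday, tv, tz);
}

static void *gettimeofday_resolver (void)
{
  void *vdso = get_vdso_symbol ("__kernel_gettimeofday");
  return vdso != NULL ? VDSO_IFUNC_RET (vdso) : (void *) gettimeofday_syscall;
}

/* PLT entries for my_gettimeofday branch straight to whatever the resolver
   returned, which is the bypass-libc.so optimization described above.  */
int my_gettimeofday (struct timeval *tv, struct timezone *tz)
  __attribute__ ((ifunc ("gettimeofday_resolver")));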
As Rich has put, it would be simpler to just have powerpc vDSO symbols to have a default function call semantic so we could issue a function call directly. But for powerpc64, we glibc will need to continue to support this non-standard call on older kernels and I am not sure if adding new symbols with a different semantic will help much. GLibc already hides this powerpc semantic on INTERNAL_VSYSCALL_CALL_TYPE, so internally all syscalls are assumed to have the new semantic (-errno on error, 0 on success). Adding another ELFv1 would require to add more logic to handle multiple symbol version for vDSO setup (sysdeps/unix/sysv/linux/dl-vdso-setup.h), which would mostly likely to require an arch-specific implementation to handle it. From libc-dev at lists.llvm.org Tue Apr 28 19:39:22 2020 From: libc-dev at lists.llvm.org (Nicholas Piggin via libc-dev) Date: Wed, 29 Apr 2020 12:39:22 +1000 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> <1587855423.jug0f1n0b8.astroid@bobo.none> <20200425231119.GM11469@brightrain.aerifal.cx> <1587872025.rtx2ygrmn0.astroid@bobo.none> Message-ID: <1588126678.zjwj4d1d90.astroid@bobo.none> Excerpts from Adhemerval Zanella's message of April 27, 2020 11:09 pm: > > > On 26/04/2020 00:41, Nicholas Piggin wrote: >> Excerpts from Rich Felker's message of April 26, 2020 9:11 am: >>> On Sun, Apr 26, 2020 at 08:58:19AM +1000, Nicholas Piggin wrote: >>>> Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: >>>>> >>>>> >>>>> Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : >>>>>> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >>>>>>> >>>>>>> >>>>>>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >>>>>>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >>>>>>>> match the C ELF ABI calling convention (or the proposed scv convention). >>>>>>>> I think we could implement a new ABI by basically duplicating function >>>>>>>> entry points with different names. >>>>>>> >>>>>>> I think doing this is a real good idea. >>>>>>> >>>>>>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >>>>>>> main pitfall has been that our vdso calling convention is not compatible >>>>>>> with C calling convention, so we have go through an ASM entry/exit. >>>>>>> >>>>>>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >>>>>>> >>>>>>> We should kill this error flag return through CR[SO] and get it the >>>>>>> "modern" way like other architectectures implementing the C VDSO: return >>>>>>> 0 when successfull, return -err when failed. >>>>>> >>>>>> Agreed. >>>>>> >>>>>>>> The ELF v2 ABI convention would suit it well, because the caller already >>>>>>>> requires the function address for ctr, so having it in r12 will >>>>>>>> eliminate the need for address calculation, which suits the vdso data >>>>>>>> page access. >>>>>>>> >>>>>>>> Is there a need for ELF v1 specific calls as well, or could those just be >>>>>>>> deprecated and remain on existing functions or required to use the ELF >>>>>>>> v2 calls using asm wrappers? >>>>>>> >>>>>>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >>>>>>> yes, it would be good to have it to avoid going through ASM in the middle. >>>>>> >>>>>> I'm not sure about PPC32. 
On PPC64, ELFv2 functions must be called with >>>>>> their address in r12 if called at their global entry point. ELFv1 have a >>>>>> function descriptor with call address and TOC in it, caller has to load >>>>>> the TOC if it's global. >>>>>> >>>>>> The vdso doesn't have TOC, it has one global address (the vdso data >>>>>> page) which it loads by calculating its own address. >>>>>> >>>>>> The kernel doesn't change the vdso based on whether it's called by a v1 >>>>>> or v2 userspace (it doesn't really know itself and would have to export >>>>>> different functions). glibc has a hack to create something: >>>>>> >>>>>> # define VDSO_IFUNC_RET(value) \ >>>>>> ({ \ >>>>>> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ >>>>>> vdso_opd.fd_func = (Elf64_Addr)value; \ >>>>>> &vdso_opd; \ >>>>>> }) >>>>>> >>>>>> If we could make something which links more like any other dso with >>>>>> ELFv1, that would be good. Otherwise I think v2 is preferable so it >>>>>> doesn't have to calculate its own address. >>>>> >>>>> I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. >>>>> By the way, they are talking about something not completely finished in >>>>> the kernel. Can we finish it ? >>>> >>>> Possibly can. It seems like a good idea to fix all loose ends if we are >>>> going to add new versions. Will have to check with the toolchain people >>>> to make sure we're doing the right thing. >>> >>> "ELFv1" and "ELFv2" are PPC64-specific names for the old and new >>> version of the ELF psABI for PPC64. They have nothing at all to do >>> with PPC32 which is a completely different ABI from either. >> >> Right, I'm just talking about those comments -- it seems like the kernel >> vdso should contain an .opd section with function descriptors in it for >> elfv1 calls, rather than the hack it has now of creating one in the >> caller's .data section. >> >> But all that function descriptor code is gated by >> >> #if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2 >> >> So it seems PPC32 does not use function descriptors but a direct pointer >> to the entry point like PPC64 with ELFv2. > > Yes, this hack is only for ELFv1. The missing ODP has not been an issue > or glibc because it has been using the inline assembly to emulate the > functions call since initial vDSO support (INTERNAL_VSYSCALL_CALL_TYPE). > It just has become an issue when I added a ifunc optimization to > gettimeofday so it can bypass the libc.so and make plt branch to vDSO > directly. I can't understand if it's actually a problem for you or not. Regardless if you can hack around it, it seems to me that if we're going to add sane calling conventions to the vdso, then we should also just have a .opd section for it as well, whether or not a particular libc requires it. > Recently on some y2038 refactoring it was suggested to get rid of this > and make gettimeofday call clock_gettime regardless. But some felt that > the performance degradation was not worth for a symbol that is still used > extensibility, so we stuck with the hack. > > And I think having this synthetic opd entry is not an issue, since for > full relro the program's will be used and correctly set as read-only. I'm not quite sure what this means, I don't really know how glibc ifunc works. How do you set r2 if you have no opd? > The issue is more for glibc itself, and I wouldn't mind to just remove the > gettimeofday and time optimizations and use the default vDSO support > (which might increase in function latency though). 
> > As Rich has put, it would be simpler to just have powerpc vDSO symbols > to have a default function call semantic so we could issue a function > call directly. But for powerpc64, we glibc will need to continue to > support this non-standard call on older kernels and I am not sure if > adding new symbols with a different semantic will help much. Yeah, we will add entry points with default function call semantics. At which point we make the things look like any other dso unless there is good reason otherwise. > GLibc already hides this powerpc semantic on INTERNAL_VSYSCALL_CALL_TYPE, > so internally all syscalls are assumed to have the new semantic (-errno > on error, 0 on success). Adding another ELFv1 would require to add > more logic to handle multiple symbol version for vDSO setup > (sysdeps/unix/sysv/linux/dl-vdso-setup.h), which would mostly likely to > require an arch-specific implementation to handle it. Is it not a build-time choice? The arch can set its own vdso symbol names AFAIKS. Thanks, Nick From libc-dev at lists.llvm.org Wed Apr 29 05:15:54 2020 From: libc-dev at lists.llvm.org (Adhemerval Zanella via libc-dev) Date: Wed, 29 Apr 2020 09:15:54 -0300 Subject: [libc-dev] New powerpc vdso calling convention In-Reply-To: <1588126678.zjwj4d1d90.astroid@bobo.none> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <976551e8-229e-54c1-8fb2-c5df94b979c3@c-s.fr> <1587855423.jug0f1n0b8.astroid@bobo.none> <20200425231119.GM11469@brightrain.aerifal.cx> <1587872025.rtx2ygrmn0.astroid@bobo.none> <1588126678.zjwj4d1d90.astroid@bobo.none> Message-ID: On 28/04/2020 23:39, Nicholas Piggin wrote: > Excerpts from Adhemerval Zanella's message of April 27, 2020 11:09 pm: >> >> >> On 26/04/2020 00:41, Nicholas Piggin wrote: >>> Excerpts from Rich Felker's message of April 26, 2020 9:11 am: >>>> On Sun, Apr 26, 2020 at 08:58:19AM +1000, Nicholas Piggin wrote: >>>>> Excerpts from Christophe Leroy's message of April 25, 2020 10:20 pm: >>>>>> >>>>>> >>>>>> Le 25/04/2020 à 12:56, Nicholas Piggin a écrit : >>>>>>> Excerpts from Christophe Leroy's message of April 25, 2020 5:47 pm: >>>>>>>> >>>>>>>> >>>>>>>> Le 25/04/2020 à 07:22, Nicholas Piggin a écrit : >>>>>>>>> As noted in the 'scv' thread, powerpc's vdso calling convention does not >>>>>>>>> match the C ELF ABI calling convention (or the proposed scv convention). >>>>>>>>> I think we could implement a new ABI by basically duplicating function >>>>>>>>> entry points with different names. >>>>>>>> >>>>>>>> I think doing this is a real good idea. >>>>>>>> >>>>>>>> I've been working at porting powerpc VDSO to the GENERIC C VDSO, and the >>>>>>>> main pitfall has been that our vdso calling convention is not compatible >>>>>>>> with C calling convention, so we have go through an ASM entry/exit. >>>>>>>> >>>>>>>> See https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=171469 >>>>>>>> >>>>>>>> We should kill this error flag return through CR[SO] and get it the >>>>>>>> "modern" way like other architectectures implementing the C VDSO: return >>>>>>>> 0 when successfull, return -err when failed. >>>>>>> >>>>>>> Agreed. >>>>>>> >>>>>>>>> The ELF v2 ABI convention would suit it well, because the caller already >>>>>>>>> requires the function address for ctr, so having it in r12 will >>>>>>>>> eliminate the need for address calculation, which suits the vdso data >>>>>>>>> page access. 
>>>>>>>>> >>>>>>>>> Is there a need for ELF v1 specific calls as well, or could those just be >>>>>>>>> deprecated and remain on existing functions or required to use the ELF >>>>>>>>> v2 calls using asm wrappers? >>>>>>>> >>>>>>>> What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >>>>>>>> yes, it would be good to have it to avoid going through ASM in the middle. >>>>>>> >>>>>>> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >>>>>>> their address in r12 if called at their global entry point. ELFv1 have a >>>>>>> function descriptor with call address and TOC in it, caller has to load >>>>>>> the TOC if it's global. >>>>>>> >>>>>>> The vdso doesn't have TOC, it has one global address (the vdso data >>>>>>> page) which it loads by calculating its own address. >>>>>>> >>>>>>> The kernel doesn't change the vdso based on whether it's called by a v1 >>>>>>> or v2 userspace (it doesn't really know itself and would have to export >>>>>>> different functions). glibc has a hack to create something: >>>>>>> >>>>>>> # define VDSO_IFUNC_RET(value) \ >>>>>>> ({ \ >>>>>>> static Elf64_FuncDesc vdso_opd = { .fd_toc = ~0x0 }; \ >>>>>>> vdso_opd.fd_func = (Elf64_Addr)value; \ >>>>>>> &vdso_opd; \ >>>>>>> }) >>>>>>> >>>>>>> If we could make something which links more like any other dso with >>>>>>> ELFv1, that would be good. Otherwise I think v2 is preferable so it >>>>>>> doesn't have to calculate its own address. >>>>>> >>>>>> I see the following in glibc. So looks like PPC32 is like PPC64 elfv1. >>>>>> By the way, they are talking about something not completely finished in >>>>>> the kernel. Can we finish it ? >>>>> >>>>> Possibly can. It seems like a good idea to fix all loose ends if we are >>>>> going to add new versions. Will have to check with the toolchain people >>>>> to make sure we're doing the right thing. >>>> >>>> "ELFv1" and "ELFv2" are PPC64-specific names for the old and new >>>> version of the ELF psABI for PPC64. They have nothing at all to do >>>> with PPC32 which is a completely different ABI from either. >>> >>> Right, I'm just talking about those comments -- it seems like the kernel >>> vdso should contain an .opd section with function descriptors in it for >>> elfv1 calls, rather than the hack it has now of creating one in the >>> caller's .data section. >>> >>> But all that function descriptor code is gated by >>> >>> #if (defined(__PPC64__) || defined(__powerpc64__)) && _CALL_ELF != 2 >>> >>> So it seems PPC32 does not use function descriptors but a direct pointer >>> to the entry point like PPC64 with ELFv2. >> >> Yes, this hack is only for ELFv1. The missing ODP has not been an issue >> or glibc because it has been using the inline assembly to emulate the >> functions call since initial vDSO support (INTERNAL_VSYSCALL_CALL_TYPE). >> It just has become an issue when I added a ifunc optimization to >> gettimeofday so it can bypass the libc.so and make plt branch to vDSO >> directly. > > I can't understand if it's actually a problem for you or not. > > Regardless if you can hack around it, it seems to me that if we're going > to add sane calling conventions to the vdso, then we should also just > have a .opd section for it as well, whether or not a particular libc > requires it. The main problem for glibc is the complication of having to handle two different calling conventions. Specially if kernel starts to provide new vDSO symbols with only with the new semantic. 
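To spell out the two conventions side by side, a minimal sketch (powerpc64-only, and not the actual INTERNAL_VSYSCALL_CALL_TYPE macro): the legacy entry behaves like 'sc', so failure is signalled through CR0.SO with a positive errno in r3 and an asm shim is unavoidable, whereas a C-ABI entry is just a function call.

#include <time.h>

/* Legacy convention: call the vDSO entry like a system call and convert
   CR0.SO plus a positive errno in r3 into the 0-or-negative-errno form.
   The clobber list covers the volatile registers such an entry may touch;
   the real glibc macro is the authoritative version.  */
long
vdso_call_legacy (void *fn, long arg1, long arg2)
{
  register long cr __asm__ ("r0");
  register long r3 __asm__ ("r3") = arg1;
  register long r4 __asm__ ("r4") = arg2;

  __asm__ __volatile__
    ("mtctr %[fn]\n\t"
     "bctrl\n\t"
     "mfcr  %[cr]"
     : [cr] "=&r" (cr), "+r" (r3), "+r" (r4)
     : [fn] "r" (fn)
     : "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12",
       "lr", "ctr", "xer", "cr0", "cr1", "cr5", "cr6", "cr7", "memory");

  /* CR0.SO is bit 28 of the word mfcr produces.  */
  return (cr & (1L << 28)) ? -r3 : r3;
}

/* New convention: nothing to translate.  */
long
vdso_call_c (long (*fn) (clockid_t, struct timespec *),
             clockid_t clk, struct timespec *ts)
{
  return fn (clk, ts);
}

Carrying both of these, and knowing per symbol which one applies, is the complication referred to above.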
But I think it is doable, it will require some internal tinkering on how to handle vDSO (to indicate which mechanism to use) which will most likely be powerpc specific. > >> Recently on some y2038 refactoring it was suggested to get rid of this >> and make gettimeofday call clock_gettime regardless. But some felt that >> the performance degradation was not worth for a symbol that is still used >> extensibility, so we stuck with the hack. >> >> And I think having this synthetic opd entry is not an issue, since for >> full relro the program's will be used and correctly set as read-only. > > I'm not quite sure what this means, I don't really know how glibc ifunc > works. How do you set r2 if you have no opd? IFUNC itself is not an issue here, since it just a dynamic relocation that instruct the dynamic linker to issue a defined function that provides the actual symbol. The problem is symbol resolution for kernel vDSO symbol that returns a pointer to the text segment instead of the expected ODP entry. And currently glibc assumes that kernel vDSO does not use TOC or AUX, so it sets a bogus value (~0x0) just to avoid trigger lazy resolution in some cases. It makes sense with the current contract that vDSO calls should behave as syscall, but lesser the flexibility of kernel implementation. > >> The issue is more for glibc itself, and I wouldn't mind to just remove the >> gettimeofday and time optimizations and use the default vDSO support >> (which might increase in function latency though). >> >> As Rich has put, it would be simpler to just have powerpc vDSO symbols >> to have a default function call semantic so we could issue a function >> call directly. But for powerpc64, we glibc will need to continue to >> support this non-standard call on older kernels and I am not sure if >> adding new symbols with a different semantic will help much. > > Yeah, we will add entry points with default function call semantics. > At which point we make the things look like any other dso unless there > is good reason otherwise. I think the move to make vDSO has the same semantic as an usual DSO is the correct one. I am just pointing out that different than musl, glibc already support vDSO for powerpc and changing its interface will most likely require more handling in powerpc specific bits. > >> GLibc already hides this powerpc semantic on INTERNAL_VSYSCALL_CALL_TYPE, >> so internally all syscalls are assumed to have the new semantic (-errno >> on error, 0 on success). Adding another ELFv1 would require to add >> more logic to handle multiple symbol version for vDSO setup >> (sysdeps/unix/sysv/linux/dl-vdso-setup.h), which would mostly likely to >> require an arch-specific implementation to handle it. > > Is it not a build-time choice? The arch can set its own vdso symbol > names AFAIKS. To enable vDSO support the architecture just need to define the correspondent macros with the expected names. For instance, for powerpc: sysdeps/unix/sysv/linux/powerpc/sysdep.h [...] 195 #if defined(__PPC64__) || defined(__powerpc64__) 196 #define HAVE_CLOCK_GETRES64_VSYSCALL "__kernel_clock_getres" 197 #define HAVE_CLOCK_GETTIME64_VSYSCALL "__kernel_clock_gettime" 198 #else 199 #define HAVE_CLOCK_GETRES_VSYSCALL "__kernel_clock_getres" 200 #define HAVE_CLOCK_GETTIME_VSYSCALL "__kernel_clock_gettime" 201 #endif 202 #define HAVE_GETCPU_VSYSCALL "__kernel_getcpu" 203 #define HAVE_TIME_VSYSCALL "__kernel_time" 204 #define HAVE_GETTIMEOFDAY_VSYSCALL "__kernel_gettimeofday" 205 #define HAVE_GET_TBFREQ "__kernel_get_tbfreq" [...] 
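On top of that table, one possible shape for the "indicate which mechanism to use" tinkering, purely as a sketch: probe for a new-ABI name first and remember which convention the chosen entry uses. The name "__kernel_clock_gettime_c" and the vdso_sym helper are invented for illustration; only "__kernel_clock_gettime" exists today.

#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the real lookup against the vDSO image found via
   getauxval (AT_SYSINFO_EHDR); stubbed out so the sketch is self-contained.  */
static void *vdso_sym (const char *version, const char *name)
{
  (void) version;
  (void) name;
  return NULL;
}

struct vdso_entry
{
  void *addr;   /* code address (or descriptor, on ELFv1) */
  bool c_abi;   /* true: plain C call returning 0/-errno; false: legacy CR[SO] shim */
};

void setup_clock_gettime (struct vdso_entry *e)
{
  e->addr = vdso_sym ("LINUX_2.6.15", "__kernel_clock_gettime_c");
  if (e->addr != NULL)
    {
      e->c_abi = true;
      return;
    }
  e->addr = vdso_sym ("LINUX_2.6.15", "__kernel_clock_gettime");
  e->c_abi = false;
}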
GLIBC will create and initialize the vDSO pointers in a arch neutral way, however the vDSO call itself is parametrized to handle the powerpc specific bits (the INTERNAL_VSYSCALL_CALL_TYPE which is called by INLINE_SYSCALL_CALL). > > Thanks, > Nick > From libc-dev at lists.llvm.org Wed Apr 29 19:51:56 2020 From: libc-dev at lists.llvm.org (Michael Ellerman via libc-dev) Date: Thu, 30 Apr 2020 12:51:56 +1000 Subject: [libc-dev] [musl] Re: New powerpc vdso calling convention In-Reply-To: <20200425162204.GJ11469@brightrain.aerifal.cx> References: <1587790194.w180xsw5be.astroid@bobo.none> <9371cac5-20bb-0552-2609-0d537f41fecd@c-s.fr> <1587810370.tg8ym9yjpc.astroid@bobo.none> <20200425162204.GJ11469@brightrain.aerifal.cx> Message-ID: <87v9lheldf.fsf@mpe.ellerman.id.au> Rich Felker writes: > On Sat, Apr 25, 2020 at 08:56:54PM +1000, Nicholas Piggin wrote: >> >> The ELF v2 ABI convention would suit it well, because the caller already >> >> requires the function address for ctr, so having it in r12 will >> >> eliminate the need for address calculation, which suits the vdso data >> >> page access. >> >> >> >> Is there a need for ELF v1 specific calls as well, or could those just be >> >> deprecated and remain on existing functions or required to use the ELF >> >> v2 calls using asm wrappers? >> > >> > What's ELF v1 and ELF v2 ? Is ELF v1 what PPC32 uses ? If so, I'd say >> > yes, it would be good to have it to avoid going through ASM in the middle.. >> >> I'm not sure about PPC32. On PPC64, ELFv2 functions must be called with >> their address in r12 if called at their global entry point. ELFv1 have a >> function descriptor with call address and TOC in it, caller has to load >> the TOC if it's global. >> >> The vdso doesn't have TOC, it has one global address (the vdso data >> page) which it loads by calculating its own address. > > A function descriptor could be put in the VDSO data page, or as it's > done now by glibc the vdso linkage code could create it. My leaning is > to at least have a version of the code that's callable (with the right > descriptor around it) by v1 binaries, but since musl does not use > ELFv1 at all we really have no stake in this and I'm fine with > whatever outcome users of v1 decide on. > >> The kernel doesn't change the vdso based on whether it's called by a v1 >> or v2 userspace (it doesn't really know itself and would have to export >> different functions). glibc has a hack to create something: > > I'm pretty sure it does know because signal invocation has to know > whether the function pointer points to a descriptor or code. At least > for FDPIC archs (similar to PPC64 ELFv1 function descriptors) it knows > and has to know. It does know, see TIF_ELF2ABI which is tested by is_elf2_task(), and as you say is used by the signal delivery code. Currently the VDSO entry points are not functions, so they don't need to change based on the ABI. cheers
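The distinction Michael points to can be sketched as follows (illustrative C only, not the kernel's signal code; struct sigframe_regs is an invented stand-in for the register state being set up). For an ELFv2 task the handler address is code and also goes into r12 for the global entry point; for an ELFv1 task it is a function descriptor, so the entry point and TOC (r2) are loaded from it.

#include <stdbool.h>

/* ELFv1 function descriptor (OPD) layout.  */
struct func_desc
{
  unsigned long entry;
  unsigned long toc;
  unsigned long env;
};

/* Invented container for the few registers the sketch touches.  */
struct sigframe_regs
{
  unsigned long nip;    /* next instruction pointer */
  unsigned long gpr2;   /* TOC pointer */
  unsigned long gpr12;  /* entry-point address for ELFv2 global entry */
};

void
setup_handler_entry (struct sigframe_regs *regs, const void *sa_handler,
                     bool elfv2_task)
{
  if (elfv2_task)
    {
      /* ELFv2: sa_handler is the code address itself.  */
      regs->nip = (unsigned long) sa_handler;
      regs->gpr12 = (unsigned long) sa_handler;
    }
  else
    {
      /* ELFv1: sa_handler points at a descriptor; load entry and TOC.  */
      const struct func_desc *fd = sa_handler;
      regs->nip = fd->entry;
      regs->gpr2 = fd->toc;
    }
}

Because the current vDSO entry points are reached the same way from either ABI, nothing like this is needed for them yet; it only becomes relevant once they are exported as ordinary functions.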