[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Fri Oct 4 16:43:25 PDT 2019

On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:

>
>
> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>
> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>> I will let Petr chime in explaining more details -- the reply I provided
>> are based on my understanding of 'Reloc' proposal, which may be quite off.
>>
>> David
>>
>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>
>>>
>>>
>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
>>> wrote:
>>>
>>> Petr's method is more like 'relocating' the counter memory from static
>>> allocated memory to mmapped memory, but better be named 'reloc' method.
>>>
>>>
>>> Thank you, that's a much better name.
>>>
>>>
>>> Now lets do some comparison:
>>>
>>> 1) reloc method allows runtime to mmap from the start of the raw
>>> profile, so there is no need to page align the counter data in the file;
>>>
>>>
>>> Let me see if I follow. Is the idea to allocate a buffer large enough to
>>> contain the entire contiguous raw profile, in each instrumented image? I
>>> can see how it would be possible to mmap() a single such buffer onto a raw
>>> profile without any extra padding or format changes. However, this buffer
>>> allocation entails doubling the memory overhead of the
>>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>>> copied into the buffer-to-be-mmap'd. This may not work on well on iOS, as
>>> the devices we support are quite memory-constrained. The per-process memory
>>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>>> Music.app already get very close to this limit.
>>>
>>> Also, with more than one instrumented image (DSOs), I believe
>>> page-alignment padding is still necessary so that each in-memory raw
>>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>>
>>> Perhaps that is a non-issue, as it can be solved by simply creating N
>>> different raw profiles -- one per instrumented image. Actually this should
>>> work for both proposed schemes.
>>>
>>
> David's understanding is correct, my plan is to mmap the entire raw
> profile from the start which means that there's no requirement for page
> alignment. Alternative would be to mmap only a page-aligned portion of the
> raw file of file that contains counters, i.e. we would mmap
> [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length +
> PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation, we can
> simply mmap the file directly as read/writable region.
>
>> 2) reloc method does have more instrumentation overhead, however it is
>>> not as large as we think.
>>>    a) the bias load can be done at the start of the function
>>>    b) on x86, the addressing mode kicks in, there is actually no size
>>> difference in terms of updating:
>>>         addq    $1, counter+16(%rip)
>>>    vs
>>>        addq    $1, counter+16(%rax)
>>>
>>>   both are 7 bytes long.
>>>
>>> so the overhead will be a few bytes more per function.
>>>
>>>
>>> I see, thanks for this correction!
>>>
>>
> Thanks for the suggestion David! The prototype implementation I have loads
> the bias/reloc on every counter update but loading the bias/reloc at the
> start of the function is going to be more efficient. I'll update the
> implementation and try to get some numbers by compiling instrumented Clang.
>
>> 3) reloc method produces smaller sized raw profile
>>>
>>>
>>> I'm interested in understanding how major of an issue this is for
>>> Google's use cases. On iOS/macOS, the worst case overhead of section
>>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>>> What's the expected file size overhead for Google? Is a much larger page
>>> size in use? If so, is switching to a smaller page size when running
>>> instrumented binaries an option?
>>>
>>> In addition to understanding this, per my last comment to Petr, I would
>>> also like to understand the specific security concerns w.r.t mapping a
>>> section onto a file. If this is something Darwin should not allow, I would
>>> like to let our kernel team know!
>>>
>>
> It's not a specific security concern, rather it's a consequence of
> Fuchsia's design/capability model. In Fuchsia, the address space
> (sub)regions are objects which are accessed and "operated on" through
> handles (capabilities). When loading modules (executables and DSOs), our
> dynamic linker creates a new subregion for each one, mmaps individual
> segments into those subregions and then closes all subregion handles. That
> doesn't unmap those files, but it means that the address space layout can
> never be modified because there's no way to recover those handles. That's a
> nice property because it means that all module mapping is immutable (e.g.
> it's impossible to make any of the executable segments writable or even
> readable if you use execute-only memory on AArch64). However, it also means
> that we cannot mmap anything over those segments as would be needed for the
> continuous mode. The only way to make this work is to avoid closing the
> subregion handles and keep those for later use, but that would also make
> our dynamic linker a potential attack vector.
>
> More generally, relying on the overmap behavior means tying the
> implementation to the OS-specific API which is available on Linux and
> Darwin, but may not be available on other OSes (e.g. I'm not even sure if
> this is possible on Windows). It'd be nice to implement a solution that
> could be made to work on all OSes.
>
>
> Ok, I'm convinced that the reloc mode is the more portable approach. Per
> David's comments (the one about not actually emitting `__llvm_prf_cnts` in
> the binary //especially//, although I haven't quoted it here), reloc mode
> can probably be implemented reasonably efficiently on Linux/Darwin. The
> runtime can inspect `__llvm_prf_data` to determine the number of counters,
> mmap() a sufficient amount of memory, and then update the CounterPtr fields
> of all the data entries. This means the instrumentation would need to look
> like: https://godbolt.org/z/-PvObE.
>
>
incr3 version looks great.

> @Petr I'd be happy to work together on this. Let me know if you'd like any
> help prototyping/testing, perhaps we can split up some of the work.
>
> One thing we haven't really resolved is the conversation about not
> changing the raw profile format. Fwiw, I don't think avoiding changes to
> the raw profile format should be a design goal, as the point of having
> separate raw/indexed formats is that the raw format should be cheap to rev.
> I think that I've demonstrated that the file size overhead from adding
> section alignment padding is small: I suspect that getting rid of this
> would introduce a considerable amount of complexity to any kind of
> mmap()-mode implementation.
>
>
if there is a need to change raw format to make things more efficient, go
for it.

David

> Anyway, I'm excited to see your results. Please keep me in the loop!
>
> thanks,
> vedant
>
>
> 4) The continuous method has the advantage of being more efficient in
>>> running speed.
>>>
>>>
>>> As pointed out above, the reduced memory overhead is another (perhaps
>>> more significant) factor.
>>>
>>>
>>> Also, I believe both methods should support on-line merging -- as the
>>> profile update happens 'In-place' -- the online merging code can be
>>> simplified (no locking, reading, updating are needed).
>>>
>>>
>>> Thanks for pointing this out. In my last experiment, I added a `while
>>> (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c`
>>> after it set up the mmap(), and then edited the counter section of the
>>> profile using a hex editor. The updates did not appear in the
>>> `proof-of-concept` process. But perhaps there is a more sophisticated way
>>> to share the physical pages backing `__llvm_prf_cnts` between processes,
>>> while they are also mapped onto the filesystem. I need to run more
>>> experiments to understand this.
>>>
>>> vedant
>>>
>>>
>>>
>>> David
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>
>>>> vsk added a comment.
>>>>
>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>, @phosek
>>>> wrote:
>>>>
>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>,
>>>> @davidxl wrote:
>>>> >
>>>> > > +petr who has similar needs for Fuchia platform
>>>> >
>>>> >
>>>> > Thank you for including me, we have exactly the same use case in
>>>> Fuchsia and this is the solution I've initially considered and been
>>>> experimenting with locally based on the discussion with @davidxl. However,
>>>> after internal discussion we've decided against this solution for two
>>>> reasons:
>>>> >
>>>> > 1. This requires the ability to mmap the output file over the
>>>> (portion of) binary which is something we don't allow in Fuchsia for
>>>> security reasons; once all modules have been mapped in by the dynamic
>>>> linker, we don't allow any further changes to their mapping and using this
>>>> solution would require special dynamic linker which is possible but highly
>>>> undesirable.
>>>> > 2. It introduces bloat to both the binary and the output file. It
>>>> also complicates the runtime implementation due to alignment and padding
>>>> requirements.
>>>> >
>>>> >   The alternative solution we've came up and that I've now been
>>>> implementing is to change the instrumentation to allow extra level of
>>>> indirection. Concretely, the instrumentation conceptually does the
>>>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>>>> change this `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>>>> c` where `bias` is a global variable set by the runtime to be the offset
>>>> between the `__llvm_prf_cnts` section and the corresponding location in the
>>>> file that's mmapped in. Initially, that offset would be `0` which would
>>>> result in exactly the same behavior as today, but the runtime can mmap the
>>>> output file into address space and then change the offset to make the
>>>> counters be continuously updated.
>>>> >
>>>> >   The advantage of this solution is that there are no changes needed
>>>> to the profile format. It also doesn't require mmapping the output file
>>>> over the binary, the output file can be mmapped anywhere in the address
>>>> space. The disadvantage is extra overhead since instrumentation is going to
>>>> be slightly more complicated, although we don't have any numbers yet to
>>>> quantify how much slower it's going to be. The implementation should be
>>>> fairly minimal and my tentative plan was to gate it on compiler switch, so
>>>> it wouldn't affect existing in any way (modulo introducing one new variable
>>>> in the runtime to hold the bias). I'm hoping to have the prototype ready
>>>> and uploaded for review within the next few days. What's your opinion on
>>>> this idea? Would this be something that you'd be interested in as an
>>>> alternative approach?
>>>>
>>>>
>>>> @phosek thank you for sharing this alternative. I hope you don't mind
>>>> my calling this the bias method, after the term in the proposed
>>>> instrumentation.
>>>>
>>>> The TLDR is that I don't think the bias method is a good fit for
>>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>>> from reduced section alignment. Let me try and justify these claims :).
>>>>
>>>> First and most critically, note that (on Linux & Darwin, at least),
>>>> mmap() requires that the file `offset` argument be page-aligned. So, I
>>>> don't believe there's a way to avoid changing the raw profile format, or to
>>>> avoid the concomitant complexity necessary to calculate padding bytes in
>>>> the runtime. I don't see how the bias method solves this problem: this
>>>> seems to be a fundamental limitation of mmap().
>>>>
>>>> Second, note that the size overhead of counter increment
>>>> instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}`
>>>> for all but the smallest of programs. Assuming 16K pages, in the worst
>>>> case, the section alignment costs 32K per image. Assuming an increment can
>>>> be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section
>>>> alignment cost is equivalent to ~4700 increments. For comparison, this is
>>>> roughly the number of counter increments in //FileCheck.cpp.o//. If the
>>>> size overhead of counter instrumentation were to double, as I believe it
>>>> would with the bias method, it would rapidly erase the memory savings from
>>>> eliminating the section alignment requirement.
>>>>
>>>> Third, I'm not sure I understand the security constraints in Fuchsia
>>>> that make it undesirable to map the contents of a section onto a file. Is
>>>> the concern that an attacker process may update the file, thereby changing
>>>> the in-memory contents of the section? I'll note that on Darwin, at least,
>>>> the kernel does not permit this kind of sharing. If an attacker process
>>>> modifies a file on disk that has the in-memory contents of a section
>>>> MAP_SHARED over it, the in-memory contents of the mapped section are not
>>>> changed (I verified this by making a small modification to
>>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>>> the bias method will not be necessary to support continuous mode on
>>>> Fuchsia, but it seems this depends on the answer to the security question.
>>>> Either way, istm that the high level approach taken in this patch
>>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>>> support that can be used to implement the bias method.
>>>>
>>>> I appreciate your feedback and think we can definitely work together to
>>>> build some kind of common solution!
>>>>
>>>>
>>>> CHANGES SINCE LAST ACTION
>>>>   https://reviews.llvm.org/D68351/new/
>>>>
>>>> https://reviews.llvm.org/D68351
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191004/d98e1344/attachment.html>