[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Xinliang David Li via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 4 13:44:12 PDT 2019


I will let Petr chime in with more details -- the replies I provided
are based on my understanding of the 'reloc' proposal, which may be quite off.

David

On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:

>
>
> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
> Petr's method is more like 'relocating' the counter memory from statically
> allocated memory to mmapped memory, so it might better be named the 'reloc' method.
>
>
> Thank you, that's a much better name.
>
>
> Now let's do some comparison:
>
> 1) the reloc method allows the runtime to mmap from the start of the raw
> profile, so there is no need to page-align the counter data in the file;
>
>
> Let me see if I follow. Is the idea to allocate a buffer large enough to
> contain the entire contiguous raw profile, in each instrumented image? I
> can see how it would be possible to mmap() a single such buffer onto a raw
> profile without any extra padding or format changes. However, this buffer
> allocation entails doubling the memory overhead of the
> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
> copied into the buffer-to-be-mmap'd. This may not work well on iOS, as
> the devices we support are quite memory-constrained. The per-process memory
> limit on iOS is ~1.4GB on many devices -- clients like News.app or
> Music.app already get very close to this limit.
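>
> Let me also spell out how I'm picturing the reloc scheme, as a minimal C
> sketch (the function and variable names are purely illustrative, and error
> handling is abbreviated -- this is just my reading of the proposal):
>
>   #include <fcntl.h>
>   #include <string.h>
>   #include <sys/mman.h>
>   #include <unistd.h>
>
>   /* Back the entire contiguous raw profile with a shared file mapping and
>      copy the statically allocated profile sections into it. Counter updates
>      would then need to be redirected into this buffer, which is where the
>      doubled memory for __llvm_prf_{cnts,data,names} comes from. */
>   static char *relocate_profile(const char *Path, const void *RawProfile,
>                                 size_t ProfileSize) {
>     int Fd = open(Path, O_RDWR | O_CREAT, 0666);
>     if (Fd < 0)
>       return NULL;
>     if (ftruncate(Fd, (off_t)ProfileSize) != 0)
>       return NULL;
>     /* File offset 0 is trivially page-aligned, so no in-file padding. */
>     char *Buf = (char *)mmap(NULL, ProfileSize, PROT_READ | PROT_WRITE,
>                              MAP_SHARED, Fd, 0);
>     if (Buf == MAP_FAILED)
>       return NULL;
>     memcpy(Buf, RawProfile, ProfileSize);
>     close(Fd);
>     return Buf;
>   }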
>
> Also, with more than one instrumented image (DSOs), I believe
> page-alignment padding is still necessary so that each in-memory raw
> profile buffer can be mmap()'d onto a page-aligned file offset.
>
> Perhaps that is a non-issue, as it can be solved by simply creating N
> different raw profiles -- one per instrumented image. Actually this should
> work for both proposed schemes.
>
>
> 2) the reloc method does have more instrumentation overhead; however, it is
> not as large as we might think.
>    a) the bias load can be done at the start of the function
>    b) on x86, the addressing mode kicks in, so there is actually no size
> difference in the update itself:
>         addq    $1, counter+16(%rip)
>    vs
>        addq    $1, counter+16(%rax)
>
>   both are 7 bytes long.
>
> so the overhead will be a few bytes more per function.
>
>
> I see, thanks for this correction!
>
>
> 3) the reloc method produces a smaller raw profile
>
>
> I'm interested in understanding how major an issue this is for Google's
> use cases. On iOS/macOS, the worst case overhead of section alignment
> padding is < 32K/8K respectively per image (due to 16K/4K pages). What's
> the expected file size overhead for Google? Is a much larger page size in
> use? If so, is switching to a smaller page size when running instrumented
> binaries an option?
>
> In addition to understanding this, per my last comment to Petr, I would
> also like to understand the specific security concerns w.r.t mapping a
> section onto a file. If this is something Darwin should not allow, I would
> like to let our kernel team know!
>
>
> 4) The continuous method has the advantage of being faster at run time.
>
>
> As pointed out above, the reduced memory overhead is another (perhaps more
> significant) factor.
>
>
> Also, I believe both methods should support online merging -- since the
> profile update happens in place, the online merging code can be
> simplified (no locking, reading, or updating is needed).
>
>
> Thanks for pointing this out. In my last experiment, I added a `while
> (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c`
> after it set up the mmap(), and then edited the counter section of the
> profile using a hex editor. The updates did not appear in the
> `proof-of-concept` process. But perhaps there is a more sophisticated way
> to share the physical pages backing `__llvm_prf_cnts` between processes,
> while they are also mapped onto the filesystem. I need to run more
> experiments to understand this.
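>
> For reference, the check loop I added was along these lines (a sketch from
> memory -- `CountersBegin`/`NumCounters` just stand in for however the mapped
> counter region is located in the real test):
>
>   #include <inttypes.h>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   /* Periodically dump the mapped counters so that external edits to the
>      profile file (e.g. via a hex editor) would be visible if the physical
>      pages were shared. */
>   static void watch_counters(const uint64_t *CountersBegin,
>                              size_t NumCounters) {
>     while (1) {
>       for (size_t I = 0; I < NumCounters; ++I)
>         printf("counter[%zu] = %" PRIu64 "\n", I, CountersBegin[I]);
>       sleep(1);
>     }
>   }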
>
> vedant
>
>
>
> David
>
>
>
>
>
>
>
>
> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>
>> vsk added a comment.
>>
>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>, @phosek
>> wrote:
>>
>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>, @davidxl
>> wrote:
>> >
>> > > +petr who has similar needs for the Fuchsia platform
>> >
>> >
>> > Thank you for including me, we have exactly the same use case in
>> Fuchsia and this is the solution I initially considered and have been
>> experimenting with locally based on the discussion with @davidxl. However,
>> after internal discussion we've decided against this solution for two
>> reasons:
>> >
>> > 1. This requires the ability to mmap the output file over (a portion of)
>> the binary, which is something we don't allow in Fuchsia for security
>> reasons; once all modules have been mapped in by the dynamic linker, we
>> don't allow any further changes to their mappings, and using this solution
>> would require a special dynamic linker, which is possible but highly
>> undesirable.
>> > 2. It introduces bloat to both the binary and the output file. It also
>> complicates the runtime implementation due to alignment and padding
>> requirements.
>> >
>> >   The alternative solution we've come up with, and that I've now been
>> implementing, is to change the instrumentation to allow an extra level of
>> indirection. Concretely, the instrumentation conceptually does the
>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>> change this to `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>> c` where `bias` is a global variable set by the runtime to the offset
>> between the `__llvm_prf_cnts` section and the corresponding location in the
>> file that's mmapped in. Initially, that offset would be `0`, which would
>> result in exactly the same behavior as today, but the runtime can mmap the
>> output file into the address space and then change the offset so that the
>> counters are continuously updated.
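>> >
>> >   In rough C terms, the lowered counter update would look something like
>> the sketch below (the symbol and helper names are only illustrative, not the
>> final codegen):
>> >
>> >     #include <stdint.h>
>> >
>> >     extern uint64_t __profc_foo[];               /* counters for foo()  */
>> >     extern intptr_t __llvm_profile_counter_bias; /* set by the runtime  */
>> >
>> >     static void increment_counter(unsigned idx) {
>> >       /* Treat the bias as a byte offset, hence the char * cast. When the
>> >          bias is 0, this degenerates to a plain __profc_foo[idx]++. */
>> >       uint64_t *c = (uint64_t *)((char *)&__profc_foo[idx] +
>> >                                  __llvm_profile_counter_bias);
>> >       *c += 1;
>> >     }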
>> >
>> >   The advantage of this solution is that there are no changes needed to
>> the profile format. It also doesn't require mmapping the output file over
>> the binary; the output file can be mmapped anywhere in the address space.
>> The disadvantage is extra overhead, since the instrumentation is going to be
>> slightly more complicated, although we don't have any numbers yet to
>> quantify how much slower it's going to be. The implementation should be
>> fairly minimal, and my tentative plan was to gate it on a compiler switch,
>> so it wouldn't affect existing behavior in any way (modulo introducing one
>> new variable in the runtime to hold the bias). I'm hoping to have the
>> prototype ready and uploaded for review within the next few days. What's
>> your opinion on this idea? Would this be something that you'd be interested
>> in as an alternative approach?
>>
>>
>> @phosek thank you for sharing this alternative. I hope you don't mind my
>> calling this the bias method, after the term in the proposed
>> instrumentation.
>>
>> The TLDR is that I don't think the bias method is a good fit for
>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>> mode, and its instrumentation overhead is likely to outweigh the savings
>> from reduced section alignment. Let me try and justify these claims :).
>>
>> First and most critically, note that (on Linux & Darwin, at least)
>> mmap() requires that the file `offset` argument be page-aligned. So, I
>> don't believe there's a way to avoid changing the raw profile format, or to
>> avoid the concomitant complexity necessary to calculate padding bytes in
>> the runtime. I don't see how the bias method solves this problem: this
>> seems to be a fundamental limitation of mmap().
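>>
>> To make the padding concern concrete, here is roughly the calculation the
>> runtime would have to do (a sketch with made-up names, not the code in this
>> patch):
>>
>>   #include <sys/mman.h>
>>   #include <sys/types.h>
>>   #include <unistd.h>
>>
>>   /* The counter section must be written at a page-aligned offset in the
>>      raw profile, because mmap() fails with EINVAL for unaligned file
>>      offsets. The writer inserts Padding bytes before the counters; the
>>      runtime then maps the counters at the resulting aligned offset. */
>>   static void *map_counters(int Fd, size_t CountersOffset,
>>                             size_t CountersSize) {
>>     size_t PageSize = (size_t)sysconf(_SC_PAGESIZE);
>>     size_t Padding = (PageSize - (CountersOffset % PageSize)) % PageSize;
>>     off_t AlignedOffset = (off_t)(CountersOffset + Padding);
>>     return mmap(NULL, CountersSize, PROT_READ | PROT_WRITE, MAP_SHARED, Fd,
>>                 AlignedOffset);
>>   }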
>>
>> Second, note that the size overhead of counter increment instrumentation
>> dwarfs the alignment overhead for `__llvm_prf_{cnts,data}` for all but the
>> smallest of programs. Assuming 16K pages, in the worst case, the section
>> alignment costs 32K per image. Assuming an increment can be encoded in 7
>> bytes as `incq l___profc_main(%rip)`, this section alignment cost is
>> equivalent to ~4700 increments. For comparison, this is roughly the number
>> of counter increments in //FileCheck.cpp.o//. If the size overhead of
>> counter instrumentation were to double, as I believe it would with the bias
>> method, it would rapidly erase the memory savings from eliminating the
>> section alignment requirement.
>>
>> Third, I'm not sure I understand the security constraints in Fuchsia that
>> make it undesirable to map the contents of a section onto a file. Is the
>> concern that an attacker process may update the file, thereby changing the
>> in-memory contents of the section? I'll note that on Darwin, at least, the
>> kernel does not permit this kind of sharing. If an attacker process
>> modifies a file on disk that has the in-memory contents of a section
>> MAP_SHARED over it, the in-memory contents of the mapped section are not
>> changed (I verified this by making a small modification to
>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>> the bias method will not be necessary to support continuous mode on
>> Fuchsia, but it seems this depends on the answer to the security question.
>> Either way, it seems to me that the high-level approach taken in this patch
>> as written is a better fit for Linux/Darwin, and moreover probably provides
>> support that can be used to implement the bias method.
>>
>> I appreciate your feedback and think we can definitely work together to
>> build some kind of common solution!
>>
>>
>> CHANGES SINCE LAST ACTION
>>   https://reviews.llvm.org/D68351/new/
>>
>> https://reviews.llvm.org/D68351
>>
>>
>>
>
>