[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Sat Oct 5 13:03:39 PDT 2019

I've considered omitting the `__llvm_prf_cnts` section, the only downside
of doing so is you need to make sure that the space for counters is
allocated before you execute any instrumented code. That makes it tricky if
you want to instrument libc which is something we do today.

We can work around it though: a simple solution is to avoid collecting
profiles for libc, a better solution would be to carefully annotate all
libc functions that are executed before constructors to ensure that they
aren't instrumented e.g. by introducing a no_profile attribute, this is the
same approach we use ASan by annotating all function on the early startup
path with __attribute__((no_sanitize("address"))).

On Fri, Oct 4, 2019 at 4:43 PM Xinliang David Li <davidxl at google.com> wrote:

>
>
> On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:
>
>>
>>
>> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>>
>> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>> I will let Petr chime in explaining more details -- the reply I provided
>>> are based on my understanding of 'Reloc' proposal, which may be quite off.
>>>
>>> David
>>>
>>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>>
>>>>
>>>>
>>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
>>>> wrote:
>>>>
>>>> Petr's method is more like 'relocating' the counter memory from static
>>>> allocated memory to mmapped memory, but better be named 'reloc' method.
>>>>
>>>>
>>>> Thank you, that's a much better name.
>>>>
>>>>
>>>> Now lets do some comparison:
>>>>
>>>> 1) reloc method allows runtime to mmap from the start of the raw
>>>> profile, so there is no need to page align the counter data in the file;
>>>>
>>>>
>>>> Let me see if I follow. Is the idea to allocate a buffer large enough
>>>> to contain the entire contiguous raw profile, in each instrumented image? I
>>>> can see how it would be possible to mmap() a single such buffer onto a raw
>>>> profile without any extra padding or format changes. However, this buffer
>>>> allocation entails doubling the memory overhead of the
>>>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>>>> copied into the buffer-to-be-mmap'd. This may not work on well on iOS, as
>>>> the devices we support are quite memory-constrained. The per-process memory
>>>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>>>> Music.app already get very close to this limit.
>>>>
>>>> Also, with more than one instrumented image (DSOs), I believe
>>>> page-alignment padding is still necessary so that each in-memory raw
>>>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>>>
>>>> Perhaps that is a non-issue, as it can be solved by simply creating N
>>>> different raw profiles -- one per instrumented image. Actually this should
>>>> work for both proposed schemes.
>>>>
>>>
>> David's understanding is correct, my plan is to mmap the entire raw
>> profile from the start which means that there's no requirement for page
>> alignment. Alternative would be to mmap only a page-aligned portion of the
>> raw file of file that contains counters, i.e. we would mmap
>> [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length +
>> PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation, we can
>> simply mmap the file directly as read/writable region.
>>
>>> 2) reloc method does have more instrumentation overhead, however it is
>>>> not as large as we think.
>>>>    a) the bias load can be done at the start of the function
>>>>    b) on x86, the addressing mode kicks in, there is actually no size
>>>> difference in terms of updating:
>>>>         addq    $1, counter+16(%rip)
>>>>    vs
>>>>        addq    $1, counter+16(%rax)
>>>>
>>>>   both are 7 bytes long.
>>>>
>>>> so the overhead will be a few bytes more per function.
>>>>
>>>>
>>>> I see, thanks for this correction!
>>>>
>>>
>> Thanks for the suggestion David! The prototype implementation I have
>> loads the bias/reloc on every counter update but loading the bias/reloc at
>> the start of the function is going to be more efficient. I'll update the
>> implementation and try to get some numbers by compiling instrumented Clang.
>>
>>> 3) reloc method produces smaller sized raw profile
>>>>
>>>>
>>>> I'm interested in understanding how major of an issue this is for
>>>> Google's use cases. On iOS/macOS, the worst case overhead of section
>>>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>>>> What's the expected file size overhead for Google? Is a much larger page
>>>> size in use? If so, is switching to a smaller page size when running
>>>> instrumented binaries an option?
>>>>
>>>> In addition to understanding this, per my last comment to Petr, I would
>>>> also like to understand the specific security concerns w.r.t mapping a
>>>> section onto a file. If this is something Darwin should not allow, I would
>>>> like to let our kernel team know!
>>>>
>>>
>> It's not a specific security concern, rather it's a consequence of
>> Fuchsia's design/capability model. In Fuchsia, the address space
>> (sub)regions are objects which are accessed and "operated on" through
>> handles (capabilities). When loading modules (executables and DSOs), our
>> dynamic linker creates a new subregion for each one, mmaps individual
>> segments into those subregions and then closes all subregion handles. That
>> doesn't unmap those files, but it means that the address space layout can
>> never be modified because there's no way to recover those handles. That's a
>> nice property because it means that all module mapping is immutable (e.g.
>> it's impossible to make any of the executable segments writable or even
>> readable if you use execute-only memory on AArch64). However, it also means
>> that we cannot mmap anything over those segments as would be needed for the
>> continuous mode. The only way to make this work is to avoid closing the
>> subregion handles and keep those for later use, but that would also make
>> our dynamic linker a potential attack vector.
>>
>> More generally, relying on the overmap behavior means tying the
>> implementation to the OS-specific API which is available on Linux and
>> Darwin, but may not be available on other OSes (e.g. I'm not even sure if
>> this is possible on Windows). It'd be nice to implement a solution that
>> could be made to work on all OSes.
>>
>>
>> Ok, I'm convinced that the reloc mode is the more portable approach. Per
>> David's comments (the one about not actually emitting `__llvm_prf_cnts` in
>> the binary //especially//, although I haven't quoted it here), reloc mode
>> can probably be implemented reasonably efficiently on Linux/Darwin. The
>> runtime can inspect `__llvm_prf_data` to determine the number of counters,
>> mmap() a sufficient amount of memory, and then update the CounterPtr fields
>> of all the data entries. This means the instrumentation would need to look
>> like: https://godbolt.org/z/-PvObE.
>>
>>
> incr3 version looks great.
>
>
>> @Petr I'd be happy to work together on this. Let me know if you'd like
>> any help prototyping/testing, perhaps we can split up some of the work.
>>
>> One thing we haven't really resolved is the conversation about not
>> changing the raw profile format. Fwiw, I don't think avoiding changes to
>> the raw profile format should be a design goal, as the point of having
>> separate raw/indexed formats is that the raw format should be cheap to rev.
>> I think that I've demonstrated that the file size overhead from adding
>> section alignment padding is small: I suspect that getting rid of this
>> would introduce a considerable amount of complexity to any kind of
>> mmap()-mode implementation.
>>
>>
> if there is a need to change raw format to make things more efficient, go
> for it.
>
> David
>
>
>
>> Anyway, I'm excited to see your results. Please keep me in the loop!
>>
>> thanks,
>> vedant
>>
>>
>> 4) The continuous method has the advantage of being more efficient in
>>>> running speed.
>>>>
>>>>
>>>> As pointed out above, the reduced memory overhead is another (perhaps
>>>> more significant) factor.
>>>>
>>>>
>>>> Also, I believe both methods should support on-line merging -- as the
>>>> profile update happens 'In-place' -- the online merging code can be
>>>> simplified (no locking, reading, updating are needed).
>>>>
>>>>
>>>> Thanks for pointing this out. In my last experiment, I added a `while
>>>> (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c`
>>>> after it set up the mmap(), and then edited the counter section of the
>>>> profile using a hex editor. The updates did not appear in the
>>>> `proof-of-concept` process. But perhaps there is a more sophisticated way
>>>> to share the physical pages backing `__llvm_prf_cnts` between processes,
>>>> while they are also mapped onto the filesystem. I need to run more
>>>> experiments to understand this.
>>>>
>>>> vedant
>>>>
>>>>
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>>>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>>
>>>>> vsk added a comment.
>>>>>
>>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>, @phosek
>>>>> wrote:
>>>>>
>>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>,
>>>>> @davidxl wrote:
>>>>> >
>>>>> > > +petr who has similar needs for Fuchia platform
>>>>> >
>>>>> >
>>>>> > Thank you for including me, we have exactly the same use case in
>>>>> Fuchsia and this is the solution I've initially considered and been
>>>>> experimenting with locally based on the discussion with @davidxl. However,
>>>>> after internal discussion we've decided against this solution for two
>>>>> reasons:
>>>>> >
>>>>> > 1. This requires the ability to mmap the output file over the
>>>>> (portion of) binary which is something we don't allow in Fuchsia for
>>>>> security reasons; once all modules have been mapped in by the dynamic
>>>>> linker, we don't allow any further changes to their mapping and using this
>>>>> solution would require special dynamic linker which is possible but highly
>>>>> undesirable.
>>>>> > 2. It introduces bloat to both the binary and the output file. It
>>>>> also complicates the runtime implementation due to alignment and padding
>>>>> requirements.
>>>>> >
>>>>> >   The alternative solution we've came up and that I've now been
>>>>> implementing is to change the instrumentation to allow extra level of
>>>>> indirection. Concretely, the instrumentation conceptually does the
>>>>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>>>>> change this `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>>>>> c` where `bias` is a global variable set by the runtime to be the offset
>>>>> between the `__llvm_prf_cnts` section and the corresponding location in the
>>>>> file that's mmapped in. Initially, that offset would be `0` which would
>>>>> result in exactly the same behavior as today, but the runtime can mmap the
>>>>> output file into address space and then change the offset to make the
>>>>> counters be continuously updated.
>>>>> >
>>>>> >   The advantage of this solution is that there are no changes needed
>>>>> to the profile format. It also doesn't require mmapping the output file
>>>>> over the binary, the output file can be mmapped anywhere in the address
>>>>> space. The disadvantage is extra overhead since instrumentation is going to
>>>>> be slightly more complicated, although we don't have any numbers yet to
>>>>> quantify how much slower it's going to be. The implementation should be
>>>>> fairly minimal and my tentative plan was to gate it on compiler switch, so
>>>>> it wouldn't affect existing in any way (modulo introducing one new variable
>>>>> in the runtime to hold the bias). I'm hoping to have the prototype ready
>>>>> and uploaded for review within the next few days. What's your opinion on
>>>>> this idea? Would this be something that you'd be interested in as an
>>>>> alternative approach?
>>>>>
>>>>>
>>>>> @phosek thank you for sharing this alternative. I hope you don't mind
>>>>> my calling this the bias method, after the term in the proposed
>>>>> instrumentation.
>>>>>
>>>>> The TLDR is that I don't think the bias method is a good fit for
>>>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>>>> from reduced section alignment. Let me try and justify these claims :).
>>>>>
>>>>> First and most critically, note that (on Linux & Darwin, at least),
>>>>> mmap() requires that the file `offset` argument be page-aligned. So, I
>>>>> don't believe there's a way to avoid changing the raw profile format, or to
>>>>> avoid the concomitant complexity necessary to calculate padding bytes in
>>>>> the runtime. I don't see how the bias method solves this problem: this
>>>>> seems to be a fundamental limitation of mmap().
>>>>>
>>>>> Second, note that the size overhead of counter increment
>>>>> instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}`
>>>>> for all but the smallest of programs. Assuming 16K pages, in the worst
>>>>> case, the section alignment costs 32K per image. Assuming an increment can
>>>>> be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section
>>>>> alignment cost is equivalent to ~4700 increments. For comparison, this is
>>>>> roughly the number of counter increments in //FileCheck.cpp.o//. If the
>>>>> size overhead of counter instrumentation were to double, as I believe it
>>>>> would with the bias method, it would rapidly erase the memory savings from
>>>>> eliminating the section alignment requirement.
>>>>>
>>>>> Third, I'm not sure I understand the security constraints in Fuchsia
>>>>> that make it undesirable to map the contents of a section onto a file. Is
>>>>> the concern that an attacker process may update the file, thereby changing
>>>>> the in-memory contents of the section? I'll note that on Darwin, at least,
>>>>> the kernel does not permit this kind of sharing. If an attacker process
>>>>> modifies a file on disk that has the in-memory contents of a section
>>>>> MAP_SHARED over it, the in-memory contents of the mapped section are not
>>>>> changed (I verified this by making a small modification to
>>>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>>>> the bias method will not be necessary to support continuous mode on
>>>>> Fuchsia, but it seems this depends on the answer to the security question.
>>>>> Either way, istm that the high level approach taken in this patch
>>>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>>>> support that can be used to implement the bias method.
>>>>>
>>>>> I appreciate your feedback and think we can definitely work together
>>>>> to build some kind of common solution!
>>>>>
>>>>>
>>>>> CHANGES SINCE LAST ACTION
>>>>>   https://reviews.llvm.org/D68351/new/
>>>>>
>>>>> https://reviews.llvm.org/D68351
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>
>>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191005/e11c4e4d/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4843 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191005/e11c4e4d/attachment.bin>