[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Fri Oct 4 15:43:56 PDT 2019

On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:

> I will let Petr chime in explaining more details -- the reply I provided
> are based on my understanding of 'Reloc' proposal, which may be quite off.
>
> David
>
> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>
>>
>>
>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
>> wrote:
>>
>> Petr's method is more like 'relocating' the counter memory from static
>> allocated memory to mmapped memory, but better be named 'reloc' method.
>>
>>
>> Thank you, that's a much better name.
>>
>>
>> Now lets do some comparison:
>>
>> 1) reloc method allows runtime to mmap from the start of the raw profile,
>> so there is no need to page align the counter data in the file;
>>
>>
>> Let me see if I follow. Is the idea to allocate a buffer large enough to
>> contain the entire contiguous raw profile, in each instrumented image? I
>> can see how it would be possible to mmap() a single such buffer onto a raw
>> profile without any extra padding or format changes. However, this buffer
>> allocation entails doubling the memory overhead of the
>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>> copied into the buffer-to-be-mmap'd. This may not work on well on iOS, as
>> the devices we support are quite memory-constrained. The per-process memory
>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>> Music.app already get very close to this limit.
>>
>> Also, with more than one instrumented image (DSOs), I believe
>> page-alignment padding is still necessary so that each in-memory raw
>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>
>> Perhaps that is a non-issue, as it can be solved by simply creating N
>> different raw profiles -- one per instrumented image. Actually this should
>> work for both proposed schemes.
>>
>
David's understanding is correct, my plan is to mmap the entire raw profile
from the start which means that there's no requirement for page alignment.
Alternative would be to mmap only a page-aligned portion of the raw file of
file that contains counters, i.e. we would mmap [__llvm_prf_cnts &
-PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length + PAGE_SIZE - 1) &
-PAGE_SIZE). There's no need for buffer allocation, we can simply mmap the
file directly as read/writable region.

> 2) reloc method does have more instrumentation overhead, however it is not
>> as large as we think.
>>    a) the bias load can be done at the start of the function
>>    b) on x86, the addressing mode kicks in, there is actually no size
>> difference in terms of updating:
>>         addq    $1, counter+16(%rip)
>>    vs
>>        addq    $1, counter+16(%rax)
>>
>>   both are 7 bytes long.
>>
>> so the overhead will be a few bytes more per function.
>>
>>
>> I see, thanks for this correction!
>>
>
Thanks for the suggestion David! The prototype implementation I have loads
the bias/reloc on every counter update but loading the bias/reloc at the
start of the function is going to be more efficient. I'll update the
implementation and try to get some numbers by compiling instrumented Clang.

> 3) reloc method produces smaller sized raw profile
>>
>>
>> I'm interested in understanding how major of an issue this is for
>> Google's use cases. On iOS/macOS, the worst case overhead of section
>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>> What's the expected file size overhead for Google? Is a much larger page
>> size in use? If so, is switching to a smaller page size when running
>> instrumented binaries an option?
>>
>> In addition to understanding this, per my last comment to Petr, I would
>> also like to understand the specific security concerns w.r.t mapping a
>> section onto a file. If this is something Darwin should not allow, I would
>> like to let our kernel team know!
>>
>
It's not a specific security concern, rather it's a consequence of
Fuchsia's design/capability model. In Fuchsia, the address space
(sub)regions are objects which are accessed and "operated on" through
handles (capabilities). When loading modules (executables and DSOs), our
dynamic linker creates a new subregion for each one, mmaps individual
segments into those subregions and then closes all subregion handles. That
doesn't unmap those files, but it means that the address space layout can
never be modified because there's no way to recover those handles. That's a
nice property because it means that all module mapping is immutable (e.g.
it's impossible to make any of the executable segments writable or even
readable if you use execute-only memory on AArch64). However, it also means
that we cannot mmap anything over those segments as would be needed for the
continuous mode. The only way to make this work is to avoid closing the
subregion handles and keep those for later use, but that would also make
our dynamic linker a potential attack vector.

More generally, relying on the overmap behavior means tying the
implementation to the OS-specific API which is available on Linux and
Darwin, but may not be available on other OSes (e.g. I'm not even sure if
this is possible on Windows). It'd be nice to implement a solution that
could be made to work on all OSes.

> 4) The continuous method has the advantage of being more efficient in
>> running speed.
>>
>>
>> As pointed out above, the reduced memory overhead is another (perhaps
>> more significant) factor.
>>
>>
>> Also, I believe both methods should support on-line merging -- as the
>> profile update happens 'In-place' -- the online merging code can be
>> simplified (no locking, reading, updating are needed).
>>
>>
>> Thanks for pointing this out. In my last experiment, I added a `while
>> (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c`
>> after it set up the mmap(), and then edited the counter section of the
>> profile using a hex editor. The updates did not appear in the
>> `proof-of-concept` process. But perhaps there is a more sophisticated way
>> to share the physical pages backing `__llvm_prf_cnts` between processes,
>> while they are also mapped onto the filesystem. I need to run more
>> experiments to understand this.
>>
>> vedant
>>
>>
>>
>> David
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>
>>> vsk added a comment.
>>>
>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>, @phosek
>>> wrote:
>>>
>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>, @davidxl
>>> wrote:
>>> >
>>> > > +petr who has similar needs for Fuchia platform
>>> >
>>> >
>>> > Thank you for including me, we have exactly the same use case in
>>> Fuchsia and this is the solution I've initially considered and been
>>> experimenting with locally based on the discussion with @davidxl. However,
>>> after internal discussion we've decided against this solution for two
>>> reasons:
>>> >
>>> > 1. This requires the ability to mmap the output file over the (portion
>>> of) binary which is something we don't allow in Fuchsia for security
>>> reasons; once all modules have been mapped in by the dynamic linker, we
>>> don't allow any further changes to their mapping and using this solution
>>> would require special dynamic linker which is possible but highly
>>> undesirable.
>>> > 2. It introduces bloat to both the binary and the output file. It also
>>> complicates the runtime implementation due to alignment and padding
>>> requirements.
>>> >
>>> >   The alternative solution we've came up and that I've now been
>>> implementing is to change the instrumentation to allow extra level of
>>> indirection. Concretely, the instrumentation conceptually does the
>>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>>> change this `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>>> c` where `bias` is a global variable set by the runtime to be the offset
>>> between the `__llvm_prf_cnts` section and the corresponding location in the
>>> file that's mmapped in. Initially, that offset would be `0` which would
>>> result in exactly the same behavior as today, but the runtime can mmap the
>>> output file into address space and then change the offset to make the
>>> counters be continuously updated.
>>> >
>>> >   The advantage of this solution is that there are no changes needed
>>> to the profile format. It also doesn't require mmapping the output file
>>> over the binary, the output file can be mmapped anywhere in the address
>>> space. The disadvantage is extra overhead since instrumentation is going to
>>> be slightly more complicated, although we don't have any numbers yet to
>>> quantify how much slower it's going to be. The implementation should be
>>> fairly minimal and my tentative plan was to gate it on compiler switch, so
>>> it wouldn't affect existing in any way (modulo introducing one new variable
>>> in the runtime to hold the bias). I'm hoping to have the prototype ready
>>> and uploaded for review within the next few days. What's your opinion on
>>> this idea? Would this be something that you'd be interested in as an
>>> alternative approach?
>>>
>>>
>>> @phosek thank you for sharing this alternative. I hope you don't mind my
>>> calling this the bias method, after the term in the proposed
>>> instrumentation.
>>>
>>> The TLDR is that I don't think the bias method is a good fit for
>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>> from reduced section alignment. Let me try and justify these claims :).
>>>
>>> First and most critically, note that (on Linux & Darwin, at least),
>>> mmap() requires that the file `offset` argument be page-aligned. So, I
>>> don't believe there's a way to avoid changing the raw profile format, or to
>>> avoid the concomitant complexity necessary to calculate padding bytes in
>>> the runtime. I don't see how the bias method solves this problem: this
>>> seems to be a fundamental limitation of mmap().
>>>
>>> Second, note that the size overhead of counter increment instrumentation
>>> dwarfs the alignment overhead for `__llvm_prf_{cnts,data}` for all but the
>>> smallest of programs. Assuming 16K pages, in the worst case, the section
>>> alignment costs 32K per image. Assuming an increment can be encoded in 7
>>> bytes as `incq l___profc_main(%rip)`, this section alignment cost is
>>> equivalent to ~4700 increments. For comparison, this is roughly the number
>>> of counter increments in //FileCheck.cpp.o//. If the size overhead of
>>> counter instrumentation were to double, as I believe it would with the bias
>>> method, it would rapidly erase the memory savings from eliminating the
>>> section alignment requirement.
>>>
>>> Third, I'm not sure I understand the security constraints in Fuchsia
>>> that make it undesirable to map the contents of a section onto a file. Is
>>> the concern that an attacker process may update the file, thereby changing
>>> the in-memory contents of the section? I'll note that on Darwin, at least,
>>> the kernel does not permit this kind of sharing. If an attacker process
>>> modifies a file on disk that has the in-memory contents of a section
>>> MAP_SHARED over it, the in-memory contents of the mapped section are not
>>> changed (I verified this by making a small modification to
>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>> the bias method will not be necessary to support continuous mode on
>>> Fuchsia, but it seems this depends on the answer to the security question.
>>> Either way, istm that the high level approach taken in this patch
>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>> support that can be used to implement the bias method.
>>>
>>> I appreciate your feedback and think we can definitely work together to
>>> build some kind of common solution!
>>>
>>>
>>> CHANGES SINCE LAST ACTION
>>>   https://reviews.llvm.org/D68351/new/
>>>
>>> https://reviews.llvm.org/D68351
>>>
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191004/e8b41764/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4843 bytes
Desc: S/MIME Cryptographic Signature
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191004/e8b41764/attachment-0001.bin>