[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Xinliang David Li via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 14 15:22:07 PDT 2019


On Mon, Oct 14, 2019 at 3:17 PM Petr Hosek <phosek at google.com> wrote:

> I've prototyped both approaches: relocated counters and dynamically
> allocated counters (the most recent proposal), to see how they'd perform.
> I've used both prototypes to build Clang and then used the instrumented
> Clang to compile zstd_opt.c
> <https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_opt.c> which
> is one of the slowest files to compile in our build (the reported times are
> median from several runs).
>
> Uninstrumented Clang: 112M, 11.013s
> Existing instrumentation (statically allocated counters): 1247M, 34.626s
> Dynamically allocated counters: 1176M, 48.43s
> Relocated counters: 1292M, 47.847s
>
> I'm not sure why the instrumented Clang binary (in all cases) tends to get
> so big. Originally I suspected it may be because the instrumentation is
> incompatible with --gc-sections, so I disabled that option for the
> uninstrumented Clang, but the difference is still roughly 10x. Do you have
> any insight on this?
>
>
I suspect it is mostly from the __llvm_prf_names section (uncompressed). Try
using a clang built with zlib support; then the names will be compressed.

David



> In terms of performance, both approaches are similar, and roughly 40%
> slower than the existing instrumentation. In terms of size, as expected,
> the dynamically allocated counters variant is smaller (by about 5.7%)
> because we can omit the counters section, while the relocated counters
> version is larger (by about 3.6%) because of the extra instrumentation,
> although this instrumentation doesn't seem to be affecting performance as
> much as I expected.
>
> In terms of pros and cons:
>
>    - In case of the dynamically allocated counters, aside from the issue
>    mentioned before where we need to ensure that the runtime runs before we
>    start executing any instrumented code (this actually came up in my testing
>    and I had to annotate the runtime registration
>    with __attribute__((init_priority(101)))), there's also the requirement
>    that __llvm_prf_data has to be writable. This is the case today, but I've
>    been separately working on a patch to use relative relocations, which
>    would avoid that requirement, making it possible to move __llvm_prf_data
>    into rodata and deduplicate memory across processes (this would be useful
>    when collecting profiles from shared libraries that are used by multiple
>    processes); with this solution that won't be possible.
>    - In case of relocated counters, there's the extra overhead because of
>    the increased binary size and having two copies of counters in memory
>    (although it might be possible to deduplicate one of those for shared
>    libraries across processes, see above).
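The constructor-ordering trick from the first bullet can be sketched as follows (a minimal sketch with illustrative names, not the actual runtime registration code; note that for a plain C function the attribute is `constructor(priority)` rather than `init_priority`, and 101 is the lowest priority available to user code):

```c
#include <stdio.h>

static int profile_ready = 0;

/* Hypothetical sketch: a constructor with priority 101 runs before
 * ordinary default-priority constructors, so counter storage can be
 * set up before any instrumented constructor executes. */
__attribute__((constructor(101)))
static void allocate_counters_early(void) {
  /* The real runtime would mmap() counter storage and register the
   * image here; this sketch just records that it ran first. */
  profile_ready = 1;
}

__attribute__((constructor))
static void ordinary_ctor(void) {
  /* Any instrumented constructor can now rely on the counters. */
  if (!profile_ready)
    printf("counters not ready!\n");
}
```

With default priorities, nothing guarantees the profile runtime's constructor runs first, which is exactly the failure mode described above.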
>
> Either of these solutions should work for us (the dynamically allocated
> counter one might require a bit of extra work to annotate the early
> path in our libc and dynamic linker). Do you have any preference? Once we
> agree, I'll try to clean up the implementation a bit and upload it for
> review.
>
> On Mon, Oct 7, 2019 at 10:56 AM Vedant Kumar <vedant_kumar at apple.com>
> wrote:
>
>>
>>
>> On Oct 5, 2019, at 1:03 PM, Petr Hosek <phosek at google.com> wrote:
>>
>> I've considered omitting the `__llvm_prf_cnts` section; the only downside
>> of doing so is that you need to make sure that the space for counters is
>> allocated before you execute any instrumented code. That makes it tricky
>> if you want to instrument libc, which is something we do today.
>>
>> We can work around it, though: a simple solution is to avoid collecting
>> profiles for libc; a better solution would be to carefully annotate all
>> libc functions that are executed before constructors to ensure that they
>> aren't instrumented, e.g. by introducing a no_profile attribute. This is
>> the same approach we use for ASan, annotating all functions on the early
>> startup path with __attribute__((no_sanitize("address"))).
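A sketch of what such an annotation could look like (hypothetical: the proposed no_profile attribute does not exist yet, so the name is illustrative, and unknown attributes only draw a warning in GNU C; function names are made up):

```c
/* Hypothetical: a no_profile attribute would let early-startup libc
 * functions opt out of instrumentation entirely. */
__attribute__((no_profile))
int early_libc_path(int x) {
  /* Runs before constructors; must not touch profile counters. */
  return x + 1;
}

/* The existing ASan analogue mentioned above, which does exist in
 * both Clang and GCC today. */
__attribute__((no_sanitize("address")))
int early_asan_path(int x) {
  return x * 2;
}
```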
>>
>>
>> I think the `no_profile` approach makes sense. One downside is that in
>> non-continuous mode, annotated libc functions will not be profiled. Some
>> alternatives are 1) having a more specific `no_profile_in_continuous_mode`
>> annotation, or 2) to not instrument libc in continuous mode and rely on
>> libc unittests which run in non-continuous mode only. It’s probably a bit
>> easier to support value profiling with (2), if that’s a concern.
>>
>> vedant
>>
>>
>> On Fri, Oct 4, 2019 at 4:43 PM Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:
>>>
>>>>
>>>>
>>>> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>>>>
>>>> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>> I will let Petr chime in explaining more details -- the reply I
>>>>> provided are based on my understanding of 'Reloc' proposal, which may be
>>>>> quite off.
>>>>>
>>>>> David
>>>>>
>>>>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Petr's method is more like 'relocating' the counter memory from
>>>>>> statically allocated memory to mmapped memory, so it might better be
>>>>>> named the 'reloc' method.
>>>>>>
>>>>>>
>>>>>> Thank you, that's a much better name.
>>>>>>
>>>>>>
>>>>>> Now lets do some comparison:
>>>>>>
>>>>>> 1) The reloc method allows the runtime to mmap from the start of the
>>>>>> raw profile, so there is no need to page-align the counter data in the
>>>>>> file;
>>>>>>
>>>>>>
>>>>>> Let me see if I follow. Is the idea to allocate a buffer large enough
>>>>>> to contain the entire contiguous raw profile, in each instrumented image? I
>>>>>> can see how it would be possible to mmap() a single such buffer onto a raw
>>>>>> profile without any extra padding or format changes. However, this buffer
>>>>>> allocation entails doubling the memory overhead of the
>>>>>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>>>>>> copied into the buffer-to-be-mmap'd. This may not work well on iOS, as
>>>>>> the devices we support are quite memory-constrained. The per-process memory
>>>>>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>>>>>> Music.app already get very close to this limit.
>>>>>>
>>>>>> Also, with more than one instrumented image (DSOs), I believe
>>>>>> page-alignment padding is still necessary so that each in-memory raw
>>>>>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>>>>>
>>>>>> Perhaps that is a non-issue, as it can be solved by simply creating N
>>>>>> different raw profiles -- one per instrumented image. Actually this should
>>>>>> work for both proposed schemes.
>>>>>>
>>>>>
>>>> David's understanding is correct: my plan is to mmap the entire raw
>>>> profile from the start, which means that there's no requirement for page
>>>> alignment. An alternative would be to mmap only the page-aligned portion
>>>> of the raw file that contains the counters, i.e. we would mmap
>>>> [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length +
>>>> PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation; we
>>>> can simply mmap the file directly as a read/writable region.
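The page-rounding arithmetic in that mapping expression can be written out as helper functions (a sketch; the page size used in the examples is an assumption):

```c
#include <stdint.h>

/* addr & -PAGE_SIZE in the text above: round an address down to
 * the start of its page by clearing the low bits. */
static uintptr_t page_floor(uintptr_t addr, uintptr_t page_size) {
  return addr & ~(page_size - 1);
}

/* (addr + PAGE_SIZE - 1) & -PAGE_SIZE: round an address up to the
 * next page boundary (a no-op if it is already aligned). */
static uintptr_t page_ceil(uintptr_t addr, uintptr_t page_size) {
  return (addr + page_size - 1) & ~(page_size - 1);
}
```

The mmap()'d range for the counters would then be [page_floor(cnts_begin, PAGE_SIZE), page_ceil(cnts_begin + cnts_length, PAGE_SIZE)), which is what the mmap() page-alignment requirement forces.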
>>>>
>>>>> 2) The reloc method does have more instrumentation overhead; however,
>>>>>> it is not as large as we think.
>>>>>>    a) the bias load can be done at the start of the function
>>>>>>    b) on x86, the addressing mode kicks in, there is actually no size
>>>>>> difference in terms of updating:
>>>>>>        addq    $1, counter+16(%rip)
>>>>>>    vs
>>>>>>        addq    $1, counter+16(%rax)
>>>>>>
>>>>>>   both are 7 bytes long.
>>>>>>
>>>>>> so the overhead will be a few bytes more per function.
>>>>>>
>>>>>>
>>>>>> I see, thanks for this correction!
>>>>>>
>>>>>
>>>> Thanks for the suggestion David! The prototype implementation I have
>>>> loads the bias/reloc on every counter update but loading the bias/reloc at
>>>> the start of the function is going to be more efficient. I'll update the
>>>> implementation and try to get some numbers by compiling instrumented Clang.
>>>>
>>>>> 3) reloc method produces smaller sized raw profile
>>>>>>
>>>>>>
>>>>>> I'm interested in understanding how major of an issue this is for
>>>>>> Google's use cases. On iOS/macOS, the worst case overhead of section
>>>>>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>>>>>> What's the expected file size overhead for Google? Is a much larger page
>>>>>> size in use? If so, is switching to a smaller page size when running
>>>>>> instrumented binaries an option?
>>>>>>
>>>>>> In addition to understanding this, per my last comment to Petr, I
>>>>>> would also like to understand the specific security concerns w.r.t mapping
>>>>>> a section onto a file. If this is something Darwin should not allow, I
>>>>>> would like to let our kernel team know!
>>>>>>
>>>>>
>>>> It's not a specific security concern, rather it's a consequence of
>>>> Fuchsia's design/capability model. In Fuchsia, the address space
>>>> (sub)regions are objects which are accessed and "operated on" through
>>>> handles (capabilities). When loading modules (executables and DSOs), our
>>>> dynamic linker creates a new subregion for each one, mmaps individual
>>>> segments into those subregions and then closes all subregion handles. That
>>>> doesn't unmap those files, but it means that the address space layout can
>>>> never be modified because there's no way to recover those handles. That's a
>>>> nice property because it means that all module mapping is immutable (e.g.
>>>> it's impossible to make any of the executable segments writable or even
>>>> readable if you use execute-only memory on AArch64). However, it also means
>>>> that we cannot mmap anything over those segments as would be needed for the
>>>> continuous mode. The only way to make this work is to avoid closing the
>>>> subregion handles and keep those for later use, but that would also make
>>>> our dynamic linker a potential attack vector.
>>>>
>>>> More generally, relying on the overmap behavior means tying the
>>>> implementation to the OS-specific API which is available on Linux and
>>>> Darwin, but may not be available on other OSes (e.g. I'm not even sure if
>>>> this is possible on Windows). It'd be nice to implement a solution that
>>>> could be made to work on all OSes.
>>>>
>>>>
>>>> Ok, I'm convinced that the reloc mode is the more portable approach.
>>>> Per David's comments (especially the one about not actually emitting
>>>> `__llvm_prf_cnts` in the binary, although I haven't quoted it here), reloc
>>>> mode can probably be implemented reasonably efficiently on Linux/Darwin.
>>>> The runtime can inspect `__llvm_prf_data` to determine the number of
>>>> counters, mmap() a sufficient amount of memory, and then update the
>>>> CounterPtr fields of all the data entries. This means the instrumentation
>>>> would need to look like: https://godbolt.org/z/-PvObE.
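A rough sketch of that runtime step (the struct layout and all names here are illustrative stand-ins, not the real __llvm_prf_data definitions in the profile runtime):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a __llvm_prf_data entry; the real entry
 * has more fields and different types. */
typedef struct {
  uint32_t NumCounters;
  uint64_t *CounterPtr;
} ProfDataEntry;

/* Walk the data entries, hand each one a slice of freshly mmap'd
 * (here: caller-provided) storage, and return the total number of
 * counters so the caller knows how much storage was consumed. */
static size_t relocate_counters(ProfDataEntry *begin, ProfDataEntry *end,
                                uint64_t *storage) {
  size_t total = 0;
  for (ProfDataEntry *d = begin; d != end; ++d) {
    d->CounterPtr = storage + total;
    total += d->NumCounters;
  }
  return total;
}
```

A first pass over the entries (or the same loop with a null storage pointer check) would give the total needed for the mmap() size before the pointers are patched.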
>>>>
>>>>
>>> incr3 version looks great.
>>>
>>>
>>>> @Petr I'd be happy to work together on this. Let me know if you'd like
>>>> any help prototyping/testing, perhaps we can split up some of the work.
>>>>
>>>> One thing we haven't really resolved is the conversation about not
>>>> changing the raw profile format. Fwiw, I don't think avoiding changes to
>>>> the raw profile format should be a design goal, as the point of having
>>>> separate raw/indexed formats is that the raw format should be cheap to rev.
>>>> I think that I've demonstrated that the file size overhead from adding
>>>> section alignment padding is small; I suspect that getting rid of it
>>>> would introduce a considerable amount of complexity to any kind of
>>>> mmap()-mode implementation.
>>>>
>>>>
>>> If there is a need to change the raw format to make things more
>>> efficient, go for it.
>>>
>>> David
>>>
>>>
>>>
>>>> Anyway, I'm excited to see your results. Please keep me in the loop!
>>>>
>>>> thanks,
>>>> vedant
>>>>
>>>>
>>>> 4) The continuous method has the advantage of being more efficient in
>>>>>> running speed.
>>>>>>
>>>>>>
>>>>>> As pointed out above, the reduced memory overhead is another (perhaps
>>>>>> more significant) factor.
>>>>>>
>>>>>>
>>>>>> Also, I believe both methods should support online merging -- as the
>>>>>> profile update happens in place, the online merging code can be
>>>>>> simplified (no locking, reading, or updating is needed).
>>>>>>
>>>>>>
>>>>>> Thanks for pointing this out. In my last experiment, I added a `while
>>>>>> (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c`
>>>>>> after it set up the mmap(), and then edited the counter section of the
>>>>>> profile using a hex editor. The updates did not appear in the
>>>>>> `proof-of-concept` process. But perhaps there is a more sophisticated way
>>>>>> to share the physical pages backing `__llvm_prf_cnts` between processes,
>>>>>> while they are also mapped onto the filesystem. I need to run more
>>>>>> experiments to understand this.
>>>>>>
>>>>>> vedant
>>>>>>
>>>>>>
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>>>>>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>>>>
>>>>>>> vsk added a comment.
>>>>>>>
>>>>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>,
>>>>>>> @phosek wrote:
>>>>>>>
>>>>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>,
>>>>>>> @davidxl wrote:
>>>>>>> >
>>>>>>> > > +petr who has similar needs for Fuchsia platform
>>>>>>> >
>>>>>>> >
>>>>>>> > Thank you for including me, we have exactly the same use case in
>>>>>>> Fuchsia and this is the solution I've initially considered and been
>>>>>>> experimenting with locally based on the discussion with @davidxl. However,
>>>>>>> after internal discussion we've decided against this solution for two
>>>>>>> reasons:
>>>>>>> >
>>>>>>> > 1. This requires the ability to mmap the output file over the
>>>>>>> (portion of) binary which is something we don't allow in Fuchsia for
>>>>>>> security reasons; once all modules have been mapped in by the dynamic
>>>>>>> linker, we don't allow any further changes to their mapping and using this
>>>>>>> solution would require special dynamic linker which is possible but highly
>>>>>>> undesirable.
>>>>>>> > 2. It introduces bloat to both the binary and the output file. It
>>>>>>> also complicates the runtime implementation due to alignment and padding
>>>>>>> requirements.
>>>>>>> >
>>>>>>> >   The alternative solution we've come up with, and that I've now
>>>>>>> been implementing, is to change the instrumentation to allow an extra
>>>>>>> level of indirection. Concretely, the instrumentation conceptually does
>>>>>>> the following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd
>>>>>>> like to change this to `c = *(&__profc[idx] + *bias); c++;
>>>>>>> *(&__profc[idx] + *bias) = c` where `bias` is a global variable set by
>>>>>>> the runtime to be the offset between the `__llvm_prf_cnts` section and
>>>>>>> the corresponding location in the file that's mmapped in. Initially,
>>>>>>> that offset would be `0`, which would result in exactly the same
>>>>>>> behavior as today, but the runtime can mmap the output file into the
>>>>>>> address space and then change the offset to make the counters be
>>>>>>> continuously updated.
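In C, the proposed transformation amounts to something like the following (all names are illustrative, not the symbols the patch would actually emit; the bias is taken as a byte offset here):

```c
#include <stdint.h>

/* Runtime-owned bias in bytes; 0 until the runtime mmap()s the
 * output file and retargets the counters. */
static intptr_t counter_bias = 0;

/* The statically allocated counters for some function foo. */
static uint64_t profc_foo[4];

/* What the instrumented increment conceptually becomes: the bias is
 * added to the counter's static address on every update, so flipping
 * the bias redirects all future updates without touching the code. */
static void increment_foo(int idx) {
  uint64_t *c =
      (uint64_t *)((char *)&profc_foo[idx] + counter_bias);
  *c += 1;
}
```

With the bias at 0 this behaves exactly like today's instrumentation; once the runtime points the bias at the mmap'd file, updates land in the file-backed pages instead.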
>>>>>>> >
>>>>>>> >   The advantage of this solution is that there are no changes
>>>>>>> needed to the profile format. It also doesn't require mmapping the
>>>>>>> output file over the binary; the output file can be mmapped anywhere in
>>>>>>> the address space. The disadvantage is extra overhead, since the
>>>>>>> instrumentation is going to be slightly more complicated, although we
>>>>>>> don't have any numbers yet to quantify how much slower it's going to
>>>>>>> be. The implementation should be fairly minimal, and my tentative plan
>>>>>>> was to gate it on a compiler switch, so it wouldn't affect existing
>>>>>>> users in any way (modulo introducing one new variable in the runtime to
>>>>>>> hold the bias). I'm hoping to have the prototype ready and uploaded for
>>>>>>> review within the next few days. What's your opinion on this idea?
>>>>>>> Would this be something that you'd be interested in as an alternative
>>>>>>> approach?
>>>>>>>
>>>>>>>
>>>>>>> @phosek thank you for sharing this alternative. I hope you don't
>>>>>>> mind my calling this the bias method, after the term in the proposed
>>>>>>> instrumentation.
>>>>>>>
>>>>>>> The TLDR is that I don't think the bias method is a good fit for
>>>>>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>>>>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>>>>>> from reduced section alignment. Let me try and justify these claims :).
>>>>>>>
>>>>>>> First and most critically, note that (on Linux & Darwin, at least),
>>>>>>> mmap() requires that the file `offset` argument be page-aligned. So, I
>>>>>>> don't believe there's a way to avoid changing the raw profile format, or to
>>>>>>> avoid the concomitant complexity necessary to calculate padding bytes in
>>>>>>> the runtime. I don't see how the bias method solves this problem: this
>>>>>>> seems to be a fundamental limitation of mmap().
>>>>>>>
>>>>>>> Second, note that the size overhead of counter increment
>>>>>>> instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}`
>>>>>>> for all but the smallest of programs. Assuming 16K pages, in the worst
>>>>>>> case, the section alignment costs 32K per image. Assuming an increment can
>>>>>>> be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section
>>>>>>> alignment cost is equivalent to ~4700 increments. For comparison, this is
>>>>>>> roughly the number of counter increments in //FileCheck.cpp.o//. If the
>>>>>>> size overhead of counter instrumentation were to double, as I believe it
>>>>>>> would with the bias method, it would rapidly erase the memory savings from
>>>>>>> eliminating the section alignment requirement.
>>>>>>>
>>>>>>> Third, I'm not sure I understand the security constraints in Fuchsia
>>>>>>> that make it undesirable to map the contents of a section onto a file. Is
>>>>>>> the concern that an attacker process may update the file, thereby changing
>>>>>>> the in-memory contents of the section? I'll note that on Darwin, at least,
>>>>>>> the kernel does not permit this kind of sharing. If an attacker process
>>>>>>> modifies a file on disk that has the in-memory contents of a section
>>>>>>> MAP_SHARED over it, the in-memory contents of the mapped section are not
>>>>>>> changed (I verified this by making a small modification to
>>>>>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>>>>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>>>>>> the bias method will not be necessary to support continuous mode on
>>>>>>> Fuchsia, but it seems this depends on the answer to the security question.
>>>>>>> Either way, istm that the high level approach taken in this patch
>>>>>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>>>>>> support that can be used to implement the bias method.
>>>>>>>
>>>>>>> I appreciate your feedback and think we can definitely work together
>>>>>>> to build some kind of common solution!
>>>>>>>
>>>>>>>
>>>>>>> CHANGES SINCE LAST ACTION
>>>>>>>   https://reviews.llvm.org/D68351/new/
>>>>>>>
>>>>>>> https://reviews.llvm.org/D68351
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> llvm-commits mailing list
>>>>>>> llvm-commits at lists.llvm.org
>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits