[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Mon Oct 14 15:32:08 PDT 2019

Ok. covmap section is not loadable, so it is not a big issue.

On Mon, Oct 14, 2019 at 3:29 PM Petr Hosek <phosek at google.com> wrote:

> I have zlib enabled. It appears to be mostly __llvm_covmap which
> is 655MB, __llvm_prf_names is 159MB, __llvm_prf_data is 32MB.
>
> On Mon, Oct 14, 2019 at 3:22 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Mon, Oct 14, 2019 at 3:17 PM Petr Hosek <phosek at google.com> wrote:
>>
>>> I've prototyped both approaches: relocated counters and dynamically
>>> allocated counters (the most recent proposal), to see how they'd perform.
>>> I've used both prototypes to build Clang and then used the instrumented
>>> Clang to compile zstd_opt.c
>>> <https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_opt.c> which
>>> is one of the slowest files to compile in our build (the reported times are
>>> median from several runs).
>>>
>>> Uninstrumented Clang: 112M, 11.013s
>>> Existing instrumentation (statically allocated counters): 1247M, 34.626s
>>> Dynamically allocated counters: 1176M, 48.43s
>>> Relocated counters: 1292M, 47.847s
>>>
>>> I'm not sure why the instrumented Clang binary (in all cases) tends to
>>> get so big, originally I suspected it may be because the instrumentation is
>>> incompatible with --gc-sections so I disabled that option for the
>>> uninstrumented Clang but the difference is still roughly 10x. Do you have
>>> any insight on this?
>>>
>>>
>> I suspect it is mostly from the prfname section (uncompressed). Try using
>> a clang built with zlib, then the names will be compressed.
>>
>> David
>>
>>
>>
>>> In terms of performance, both approaches are similar, and roughly 40%
>>> slower than the existing instrumentation. In terms of size, as expected the
>>> dynamically allocated counters variant is smaller (by about 5.7%) because
>>> we can omit the counters section, the relocated counters version is larger
>>> (by about 3.6%) because of the extra instrumentation although this
>>> instrumentation doesn't seem to be affecting performance as much as I
>>> expected.
>>>
>>> In terms of pros and cons:
>>>
>>>    - In case of the dynamically allocated counters, aside from the
>>>    issue mentioned before where we need to ensure that the runtime ran before
>>>    we start executing any instrumented code (this actually came up in my
>>>    testing and I had to annotate Runtime Registration
>>>    with __attribute__((init_priority(101)))), there's also the requirement
>>>    that __llvm_prf_data has to be writable; this is the case today, but I've
>>>    been separately working on a patch to use relative relocations which would
>>>    avoid that requirement making it possible to move __llvm_prf_data into
>>>    rodata and deduplicating memory across processes (this would be useful when
>>>    collecting profiles from shared libraries that are used by multiple
>>>    processes), but with this solution that won't be possible
>>>    - In case of relocated counters, there's the extra overhead because
>>>    of the increased binary size and having two copies of counters in memory
>>>    (although it might be possible to deduplicate one of those for shared
>>>    libraries across processes, see above).
>>>
>>> Either of these solutions should work for us (the dynamically allocated
>>> counter one might requirement a bit of extra work to annotate the early
>>> path in our libc and dynamic linker). Do you have any preference? Once we
>>> agree, I'll try to clean up the implementation a bit and upload it for
>>> review.
>>>
>>> On Mon, Oct 7, 2019 at 10:56 AM Vedant Kumar <vedant_kumar at apple.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Oct 5, 2019, at 1:03 PM, Petr Hosek <phosek at google.com> wrote:
>>>>
>>>> I've considered omitting the `__llvm_prf_cnts` section, the only
>>>> downside of doing so is you need to make sure that the space for counters
>>>> is allocated before you execute any instrumented code. That makes it tricky
>>>> if you want to instrument libc which is something we do today.
>>>>
>>>> We can work around it though: a simple solution is to avoid collecting
>>>> profiles for libc, a better solution would be to carefully annotate all
>>>> libc functions that are executed before constructors to ensure that they
>>>> aren't instrumented e.g. by introducing a no_profile attribute, this is the
>>>> same approach we use ASan by annotating all function on the early startup
>>>> path with __attribute__((no_sanitize("address"))).
>>>>
>>>>
>>>> I think the `no_profile` approach makes sense. One downside is that in
>>>> non-continuous mode, annotated libc functions will not be profiled. Some
>>>> alternatives are 1) having a more specific `no_profile_in_continuous_mode`
>>>> annotation, or 2) to not instrument libc in continuous mode and rely on
>>>> libc unittests which run in non-continuous mode only. It’s probably a bit
>>>> easier to support value profiling with (2), if that’s a concern.
>>>>
>>>> vedant
>>>>
>>>>
>>>> On Fri, Oct 4, 2019 at 4:43 PM Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I will let Petr chime in explaining more details -- the reply I
>>>>>>> provided are based on my understanding of 'Reloc' proposal, which may be
>>>>>>> quite off.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <
>>>>>>>> xinliangli at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Petr's method is more like 'relocating' the counter memory from
>>>>>>>> static allocated memory to mmapped memory, but better be named 'reloc'
>>>>>>>> method.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you, that's a much better name.
>>>>>>>>
>>>>>>>>
>>>>>>>> Now lets do some comparison:
>>>>>>>>
>>>>>>>> 1) reloc method allows runtime to mmap from the start of the raw
>>>>>>>> profile, so there is no need to page align the counter data in the file;
>>>>>>>>
>>>>>>>>
>>>>>>>> Let me see if I follow. Is the idea to allocate a buffer large
>>>>>>>> enough to contain the entire contiguous raw profile, in each instrumented
>>>>>>>> image? I can see how it would be possible to mmap() a single such buffer
>>>>>>>> onto a raw profile without any extra padding or format changes. However,
>>>>>>>> this buffer allocation entails doubling the memory overhead of the
>>>>>>>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>>>>>>>> copied into the buffer-to-be-mmap'd. This may not work on well on iOS, as
>>>>>>>> the devices we support are quite memory-constrained. The per-process memory
>>>>>>>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>>>>>>>> Music.app already get very close to this limit.
>>>>>>>>
>>>>>>>> Also, with more than one instrumented image (DSOs), I believe
>>>>>>>> page-alignment padding is still necessary so that each in-memory raw
>>>>>>>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>>>>>>>
>>>>>>>> Perhaps that is a non-issue, as it can be solved by simply creating
>>>>>>>> N different raw profiles -- one per instrumented image. Actually this
>>>>>>>> should work for both proposed schemes.
>>>>>>>>
>>>>>>>
>>>>>> David's understanding is correct, my plan is to mmap the entire raw
>>>>>> profile from the start which means that there's no requirement for page
>>>>>> alignment. Alternative would be to mmap only a page-aligned portion of the
>>>>>> raw file of file that contains counters, i.e. we would mmap
>>>>>> [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length +
>>>>>> PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation, we can
>>>>>> simply mmap the file directly as read/writable region.
>>>>>>
>>>>>>> 2) reloc method does have more instrumentation overhead, however it
>>>>>>>> is not as large as we think.
>>>>>>>>    a) the bias load can be done at the start of the function
>>>>>>>>    b) on x86, the addressing mode kicks in, there is actually no
>>>>>>>> size difference in terms of updating:
>>>>>>>>         addq    $1, counter+16(%rip)
>>>>>>>>    vs
>>>>>>>>        addq    $1, counter+16(%rax)
>>>>>>>>
>>>>>>>>   both are 7 bytes long.
>>>>>>>>
>>>>>>>> so the overhead will be a few bytes more per function.
>>>>>>>>
>>>>>>>>
>>>>>>>> I see, thanks for this correction!
>>>>>>>>
>>>>>>>
>>>>>> Thanks for the suggestion David! The prototype implementation I have
>>>>>> loads the bias/reloc on every counter update but loading the bias/reloc at
>>>>>> the start of the function is going to be more efficient. I'll update the
>>>>>> implementation and try to get some numbers by compiling instrumented Clang.
>>>>>>
>>>>>>> 3) reloc method produces smaller sized raw profile
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm interested in understanding how major of an issue this is for
>>>>>>>> Google's use cases. On iOS/macOS, the worst case overhead of section
>>>>>>>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>>>>>>>> What's the expected file size overhead for Google? Is a much larger page
>>>>>>>> size in use? If so, is switching to a smaller page size when running
>>>>>>>> instrumented binaries an option?
>>>>>>>>
>>>>>>>> In addition to understanding this, per my last comment to Petr, I
>>>>>>>> would also like to understand the specific security concerns w.r.t mapping
>>>>>>>> a section onto a file. If this is something Darwin should not allow, I
>>>>>>>> would like to let our kernel team know!
>>>>>>>>
>>>>>>>
>>>>>> It's not a specific security concern, rather it's a consequence of
>>>>>> Fuchsia's design/capability model. In Fuchsia, the address space
>>>>>> (sub)regions are objects which are accessed and "operated on" through
>>>>>> handles (capabilities). When loading modules (executables and DSOs), our
>>>>>> dynamic linker creates a new subregion for each one, mmaps individual
>>>>>> segments into those subregions and then closes all subregion handles. That
>>>>>> doesn't unmap those files, but it means that the address space layout can
>>>>>> never be modified because there's no way to recover those handles. That's a
>>>>>> nice property because it means that all module mapping is immutable (e.g.
>>>>>> it's impossible to make any of the executable segments writable or even
>>>>>> readable if you use execute-only memory on AArch64). However, it also means
>>>>>> that we cannot mmap anything over those segments as would be needed for the
>>>>>> continuous mode. The only way to make this work is to avoid closing the
>>>>>> subregion handles and keep those for later use, but that would also make
>>>>>> our dynamic linker a potential attack vector.
>>>>>>
>>>>>> More generally, relying on the overmap behavior means tying the
>>>>>> implementation to the OS-specific API which is available on Linux and
>>>>>> Darwin, but may not be available on other OSes (e.g. I'm not even sure if
>>>>>> this is possible on Windows). It'd be nice to implement a solution that
>>>>>> could be made to work on all OSes.
>>>>>>
>>>>>>
>>>>>> Ok, I'm convinced that the reloc mode is the more portable approach.
>>>>>> Per David's comments (the one about not actually emitting `__llvm_prf_cnts`
>>>>>> in the binary //especially//, although I haven't quoted it here), reloc
>>>>>> mode can probably be implemented reasonably efficiently on Linux/Darwin.
>>>>>> The runtime can inspect `__llvm_prf_data` to determine the number of
>>>>>> counters, mmap() a sufficient amount of memory, and then update the
>>>>>> CounterPtr fields of all the data entries. This means the instrumentation
>>>>>> would need to look like: https://godbolt.org/z/-PvObE.
>>>>>>
>>>>>>
>>>>> incr3 version looks great.
>>>>>
>>>>>
>>>>>> @Petr I'd be happy to work together on this. Let me know if you'd
>>>>>> like any help prototyping/testing, perhaps we can split up some of the work.
>>>>>>
>>>>>> One thing we haven't really resolved is the conversation about not
>>>>>> changing the raw profile format. Fwiw, I don't think avoiding changes to
>>>>>> the raw profile format should be a design goal, as the point of having
>>>>>> separate raw/indexed formats is that the raw format should be cheap to rev.
>>>>>> I think that I've demonstrated that the file size overhead from adding
>>>>>> section alignment padding is small: I suspect that getting rid of this
>>>>>> would introduce a considerable amount of complexity to any kind of
>>>>>> mmap()-mode implementation.
>>>>>>
>>>>>>
>>>>> if there is a need to change raw format to make things more efficient,
>>>>> go for it.
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>> Anyway, I'm excited to see your results. Please keep me in the loop!
>>>>>>
>>>>>> thanks,
>>>>>> vedant
>>>>>>
>>>>>>
>>>>>> 4) The continuous method has the advantage of being more efficient in
>>>>>>>> running speed.
>>>>>>>>
>>>>>>>>
>>>>>>>> As pointed out above, the reduced memory overhead is another
>>>>>>>> (perhaps more significant) factor.
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, I believe both methods should support on-line merging -- as
>>>>>>>> the profile update happens 'In-place' -- the online merging code can be
>>>>>>>> simplified (no locking, reading, updating are needed).
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for pointing this out. In my last experiment, I added a
>>>>>>>> `while (true) { print-counters; sleep(...) }` loop to
>>>>>>>> `darwin-proof-of-concept.c` after it set up the mmap(), and then edited the
>>>>>>>> counter section of the profile using a hex editor. The updates did not
>>>>>>>> appear in the `proof-of-concept` process. But perhaps there is a more
>>>>>>>> sophisticated way to share the physical pages backing `__llvm_prf_cnts`
>>>>>>>> between processes, while they are also mapped onto the filesystem. I need
>>>>>>>> to run more experiments to understand this.
>>>>>>>>
>>>>>>>> vedant
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>>>>>>>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>>> vsk added a comment.
>>>>>>>>>
>>>>>>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>,
>>>>>>>>> @phosek wrote:
>>>>>>>>>
>>>>>>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>,
>>>>>>>>> @davidxl wrote:
>>>>>>>>> >
>>>>>>>>> > > +petr who has similar needs for Fuchia platform
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Thank you for including me, we have exactly the same use case in
>>>>>>>>> Fuchsia and this is the solution I've initially considered and been
>>>>>>>>> experimenting with locally based on the discussion with @davidxl. However,
>>>>>>>>> after internal discussion we've decided against this solution for two
>>>>>>>>> reasons:
>>>>>>>>> >
>>>>>>>>> > 1. This requires the ability to mmap the output file over the
>>>>>>>>> (portion of) binary which is something we don't allow in Fuchsia for
>>>>>>>>> security reasons; once all modules have been mapped in by the dynamic
>>>>>>>>> linker, we don't allow any further changes to their mapping and using this
>>>>>>>>> solution would require special dynamic linker which is possible but highly
>>>>>>>>> undesirable.
>>>>>>>>> > 2. It introduces bloat to both the binary and the output file.
>>>>>>>>> It also complicates the runtime implementation due to alignment and padding
>>>>>>>>> requirements.
>>>>>>>>> >
>>>>>>>>> >   The alternative solution we've came up and that I've now been
>>>>>>>>> implementing is to change the instrumentation to allow extra level of
>>>>>>>>> indirection. Concretely, the instrumentation conceptually does the
>>>>>>>>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>>>>>>>>> change this `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>>>>>>>>> c` where `bias` is a global variable set by the runtime to be the offset
>>>>>>>>> between the `__llvm_prf_cnts` section and the corresponding location in the
>>>>>>>>> file that's mmapped in. Initially, that offset would be `0` which would
>>>>>>>>> result in exactly the same behavior as today, but the runtime can mmap the
>>>>>>>>> output file into address space and then change the offset to make the
>>>>>>>>> counters be continuously updated.
>>>>>>>>> >
>>>>>>>>> >   The advantage of this solution is that there are no changes
>>>>>>>>> needed to the profile format. It also doesn't require mmapping the output
>>>>>>>>> file over the binary, the output file can be mmapped anywhere in the
>>>>>>>>> address space. The disadvantage is extra overhead since instrumentation is
>>>>>>>>> going to be slightly more complicated, although we don't have any numbers
>>>>>>>>> yet to quantify how much slower it's going to be. The implementation should
>>>>>>>>> be fairly minimal and my tentative plan was to gate it on compiler switch,
>>>>>>>>> so it wouldn't affect existing in any way (modulo introducing one new
>>>>>>>>> variable in the runtime to hold the bias). I'm hoping to have the prototype
>>>>>>>>> ready and uploaded for review within the next few days. What's your opinion
>>>>>>>>> on this idea? Would this be something that you'd be interested in as an
>>>>>>>>> alternative approach?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> @phosek thank you for sharing this alternative. I hope you don't
>>>>>>>>> mind my calling this the bias method, after the term in the proposed
>>>>>>>>> instrumentation.
>>>>>>>>>
>>>>>>>>> The TLDR is that I don't think the bias method is a good fit for
>>>>>>>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>>>>>>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>>>>>>>> from reduced section alignment. Let me try and justify these claims :).
>>>>>>>>>
>>>>>>>>> First and most critically, note that (on Linux & Darwin, at
>>>>>>>>> least), mmap() requires that the file `offset` argument be page-aligned.
>>>>>>>>> So, I don't believe there's a way to avoid changing the raw profile format,
>>>>>>>>> or to avoid the concomitant complexity necessary to calculate padding bytes
>>>>>>>>> in the runtime. I don't see how the bias method solves this problem: this
>>>>>>>>> seems to be a fundamental limitation of mmap().
>>>>>>>>>
>>>>>>>>> Second, note that the size overhead of counter increment
>>>>>>>>> instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}`
>>>>>>>>> for all but the smallest of programs. Assuming 16K pages, in the worst
>>>>>>>>> case, the section alignment costs 32K per image. Assuming an increment can
>>>>>>>>> be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section
>>>>>>>>> alignment cost is equivalent to ~4700 increments. For comparison, this is
>>>>>>>>> roughly the number of counter increments in //FileCheck.cpp.o//. If the
>>>>>>>>> size overhead of counter instrumentation were to double, as I believe it
>>>>>>>>> would with the bias method, it would rapidly erase the memory savings from
>>>>>>>>> eliminating the section alignment requirement.
>>>>>>>>>
>>>>>>>>> Third, I'm not sure I understand the security constraints in
>>>>>>>>> Fuchsia that make it undesirable to map the contents of a section onto a
>>>>>>>>> file. Is the concern that an attacker process may update the file, thereby
>>>>>>>>> changing the in-memory contents of the section? I'll note that on Darwin,
>>>>>>>>> at least, the kernel does not permit this kind of sharing. If an attacker
>>>>>>>>> process modifies a file on disk that has the in-memory contents of a
>>>>>>>>> section MAP_SHARED over it, the in-memory contents of the mapped section
>>>>>>>>> are not changed (I verified this by making a small modification to
>>>>>>>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>>>>>>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>>>>>>>> the bias method will not be necessary to support continuous mode on
>>>>>>>>> Fuchsia, but it seems this depends on the answer to the security question.
>>>>>>>>> Either way, istm that the high level approach taken in this patch
>>>>>>>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>>>>>>>> support that can be used to implement the bias method.
>>>>>>>>>
>>>>>>>>> I appreciate your feedback and think we can definitely work
>>>>>>>>> together to build some kind of common solution!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> CHANGES SINCE LAST ACTION
>>>>>>>>>   https://reviews.llvm.org/D68351/new/
>>>>>>>>>
>>>>>>>>> https://reviews.llvm.org/D68351
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> llvm-commits mailing list
>>>>>>>>> llvm-commits at lists.llvm.org
>>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191014/d042d200/attachment-0001.html>