[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Thu Oct 31 23:18:57 PDT 2019

My understanding is that the additional instr runtime overhead is not
something iOS can take (as this will be the default mode). It is a little
unfortunate we need to have two implementation basically doing the same
thing. Hopefully we can make some unficiation in the future.

On Thu, Oct 31, 2019 at 8:39 PM Petr Hosek <phosek at google.com> wrote:

> I've uploaded the change that implements the dynamically allocated
> counters as https://reviews.llvm.org/D69700. I still need to rebase it on
> top of Vedant's change that landed today.
>
> The question is whether this implementation is something we want? I was
> hoping that this variant would satisfy both the requirements of iOS and
> Fuchsia, but Vedant has already landed the implementation that uses padding
> which isn't usable for us. On the other hand, for us the version with
> relocated counters (what I originally described as "bias") is better since
> it can be used even for low-level components (i.e. before the runtime has
> been loaded). I'll try to upload the patch that implements the relocated
> counters tomorrow.
>
> On Tue, Oct 15, 2019 at 2:46 PM Vedant Kumar <vedant_kumar at apple.com>
> wrote:
>
>>
>>
>> On Oct 14, 2019, at 3:16 PM, Petr Hosek <phosek at google.com> wrote:
>>
>> I've prototyped both approaches: relocated counters and dynamically
>> allocated counters (the most recent proposal), to see how they'd perform.
>> I've used both prototypes to build Clang and then used the instrumented
>> Clang to compile zstd_opt.c
>> <https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_opt.c> which
>> is one of the slowest files to compile in our build (the reported times are
>> median from several runs).
>>
>> Uninstrumented Clang: 112M, 11.013s
>> Existing instrumentation (statically allocated counters): 1247M, 34.626s
>> Dynamically allocated counters: 1176M, 48.43s
>> Relocated counters: 1292M, 47.847s
>>
>> I'm not sure why the instrumented Clang binary (in all cases) tends to
>> get so big, originally I suspected it may be because the instrumentation is
>> incompatible with --gc-sections so I disabled that option for the
>> uninstrumented Clang but the difference is still roughly 10x. Do you have
>> any insight on this?
>>
>>
>> Coverage mapping data grows in a super-linear way with the number of
>> translation units, due to definitions #included through headers getting
>> duplicate mappings. This is PR34533, which I haven't had time to pick back
>> up, unfortunately.
>>
>>
>> In terms of performance, both approaches are similar, and roughly 40%
>> slower than the existing instrumentation. In terms of size, as expected the
>> dynamically allocated counters variant is smaller (by about 5.7%) because
>> we can omit the counters section, the relocated counters version is larger
>> (by about 3.6%) because of the extra instrumentation although this
>> instrumentation doesn't seem to be affecting performance as much as I
>> expected.
>>
>> In terms of pros and cons:
>>
>>    - In case of the dynamically allocated counters, aside from the issue
>>    mentioned before where we need to ensure that the runtime ran before we
>>    start executing any instrumented code (this actually came up in my testing
>>    and I had to annotate Runtime Registration
>>    with __attribute__((init_priority(101)))), there's also the requirement
>>    that __llvm_prf_data has to be writable; this is the case today, but I've
>>    been separately working on a patch to use relative relocations which would
>>    avoid that requirement making it possible to move __llvm_prf_data into
>>    rodata and deduplicating memory across processes (this would be useful when
>>    collecting profiles from shared libraries that are used by multiple
>>    processes), but with this solution that won't be possible
>>    - In case of relocated counters, there's the extra overhead because
>>    of the increased binary size and having two copies of counters in memory
>>    (although it might be possible to deduplicate one of those for shared
>>    libraries across processes, see above).
>>
>> Either of these solutions should work for us (the dynamically allocated
>> counter one might requirement a bit of extra work to annotate the early
>> path in our libc and dynamic linker). Do you have any preference? Once we
>> agree, I'll try to clean up the implementation a bit and upload it for
>> review.
>>
>>
>> Thanks for doing this comparison. Based on your results, I think we
>> should support both the current counter instrumentation and the proposed
>> relocated counters method. It looks like there's a significant
>> performance/portability tradeoff that makes both necessary.
>>
>> Do you see any blockers that make maintaining both instrumentation modes
>> infeasible?
>>
>> vedant
>>
>>
>>
>> On Mon, Oct 7, 2019 at 10:56 AM Vedant Kumar <vedant_kumar at apple.com>
>> wrote:
>>
>>>
>>>
>>> On Oct 5, 2019, at 1:03 PM, Petr Hosek <phosek at google.com> wrote:
>>>
>>> I've considered omitting the `__llvm_prf_cnts` section, the only
>>> downside of doing so is you need to make sure that the space for counters
>>> is allocated before you execute any instrumented code. That makes it tricky
>>> if you want to instrument libc which is something we do today.
>>>
>>> We can work around it though: a simple solution is to avoid collecting
>>> profiles for libc, a better solution would be to carefully annotate all
>>> libc functions that are executed before constructors to ensure that they
>>> aren't instrumented e.g. by introducing a no_profile attribute, this is the
>>> same approach we use ASan by annotating all function on the early startup
>>> path with __attribute__((no_sanitize("address"))).
>>>
>>>
>>> I think the `no_profile` approach makes sense. One downside is that in
>>> non-continuous mode, annotated libc functions will not be profiled. Some
>>> alternatives are 1) having a more specific `no_profile_in_continuous_mode`
>>> annotation, or 2) to not instrument libc in continuous mode and rely on
>>> libc unittests which run in non-continuous mode only. It’s probably a bit
>>> easier to support value profiling with (2), if that’s a concern.
>>>
>>> vedant
>>>
>>>
>>> On Fri, Oct 4, 2019 at 4:43 PM Xinliang David Li <davidxl at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>>>>>
>>>>> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com>
>>>>> wrote:
>>>>>
>>>>>> I will let Petr chime in explaining more details -- the reply I
>>>>>> provided are based on my understanding of 'Reloc' proposal, which may be
>>>>>> quite off.
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Petr's method is more like 'relocating' the counter memory from
>>>>>>> static allocated memory to mmapped memory, but better be named 'reloc'
>>>>>>> method.
>>>>>>>
>>>>>>>
>>>>>>> Thank you, that's a much better name.
>>>>>>>
>>>>>>>
>>>>>>> Now lets do some comparison:
>>>>>>>
>>>>>>> 1) reloc method allows runtime to mmap from the start of the raw
>>>>>>> profile, so there is no need to page align the counter data in the file;
>>>>>>>
>>>>>>>
>>>>>>> Let me see if I follow. Is the idea to allocate a buffer large
>>>>>>> enough to contain the entire contiguous raw profile, in each instrumented
>>>>>>> image? I can see how it would be possible to mmap() a single such buffer
>>>>>>> onto a raw profile without any extra padding or format changes. However,
>>>>>>> this buffer allocation entails doubling the memory overhead of the
>>>>>>> `__llvm_prf_{cnts,data,names}` sections, as these would all need to be
>>>>>>> copied into the buffer-to-be-mmap'd. This may not work on well on iOS, as
>>>>>>> the devices we support are quite memory-constrained. The per-process memory
>>>>>>> limit on iOS is ~1.4GB on many devices -- clients like News.app or
>>>>>>> Music.app already get very close to this limit.
>>>>>>>
>>>>>>> Also, with more than one instrumented image (DSOs), I believe
>>>>>>> page-alignment padding is still necessary so that each in-memory raw
>>>>>>> profile buffer can be mmap()'d onto a page-aligned file offset.
>>>>>>>
>>>>>>> Perhaps that is a non-issue, as it can be solved by simply creating
>>>>>>> N different raw profiles -- one per instrumented image. Actually this
>>>>>>> should work for both proposed schemes.
>>>>>>>
>>>>>>
>>>>> David's understanding is correct, my plan is to mmap the entire raw
>>>>> profile from the start which means that there's no requirement for page
>>>>> alignment. Alternative would be to mmap only a page-aligned portion of the
>>>>> raw file of file that contains counters, i.e. we would mmap
>>>>> [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length +
>>>>> PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation, we can
>>>>> simply mmap the file directly as read/writable region.
>>>>>
>>>>>> 2) reloc method does have more instrumentation overhead, however it
>>>>>>> is not as large as we think.
>>>>>>>    a) the bias load can be done at the start of the function
>>>>>>>    b) on x86, the addressing mode kicks in, there is actually no
>>>>>>> size difference in terms of updating:
>>>>>>>         addq    $1, counter+16(%rip)
>>>>>>>    vs
>>>>>>>        addq    $1, counter+16(%rax)
>>>>>>>
>>>>>>>   both are 7 bytes long.
>>>>>>>
>>>>>>> so the overhead will be a few bytes more per function.
>>>>>>>
>>>>>>>
>>>>>>> I see, thanks for this correction!
>>>>>>>
>>>>>>
>>>>> Thanks for the suggestion David! The prototype implementation I have
>>>>> loads the bias/reloc on every counter update but loading the bias/reloc at
>>>>> the start of the function is going to be more efficient. I'll update the
>>>>> implementation and try to get some numbers by compiling instrumented Clang.
>>>>>
>>>>>> 3) reloc method produces smaller sized raw profile
>>>>>>>
>>>>>>>
>>>>>>> I'm interested in understanding how major of an issue this is for
>>>>>>> Google's use cases. On iOS/macOS, the worst case overhead of section
>>>>>>> alignment padding is < 32K/8K respectively per image (due to 16K/4K pages).
>>>>>>> What's the expected file size overhead for Google? Is a much larger page
>>>>>>> size in use? If so, is switching to a smaller page size when running
>>>>>>> instrumented binaries an option?
>>>>>>>
>>>>>>> In addition to understanding this, per my last comment to Petr, I
>>>>>>> would also like to understand the specific security concerns w.r.t mapping
>>>>>>> a section onto a file. If this is something Darwin should not allow, I
>>>>>>> would like to let our kernel team know!
>>>>>>>
>>>>>>
>>>>> It's not a specific security concern, rather it's a consequence of
>>>>> Fuchsia's design/capability model. In Fuchsia, the address space
>>>>> (sub)regions are objects which are accessed and "operated on" through
>>>>> handles (capabilities). When loading modules (executables and DSOs), our
>>>>> dynamic linker creates a new subregion for each one, mmaps individual
>>>>> segments into those subregions and then closes all subregion handles. That
>>>>> doesn't unmap those files, but it means that the address space layout can
>>>>> never be modified because there's no way to recover those handles. That's a
>>>>> nice property because it means that all module mapping is immutable (e.g.
>>>>> it's impossible to make any of the executable segments writable or even
>>>>> readable if you use execute-only memory on AArch64). However, it also means
>>>>> that we cannot mmap anything over those segments as would be needed for the
>>>>> continuous mode. The only way to make this work is to avoid closing the
>>>>> subregion handles and keep those for later use, but that would also make
>>>>> our dynamic linker a potential attack vector.
>>>>>
>>>>> More generally, relying on the overmap behavior means tying the
>>>>> implementation to the OS-specific API which is available on Linux and
>>>>> Darwin, but may not be available on other OSes (e.g. I'm not even sure if
>>>>> this is possible on Windows). It'd be nice to implement a solution that
>>>>> could be made to work on all OSes.
>>>>>
>>>>>
>>>>> Ok, I'm convinced that the reloc mode is the more portable approach.
>>>>> Per David's comments (the one about not actually emitting `__llvm_prf_cnts`
>>>>> in the binary //especially//, although I haven't quoted it here), reloc
>>>>> mode can probably be implemented reasonably efficiently on Linux/Darwin.
>>>>> The runtime can inspect `__llvm_prf_data` to determine the number of
>>>>> counters, mmap() a sufficient amount of memory, and then update the
>>>>> CounterPtr fields of all the data entries. This means the instrumentation
>>>>> would need to look like: https://godbolt.org/z/-PvObE.
>>>>>
>>>>>
>>>> incr3 version looks great.
>>>>
>>>>
>>>>> @Petr I'd be happy to work together on this. Let me know if you'd like
>>>>> any help prototyping/testing, perhaps we can split up some of the work.
>>>>>
>>>>> One thing we haven't really resolved is the conversation about not
>>>>> changing the raw profile format. Fwiw, I don't think avoiding changes to
>>>>> the raw profile format should be a design goal, as the point of having
>>>>> separate raw/indexed formats is that the raw format should be cheap to rev.
>>>>> I think that I've demonstrated that the file size overhead from adding
>>>>> section alignment padding is small: I suspect that getting rid of this
>>>>> would introduce a considerable amount of complexity to any kind of
>>>>> mmap()-mode implementation.
>>>>>
>>>>>
>>>> if there is a need to change raw format to make things more efficient,
>>>> go for it.
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>> Anyway, I'm excited to see your results. Please keep me in the loop!
>>>>>
>>>>> thanks,
>>>>> vedant
>>>>>
>>>>>
>>>>> 4) The continuous method has the advantage of being more efficient in
>>>>>>> running speed.
>>>>>>>
>>>>>>>
>>>>>>> As pointed out above, the reduced memory overhead is another
>>>>>>> (perhaps more significant) factor.
>>>>>>>
>>>>>>>
>>>>>>> Also, I believe both methods should support on-line merging -- as
>>>>>>> the profile update happens 'In-place' -- the online merging code can be
>>>>>>> simplified (no locking, reading, updating are needed).
>>>>>>>
>>>>>>>
>>>>>>> Thanks for pointing this out. In my last experiment, I added a
>>>>>>> `while (true) { print-counters; sleep(...) }` loop to
>>>>>>> `darwin-proof-of-concept.c` after it set up the mmap(), and then edited the
>>>>>>> counter section of the profile using a hex editor. The updates did not
>>>>>>> appear in the `proof-of-concept` process. But perhaps there is a more
>>>>>>> sophisticated way to share the physical pages backing `__llvm_prf_cnts`
>>>>>>> between processes, while they are also mapped onto the filesystem. I need
>>>>>>> to run more experiments to understand this.
>>>>>>>
>>>>>>> vedant
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via
>>>>>>> llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>> vsk added a comment.
>>>>>>>>
>>>>>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>,
>>>>>>>> @phosek wrote:
>>>>>>>>
>>>>>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>,
>>>>>>>> @davidxl wrote:
>>>>>>>> >
>>>>>>>> > > +petr who has similar needs for Fuchia platform
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Thank you for including me, we have exactly the same use case in
>>>>>>>> Fuchsia and this is the solution I've initially considered and been
>>>>>>>> experimenting with locally based on the discussion with @davidxl. However,
>>>>>>>> after internal discussion we've decided against this solution for two
>>>>>>>> reasons:
>>>>>>>> >
>>>>>>>> > 1. This requires the ability to mmap the output file over the
>>>>>>>> (portion of) binary which is something we don't allow in Fuchsia for
>>>>>>>> security reasons; once all modules have been mapped in by the dynamic
>>>>>>>> linker, we don't allow any further changes to their mapping and using this
>>>>>>>> solution would require special dynamic linker which is possible but highly
>>>>>>>> undesirable.
>>>>>>>> > 2. It introduces bloat to both the binary and the output file. It
>>>>>>>> also complicates the runtime implementation due to alignment and padding
>>>>>>>> requirements.
>>>>>>>> >
>>>>>>>> >   The alternative solution we've came up and that I've now been
>>>>>>>> implementing is to change the instrumentation to allow extra level of
>>>>>>>> indirection. Concretely, the instrumentation conceptually does the
>>>>>>>> following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to
>>>>>>>> change this `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) =
>>>>>>>> c` where `bias` is a global variable set by the runtime to be the offset
>>>>>>>> between the `__llvm_prf_cnts` section and the corresponding location in the
>>>>>>>> file that's mmapped in. Initially, that offset would be `0` which would
>>>>>>>> result in exactly the same behavior as today, but the runtime can mmap the
>>>>>>>> output file into address space and then change the offset to make the
>>>>>>>> counters be continuously updated.
>>>>>>>> >
>>>>>>>> >   The advantage of this solution is that there are no changes
>>>>>>>> needed to the profile format. It also doesn't require mmapping the output
>>>>>>>> file over the binary, the output file can be mmapped anywhere in the
>>>>>>>> address space. The disadvantage is extra overhead since instrumentation is
>>>>>>>> going to be slightly more complicated, although we don't have any numbers
>>>>>>>> yet to quantify how much slower it's going to be. The implementation should
>>>>>>>> be fairly minimal and my tentative plan was to gate it on compiler switch,
>>>>>>>> so it wouldn't affect existing in any way (modulo introducing one new
>>>>>>>> variable in the runtime to hold the bias). I'm hoping to have the prototype
>>>>>>>> ready and uploaded for review within the next few days. What's your opinion
>>>>>>>> on this idea? Would this be something that you'd be interested in as an
>>>>>>>> alternative approach?
>>>>>>>>
>>>>>>>>
>>>>>>>> @phosek thank you for sharing this alternative. I hope you don't
>>>>>>>> mind my calling this the bias method, after the term in the proposed
>>>>>>>> instrumentation.
>>>>>>>>
>>>>>>>> The TLDR is that I don't think the bias method is a good fit for
>>>>>>>> Linux/Darwin. Imho, this method won't reduce the complexity of continuous
>>>>>>>> mode, and its instrumentation overhead is likely to outweigh the savings
>>>>>>>> from reduced section alignment. Let me try and justify these claims :).
>>>>>>>>
>>>>>>>> First and most critically, note that (on Linux & Darwin, at least),
>>>>>>>> mmap() requires that the file `offset` argument be page-aligned. So, I
>>>>>>>> don't believe there's a way to avoid changing the raw profile format, or to
>>>>>>>> avoid the concomitant complexity necessary to calculate padding bytes in
>>>>>>>> the runtime. I don't see how the bias method solves this problem: this
>>>>>>>> seems to be a fundamental limitation of mmap().
>>>>>>>>
>>>>>>>> Second, note that the size overhead of counter increment
>>>>>>>> instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}`
>>>>>>>> for all but the smallest of programs. Assuming 16K pages, in the worst
>>>>>>>> case, the section alignment costs 32K per image. Assuming an increment can
>>>>>>>> be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section
>>>>>>>> alignment cost is equivalent to ~4700 increments. For comparison, this is
>>>>>>>> roughly the number of counter increments in //FileCheck.cpp.o//. If the
>>>>>>>> size overhead of counter instrumentation were to double, as I believe it
>>>>>>>> would with the bias method, it would rapidly erase the memory savings from
>>>>>>>> eliminating the section alignment requirement.
>>>>>>>>
>>>>>>>> Third, I'm not sure I understand the security constraints in
>>>>>>>> Fuchsia that make it undesirable to map the contents of a section onto a
>>>>>>>> file. Is the concern that an attacker process may update the file, thereby
>>>>>>>> changing the in-memory contents of the section? I'll note that on Darwin,
>>>>>>>> at least, the kernel does not permit this kind of sharing. If an attacker
>>>>>>>> process modifies a file on disk that has the in-memory contents of a
>>>>>>>> section MAP_SHARED over it, the in-memory contents of the mapped section
>>>>>>>> are not changed (I verified this by making a small modification to
>>>>>>>> `darwin-proof-of-concept.c`). If there is some security aspect to the
>>>>>>>> problem I'm missing here, could you please elaborate on it? I'm hoping that
>>>>>>>> the bias method will not be necessary to support continuous mode on
>>>>>>>> Fuchsia, but it seems this depends on the answer to the security question.
>>>>>>>> Either way, istm that the high level approach taken in this patch
>>>>>>>> as-written is a better fit for Linux/Darwin, and moreover probably provides
>>>>>>>> support that can be used to implement the bias method.
>>>>>>>>
>>>>>>>> I appreciate your feedback and think we can definitely work
>>>>>>>> together to build some kind of common solution!
>>>>>>>>
>>>>>>>>
>>>>>>>> CHANGES SINCE LAST ACTION
>>>>>>>>   https://reviews.llvm.org/D68351/new/
>>>>>>>>
>>>>>>>> https://reviews.llvm.org/D68351
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> llvm-commits mailing list
>>>>>>>> llvm-commits at lists.llvm.org
>>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191031/b3d3754a/attachment.html>