[PATCH] D68351: [profile] Add a mode to continuously sync counter updates to a file

Vedant Kumar via llvm-commits llvm-commits at lists.llvm.org
Fri Nov 1 09:50:23 PDT 2019



> On Oct 31, 2019, at 11:18 PM, Xinliang David Li <davidxl at google.com> wrote:
> 
> My understanding is that the additional instrumentation runtime overhead is not something iOS can take (as this will be the default mode). It is a little unfortunate that we need to have two implementations basically doing the same thing. Hopefully we can make some unification in the future.
> 
> On Thu, Oct 31, 2019 at 8:39 PM Petr Hosek <phosek at google.com> wrote:
> I've uploaded the change that implements the dynamically allocated counters as https://reviews.llvm.org/D69700. I still need to rebase it on top of Vedant's change that landed today.
> 
> The question is whether this implementation is something we want? I was hoping that this variant would satisfy both the requirements of iOS and Fuchsia, but Vedant has already landed the implementation that uses padding which isn't usable for us. On the other hand, for us the version with relocated counters (what I originally described as "bias") is better since it can be used even for low-level components (i.e. before the runtime has been loaded). I'll try to upload the patch that implements the relocated counters tomorrow.

I would really like to see D69700 in tree. We'd like to use the section mmap() approach from D68351 on Darwin, but it would be great to have a more portable solution.

The amount of padding introduced by D68351 can be set to 0 when the whole profile is mmap'd, to avoid any incompatibility on Fuchsia. However, if it's simpler to always mmap() the whole profile on every platform, I'm fine with that (the only cost should be virtual address space?), and we can delete the two padding fields in the profile header.

vedant

> 
> On Tue, Oct 15, 2019 at 2:46 PM Vedant Kumar <vedant_kumar at apple.com> wrote:
> 
> 
>> On Oct 14, 2019, at 3:16 PM, Petr Hosek <phosek at google.com> wrote:
>> 
>> I've prototyped both approaches: relocated counters and dynamically allocated counters (the most recent proposal), to see how they'd perform. I've used both prototypes to build Clang and then used the instrumented Clang to compile zstd_opt.c <https://github.com/facebook/zstd/blob/dev/lib/compress/zstd_opt.c> which is one of the slowest files to compile in our build (the reported times are median from several runs).
>> 
>> Uninstrumented Clang: 112M, 11.013s
>> Existing instrumentation (statically allocated counters): 1247M, 34.626s
>> Dynamically allocated counters: 1176M, 48.43s
>> Relocated counters: 1292M, 47.847s
>> 
>> I'm not sure why the instrumented Clang binary (in all cases) tends to get so big. Originally I suspected it might be because the instrumentation is incompatible with --gc-sections, so I disabled that option for the uninstrumented Clang, but the difference is still roughly 10x. Do you have any insight on this?
> 
> Coverage mapping data grows in a super-linear way with the number of translation units, due to definitions #included through headers getting duplicate mappings. This is PR34533, which I haven't had time to pick back up, unfortunately.
> 
> 
>> In terms of performance, both approaches are similar, and roughly 40% slower than the existing instrumentation. In terms of size, as expected, the dynamically allocated counters variant is smaller (by about 5.7%) because we can omit the counters section; the relocated counters version is larger (by about 3.6%) because of the extra instrumentation, although this instrumentation doesn't seem to affect performance as much as I expected.
>> 
>> In terms of pros and cons:
>> In the case of dynamically allocated counters, aside from the issue mentioned before where we need to ensure that the runtime runs before we start executing any instrumented code (this actually came up in my testing and I had to annotate Runtime Registration with __attribute__((init_priority(101)))), there's also the requirement that __llvm_prf_data has to be writable. That's the case today, but I've separately been working on a patch to use relative relocations, which would avoid that requirement, making it possible to move __llvm_prf_data into rodata and deduplicate memory across processes (this would be useful when collecting profiles from shared libraries that are used by multiple processes); with this solution that won't be possible.
>> In case of relocated counters, there's the extra overhead because of the increased binary size and having two copies of counters in memory (although it might be possible to deduplicate one of those for shared libraries across processes, see above).
>> Either of these solutions should work for us (the dynamically allocated counters one might require a bit of extra work to annotate the early path in our libc and dynamic linker). Do you have any preference? Once we agree, I'll try to clean up the implementation a bit and upload it for review.
> 
> Thanks for doing this comparison. Based on your results, I think we should support both the current counter instrumentation and the proposed relocated counters method. It looks like there's a significant performance/portability tradeoff that makes both necessary.
> 
> Do you see any blockers that make maintaining both instrumentation modes infeasible?
> 
> vedant
> 
> 
>> 
>> On Mon, Oct 7, 2019 at 10:56 AM Vedant Kumar <vedant_kumar at apple.com> wrote:
>> 
>> 
>>> On Oct 5, 2019, at 1:03 PM, Petr Hosek <phosek at google.com> wrote:
>>> 
>>> I've considered omitting the `__llvm_prf_cnts` section; the only downside of doing so is that you need to make sure that the space for counters is allocated before you execute any instrumented code. That makes it tricky if you want to instrument libc, which is something we do today.
>>> 
>>> We can work around it, though: a simple solution is to avoid collecting profiles for libc; a better solution would be to carefully annotate all libc functions that are executed before constructors to ensure that they aren't instrumented, e.g. by introducing a no_profile attribute. This is the same approach we use for ASan, annotating all functions on the early startup path with __attribute__((no_sanitize("address"))).
>> 
>> I think the `no_profile` approach makes sense. One downside is that in non-continuous mode, annotated libc functions will not be profiled. Some alternatives are 1) having a more specific `no_profile_in_continuous_mode` annotation, or 2) to not instrument libc in continuous mode and rely on libc unittests which run in non-continuous mode only. It’s probably a bit easier to support value profiling with (2), if that’s a concern.
>> 
>> vedant
>> 
>>> 
>>> On Fri, Oct 4, 2019 at 4:43 PM Xinliang David Li <davidxl at google.com> wrote:
>>> 
>>> 
>>> On Fri, Oct 4, 2019 at 4:35 PM <vsk at apple.com> wrote:
>>> 
>>> 
>>>> On Oct 4, 2019, at 3:43 PM, Petr Hosek <phosek at google.com> wrote:
>>>> 
>>>> On Fri, Oct 4, 2019 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:
>>>> I will let Petr chime in explaining more details -- the reply I provided are based on my understanding of 'Reloc' proposal, which may be quite off.
>>>> 
>>>> David
>>>> 
>>>> On Fri, Oct 4, 2019 at 1:28 PM <vsk at apple.com> wrote:
>>>> 
>>>> 
>>>>> On Oct 4, 2019, at 10:30 AM, Xinliang David Li <xinliangli at gmail.com> wrote:
>>>>> 
>>>>> Petr's method is more like 'relocating' the counter memory from statically allocated memory to mmapped memory, so it may be better named the 'reloc' method.
>>>> 
>>>> Thank you, that's a much better name.
>>>> 
>>>> 
>>>>> Now let's do some comparison:
>>>>> 
>>>>> 1) reloc method allows runtime to mmap from the start of the raw profile, so there is no need to page align the counter data in the file;
>>>> 
>>>> Let me see if I follow. Is the idea to allocate a buffer large enough to contain the entire contiguous raw profile, in each instrumented image? I can see how it would be possible to mmap() a single such buffer onto a raw profile without any extra padding or format changes. However, this buffer allocation entails doubling the memory overhead of the `__llvm_prf_{cnts,data,names}` sections, as these would all need to be copied into the buffer-to-be-mmap'd. This may not work well on iOS, as the devices we support are quite memory-constrained. The per-process memory limit on iOS is ~1.4GB on many devices -- clients like News.app or Music.app already get very close to this limit.
>>>> 
>>>> Also, with more than one instrumented image (DSOs), I believe page-alignment padding is still necessary so that each in-memory raw profile buffer can be mmap()'d onto a page-aligned file offset.
>>>> 
>>>> Perhaps that is a non-issue, as it can be solved by simply creating N different raw profiles -- one per instrumented image. Actually this should work for both proposed schemes.
>>>> 
>>>> David's understanding is correct, my plan is to mmap the entire raw profile from the start, which means that there's no requirement for page alignment. An alternative would be to mmap only the page-aligned portion of the raw file that contains the counters, i.e. we would mmap [__llvm_prf_cnts & -PAGE_SIZE, (__llvm_prf_cnts + __llvm_prf_cnts_length + PAGE_SIZE - 1) & -PAGE_SIZE). There's no need for buffer allocation, we can simply mmap the file directly as a read/writable region.
>>>>> 2) reloc method does have more instrumentation overhead, however it is not as large as we think.
>>>>>    a) the bias load can be done at the start of the function
>>>>>    b) on x86, the addressing mode kicks in, there is actually no size difference in terms of updating:
>>>>>         addq    $1, counter+16(%rip)
>>>>>    vs
>>>>>        addq    $1, counter+16(%rax)
>>>>> 
>>>>>   both are 7 bytes long.
>>>>> 
>>>>> so the overhead will be a few bytes more per function.
>>>> 
>>>> I see, thanks for this correction!
>>>> 
>>>> Thanks for the suggestion David! The prototype implementation I have loads the bias/reloc on every counter update but loading the bias/reloc at the start of the function is going to be more efficient. I'll update the implementation and try to get some numbers by compiling instrumented Clang.
>>>>> 3) reloc method produces smaller sized raw profile
>>>> 
>>>> I'm interested in understanding how major of an issue this is for Google's use cases. On iOS/macOS, the worst case overhead of section alignment padding is < 32K/8K respectively per image (due to 16K/4K pages). What's the expected file size overhead for Google? Is a much larger page size in use? If so, is switching to a smaller page size when running instrumented binaries an option?
>>>> 
>>>> In addition to understanding this, per my last comment to Petr, I would also like to understand the specific security concerns w.r.t mapping a section onto a file. If this is something Darwin should not allow, I would like to let our kernel team know!
>>>> 
>>>> It's not a specific security concern, rather it's a consequence of Fuchsia's design/capability model. In Fuchsia, the address space (sub)regions are objects which are accessed and "operated on" through handles (capabilities). When loading modules (executables and DSOs), our dynamic linker creates a new subregion for each one, mmaps individual segments into those subregions and then closes all subregion handles. That doesn't unmap those files, but it means that the address space layout can never be modified because there's no way to recover those handles. That's a nice property because it means that all module mapping is immutable (e.g. it's impossible to make any of the executable segments writable or even readable if you use execute-only memory on AArch64). However, it also means that we cannot mmap anything over those segments as would be needed for the continuous mode. The only way to make this work is to avoid closing the subregion handles and keep those for later use, but that would also make our dynamic linker a potential attack vector.
>>>> 
>>>> More generally, relying on the overmap behavior means tying the implementation to the OS-specific API which is available on Linux and Darwin, but may not be available on other OSes (e.g. I'm not even sure if this is possible on Windows). It'd be nice to implement a solution that could be made to work on all OSes.
>>> 
>>> Ok, I'm convinced that the reloc mode is the more portable approach. Per David's comments (the one about not actually emitting `__llvm_prf_cnts` in the binary //especially//, although I haven't quoted it here), reloc mode can probably be implemented reasonably efficiently on Linux/Darwin. The runtime can inspect `__llvm_prf_data` to determine the number of counters, mmap() a sufficient amount of memory, and then update the CounterPtr fields of all the data entries. This means the instrumentation would need to look like: https://godbolt.org/z/-PvObE.
>>> 
>>> 
>>> incr3 version looks great. 
>>>  
>>> @Petr I'd be happy to work together on this. Let me know if you'd like any help prototyping/testing, perhaps we can split up some of the work.
>>> 
>>> One thing we haven't really resolved is the conversation about not changing the raw profile format. Fwiw, I don't think avoiding changes to the raw profile format should be a design goal, as the point of having separate raw/indexed formats is that the raw format should be cheap to rev. I think that I've demonstrated that the file size overhead from adding section alignment padding is small: I suspect that getting rid of this would introduce a considerable amount of complexity to any kind of mmap()-mode implementation.
>>> 
>>> 
>>> if there is a need to change raw format to make things more efficient, go for it.
>>> 
>>> David
>>> 
>>>  
>>> Anyway, I'm excited to see your results. Please keep me in the loop!
>>> 
>>> thanks,
>>> vedant
>>> 
>>> 
>>>>> 4) The continuous method has the advantage of being more efficient in running speed.
>>>> 
>>>> As pointed out above, the reduced memory overhead is another (perhaps more significant) factor.
>>>> 
>>>> 
>>>>> Also, I believe both methods should support on-line merging -- as the profile update happens in-place -- the online merging code can be simplified (no locking, reading, or updating is needed).
>>>> 
>>>> Thanks for pointing this out. In my last experiment, I added a `while (true) { print-counters; sleep(...) }` loop to `darwin-proof-of-concept.c` after it set up the mmap(), and then edited the counter section of the profile using a hex editor. The updates did not appear in the `proof-of-concept` process. But perhaps there is a more sophisticated way to share the physical pages backing `__llvm_prf_cnts` between processes, while they are also mapped onto the filesystem. I need to run more experiments to understand this. 
>>>> 
>>>> vedant
>>>> 
>>>> 
>>>>> 
>>>>> David
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Oct 3, 2019 at 4:18 PM Vedant Kumar via Phabricator via llvm-commits <llvm-commits at lists.llvm.org> wrote:
>>>>> vsk added a comment.
>>>>> 
>>>>> In D68351#1693693 <https://reviews.llvm.org/D68351#1693693>, @phosek wrote:
>>>>> 
>>>>> > In D68351#1693307 <https://reviews.llvm.org/D68351#1693307>, @davidxl wrote:
>>>>> >
>>>>> > > +petr who has similar needs for Fuchia platform
>>>>> >
>>>>> >
>>>>> > Thank you for including me, we have exactly the same use case in Fuchsia and this is the solution I've initially considered and been experimenting with locally based on the discussion with @davidxl. However, after internal discussion we've decided against this solution for two reasons:
>>>>> >
>>>>> > 1. This requires the ability to mmap the output file over the (portion of) binary which is something we don't allow in Fuchsia for security reasons; once all modules have been mapped in by the dynamic linker, we don't allow any further changes to their mapping and using this solution would require special dynamic linker which is possible but highly undesirable.
>>>>> > 2. It introduces bloat to both the binary and the output file. It also complicates the runtime implementation due to alignment and padding requirements.
>>>>> >
>>>>> >   The alternative solution we've come up with, and that I've now been implementing, is to change the instrumentation to allow an extra level of indirection. Concretely, the instrumentation conceptually does the following: `c = __profc_foo[idx]; c++; __profc_foo[idx] = c`. We'd like to change this to `c = *(&__profc[idx] + *bias); c++; *(&__profc[idx] + *bias) = c` where `bias` is a global variable set by the runtime to be the offset between the `__llvm_prf_cnts` section and the corresponding location in the file that's mmapped in. Initially, that offset would be `0`, which would result in exactly the same behavior as today, but the runtime can mmap the output file into the address space and then change the offset to make the counters be continuously updated.
>>>>> >
>>>>> >   The advantage of this solution is that there are no changes needed to the profile format. It also doesn't require mmapping the output file over the binary; the output file can be mmapped anywhere in the address space. The disadvantage is extra overhead, since instrumentation is going to be slightly more complicated, although we don't have any numbers yet to quantify how much slower it's going to be. The implementation should be fairly minimal and my tentative plan was to gate it on a compiler switch, so it wouldn't affect existing behavior in any way (modulo introducing one new variable in the runtime to hold the bias). I'm hoping to have the prototype ready and uploaded for review within the next few days. What's your opinion on this idea? Would this be something that you'd be interested in as an alternative approach?
>>>>> 
>>>>> 
>>>>> @phosek thank you for sharing this alternative. I hope you don't mind my calling this the bias method, after the term in the proposed instrumentation.
>>>>> 
>>>>> The TLDR is that I don't think the bias method is a good fit for Linux/Darwin. Imho, this method won't reduce the complexity of continuous mode, and its instrumentation overhead is likely to outweigh the savings from reduced section alignment. Let me try and justify these claims :).
>>>>> 
>>>>> First and most critically, note that (on Linux & Darwin, at least), mmap() requires that the file `offset` argument be page-aligned. So, I don't believe there's a way to avoid changing the raw profile format, or to avoid the concomitant complexity necessary to calculate padding bytes in the runtime. I don't see how the bias method solves this problem: this seems to be a fundamental limitation of mmap().
>>>>> 
>>>>> Second, note that the size overhead of counter increment instrumentation dwarfs the alignment overhead for `__llvm_prf_{cnts,data}` for all but the smallest of programs. Assuming 16K pages, in the worst case, the section alignment costs 32K per image. Assuming an increment can be encoded in 7 bytes as `incq l___profc_main(%rip)`, this section alignment cost is equivalent to ~4700 increments. For comparison, this is roughly the number of counter increments in //FileCheck.cpp.o//. If the size overhead of counter instrumentation were to double, as I believe it would with the bias method, it would rapidly erase the memory savings from eliminating the section alignment requirement.
>>>>> 
>>>>> Third, I'm not sure I understand the security constraints in Fuchsia that make it undesirable to map the contents of a section onto a file. Is the concern that an attacker process may update the file, thereby changing the in-memory contents of the section? I'll note that on Darwin, at least, the kernel does not permit this kind of sharing. If an attacker process modifies a file on disk that has the in-memory contents of a section MAP_SHARED over it, the in-memory contents of the mapped section are not changed (I verified this by making a small modification to `darwin-proof-of-concept.c`). If there is some security aspect to the problem I'm missing here, could you please elaborate on it? I'm hoping that the bias method will not be necessary to support continuous mode on Fuchsia, but it seems this depends on the answer to the security question. Either way, istm that the high level approach taken in this patch as-written is a better fit for Linux/Darwin, and moreover probably provides support that can be used to implement the bias method.
>>>>> 
>>>>> I appreciate your feedback and think we can definitely work together to build some kind of common solution!
>>>>> 
>>>>> 
>>>>> CHANGES SINCE LAST ACTION
>>>>>   https://reviews.llvm.org/D68351/new/
>>>>> 
>>>>> https://reviews.llvm.org/D68351
>>>>> 
>>>>> 
>>>>> 
>> 
> 
