<div dir="ltr"><br><br>On Thu, Sep 10, 2015 at 4:08 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>><br>><br>> On Wed, Sep 9, 2015 at 4:29 PM, Xinliang David Li <<a href="mailto:davidxl@google.com">davidxl@google.com</a>><br>> wrote:<br>>><br>>> On Wed, Sep 9, 2015 at 1:16 AM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>> wrote:<br>>> ><br>>> ><br>>> > On Tue, Sep 8, 2015 at 10:34 PM, Xinliang David Li <<a href="mailto:davidxl@google.com">davidxl@google.com</a>><br>>> > wrote:<br>>> >><br>>> >> On Tue, Sep 8, 2015 at 10:00 PM, Sean Silva <<a href="mailto:chisophugis@gmail.com">chisophugis@gmail.com</a>><br>>> >> wrote:<br>>> >> > silvas added a comment.<br>>> >> ><br>>> >> > Fundamentally, this approach is not a long-term solution to reducing<br>>> >> > PGO<br>>> >> > data size. The correct long-term solution for reducing PGO data size<br>>> >> > requires passing the binary to llvm-profdata.<br>>> >><br>>> >> This basically goes back to my original proposed solution.  Whether<br>>> >> this solution is longer term or not depends on whether it is<br>>> >> sufficiently effective in reducing size -- the data we have now<br>>> >> confirm that it is indeed effective.<br>>> ><br>>> ><br>>> > If we have the binary, the size of the names should not matter because<br>>> > they<br>>> > should not be present in the runtime memory image or in the raw profile<br>>> > data.<br>>><br>>><br>>> If we can tolerate the inconvenience of passing binary to<br>>> llvm-profdata,<br>><br>><br>> Quick aside: for compatibility reasons (and I think for general convenience<br>> reasons), this would need to be opt-in (we can consider breaking<br>> compatibility, but for a user-visible frontend feature this is likely a long<br>> discussion and it is easier to just provide compatibility). But otherwise, I<br>> think requiring the binary is clearly the right approach.<br>>  <br>>><br>>> I think there is a very simple and nice way of doing<br>>> this.<br>>><br>>> Currently LLVM emits symbols like __llvm_profile_name_foo with<br>>> contents 'foo', this is basically a waste -- the string of 'foo' is<br>>> actually a substring of the symbol name itself.<br>>><br>>> In other words, we can choose to continue emit __llvm_profile_name_xxx<br>>> symbols as before, but with zero size.  llvm-profdata/llvm-cov tool<br>>> can read the symtab and reconstruct the md5hash values for them.<br>>><br>>> With this change,<br>>><br>>> 1) no new option is needed<br>>> 2) binary size, raw/indexed profile size will be minimized<br>>> consistently (even when coverage mapping is on).<br>><br>><br>> Could you explain how this affects the 4 different sizes I lay out below?<br>> It's not clear how saving a string by pointing into the symbol name affects<br>> 2. or 3. below.<br>><br>> I think we should make clear the different "sizes", which may matter<br>> differently for different use cases:<br>> 1. memory overhead in the program image. Ideal case: just the raw counter<br>> values (+ relevant text size increase)<div><br></div><div>raw counter itself is not enough. All implementation needs a per-function control data __llvm_profile_data_<func_name> that stores the key or reference to key, structure/cfg hash, and number of counts in the counter variable. It also has a pointer/reference to the counter variable itself. To save 8 byte per function, we can choose to embed the counter variable into llvm_profile_data_xx variable to make it variable length, but that is an orthogonal change.</div><div><br>> 2. size overhead in the raw profile. Ideal case: just the raw counter<br>> values.<br>> 3. size overhead in the indexed profile. No clear ideal case.<br>> 4. raw binary size of the program's ELF file. No clear ideal case.<br>> (did I miss any?)<div><br></div><div>I would split 4 into </div><div>4.1 unstripped binary size</div><div>4.2 stripped binary size</div><div><br><div><br></div><div><font face="monospace, monospace">Let's compare three cases</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace">A) The existing implementation</font></div><div><font face="monospace">B) The implementation using 64bit md5 integer as func-key (my originally proposed approach) -- which is the ideal solution I mentioned above</font></div><div><font face="monospace">C) The implementation using 16 byte md5 hash str as func key -- which is what this patch implements.</font></div><div><font face="monospace"><br></font></div><div><font face="monospace">The following summarizes the per-function overhead (assuming average func name size is 82.</font></div><div><font face="monospace"><br></font></div><div><font face="monospace"><br></font></div><div><font face="monospace">       (1)     (2)     (3)       (4.1)          (4.2)</font></div><div><font face="monospace">------------------------------------------------------------</font></div><div><font face="monospace"> A      82      82      82     4*82+3*40+59        82</font></div><div><font face="monospace"> B      0       0       0      3*82+3*40+59        0</font></div><div><font face="monospace"> C      16      16      16     16+3*82+3*40+59     16</font></div><div><br></div><div>A little explanation of the size overhead of 4.1. It includes symbol names for __llvm_profile_data_xxx, __llvm_profile_counters_xxx, and __llvm_profile_name_xxx. The value 59 comes from the name prefixes. It also assumes Elf64 symtab entry overhead of 40 bytes.</div><div><br></div><div>We can see B) is the clear winner in all categories.  There are other techniques to reduce overhead in 4.1</div><div>a) more aggressively making those symbols with private linkage so that linker can merge them. (with B, we do need to keep __llvm_profile_name_xxx symbol non-local, but other symbols can be privatized)</div><div>b) embed llvm_profile_counter into llvm_profile_data so that we don't need to create symbols for them. This also helps all other categories ( negative 8 byte overhead per func)</div><div>c) make prefixes of those symbols shorter.</div><div><br></div><div>Unstripped binary size is probably not too interesting to optimize for, but it may cause linker to reach limit (especially when -g is also specified).</div><div><br></div><div>><br>> For me, only 1. is of primary importance. 2. and 3. are not relevant for<br>> programs of the size I have (~ size of Clang) and my users' setups (an<br>> entire host machine is available to handle the data, and only a single<br>> profile is written at a given time). 4. is irrelevant since large binaries<br>> must already be handled anyway for debug info.</div><div>><br></div><div>> For an embedded use case 1. and 2. are important but 3,4 are irrelevant<br>> since that happens on the host (assume host has much more resources than<br>> target).</div><div><br></div><div>4) can prevent the binary from being installable on the device due to the size limit.</div><div><br>><br>> It sounds like for your use case, 2,3,4 are the primary concerns. Is this<br>> correct?</div><div><br></div><div>All of them (we care about large servers and small devices).</div><div><br></div><div>thanks,</div><div><br></div><div>David</div><div><br>><br>> -- Sean Silva<br>><br>>  <br>>><br>>><br>>> >Do you know how much improvement there would be over the current patch<br>>> > if we could completely eliminate the names from the memory image and<br>>> > from<br>>> > the raw profile?<br>>><br>>> There are 16 byte overhead per function for emitting the name. For<br>>> clang, the overhead of the additional rodata is about 6% of the total<br>>> text size.<br>>><br>>> Note that if we don't use strings (name or md5string) but just 64bit<br>>> md5hash as the key (as done by my original patch), the perf-function<br>>> overhead for the keys  is zero -- because the Name reference field in<br>>> llvm_profile_data_* structure can be reused.<br>>><br>>> > I.e., how large are the just the counters by themselves<br>>> > compared to the current raw profile data? Quickly looking at some source<br>>> > code (but not testing anything) suggests that there is one `struct<br>>> > __llvm_profile_data` per counter. Depending on the pointer size, this<br>>> > struct<br>>> > is 3-4x as large as the counter itself.<br>>><br>>> _llvm_profile_data_*** is  one per function. The size of the<br>>> per-function data is 32 bytes (64bit arch), so __llvm_prf_data section<br>>> is larger than __llvm_prf_name.  So the largest is the counter section<br>>> when this patch is applied. Without this patch, the largest is<br>>> __llvm_prf_name (way larger).<br>>><br>>><br>>> ><br>>> > One other thing that bothers me is that (while I don't use llvm-cov)<br>>> > this<br>>> > size reduction in this patch is incompatible with llvm-cov.<br>>><br>>> yes -- but they can be made compatible if we are ok with more drastic<br>>> change.<br>>><br>>> ><br>>> > My thinking is that if we want to provide a feature for reducing the<br>>> > overhead in the runtime memory image or in the profraw, it should:<br>>> > - be fully general (e.g. cover the llvm-cov use case)<br>>> > - have nearly optimal size (3-4x overhead is not acceptable)<br>>> > - be completely transparent to the user, except for the fact that the<br>>> > binary<br>>> > must be passed to llvm-profdata (but not to clang; llvm-profdata should<br>>> > extract all necessary information from the binary)<br>>> ><br>>> > I think this is achievable and not hugely more complicated than the<br>>> > current<br>>> > patch.<br>>> ><br>>><br>>> agreed (other than the 3-4x overhead statement above).<br>>><br>>> > When it comes to adding a feature to clang that we must maintain<br>>> > forever, it<br>>> > makes more sense to have it be something like<br>>> > `-fprofile-instr-reduced-image-size` which can have a clear contract and<br>>> > the<br>>> > implementation of which we can continually improve (or make optimal).<br>>> > For<br>>> > this, the transparency is important (e.g. output of llvm-profdata should<br>>> > be<br>>> > the same). The current patch is highly tied to the precise<br>>> > implementation of<br>>> > what we are doing to reduce the size, and does not leave margin for<br>>> > improvement or for changing the underlying approach to cover new use<br>>> > cases<br>>> > (see below).<br>>> ><br>>> > What do you think?<br>>><br>>> See above, we can even do without the option and make every use case<br>>> happy.<br>>><br>>> ><br>>> > Out of curiosity, could you post the patch for your original approach<br>>> > that<br>>> > required the binary for llvm-profdata? Could it potentially work as an<br>>> > initial approach to "`-fprofile-instr-reduced-image-size`" that we could<br>>> > incrementally build on?<br>>> ><br>>><br>>> I have not added the llvm-profdata debug dump support yet.  If we<br>>> reach an agreement that is a good thing to do (which I agree), I will<br>>> add that submit that patch instead.<br>>><br>>> ><br>>> > The idea is that if a user says that there is too much overhead in the<br>>> > runtime memory image or in the profraw, then we can tell them to use a<br>>> > particular option to improve the situation, with the only caveat being<br>>> > that<br>>> > they must pass the binary to llvm-profdata;<br>>><br>>> I don't have supporting data, but if majority of users do not need<br>>> debug symbol info for llvm-profdata (that often), would it be better<br>>> to require passing binary if per function dump is needed? Post<br>>> proposing script can also be easily developed for that. It also makes<br>>> the use model consistent with llvm-cov.<br>>><br>>> >besides that, there are no<br>>> > strings attached, and no hidden issues for the user (e.g. in an embedded<br>>> > project, the code may be mostly C and the function names may typically<br>>> > be<br>>> > shorter than the hash, so this patch could make things worse). The<br>>> > current<br>>> > patch assumes that function name size is the primary overhead which<br>>> > makes<br>>> > PGO not usable; do we don't have a reason to believe that to be the case<br>>> > in<br>>> > general? I am interested to hear from the other commenter on the thread<br>>> > who<br>>> > mentioned issues with pgo size; still our sample set is very small.<br>>><br>>> For C programs, using name hash string instead of name itself may<br>>> actually increase overhead -- it is probably a good reason that the<br>>> zero overhead approach is preferred.  Here are some data from SPEC<br>>> (average name size):<br>>><br>>> GCC (C): 20<br>>> SJENG (C): 12<br>>> GZIP (C): 11<br>>> BZIP2 (C): 15<br>>> MCF (C) : 15<br>>> LIBQUANTUM (C): 17<br>>> HMMER (C) : 13<br>>> GOBMK (C): 20<br>>><br>>> Xalan(C++) : 61<br>>> DealII (C++): 58<br>>> Clang (C++): 82<br>>><br>>> Google benchmarks<br>>> ><br>>> > It seems premature to think that function name size is the greatest<br>>> > overhead<br>>> > in general since there are only O(#functions) of them, while there are<br>>> > O(#counters) unnecessary instances of `struct __llvm_profile_data` (and<br>>> > that<br>>> > struct is reasonably large compared to typical identifiers, even<br>>> > mangled)<br>>><br>>> llvm_profile_data is also O(#functions).<br>>><br>>> . A<br>>> > quick exploration in Mathematica suggests that for Clang, the bulk of<br>>> > the<br>>> > size overhead of function names is concentrated in the range 50-150<br>>> > characters (see <a href="http://i.imgur.com/en2mpBF.png">http://i.imgur.com/en2mpBF.png</a>; area under the curve of<br>>> > the<br>>> > first plot is total size of identifiers, with horizontal axis being bins<br>>> > by<br>>> > identifier size; second plot is left to right accumulated values of the<br>>> > first plot). Since `struct __llvm_profile_data` is ~25 bytes, this means<br>>> > that we need only assume that functions typically have 2-6 counters for<br>>> > the<br>>> > overhead of the structs to be comparable to the mangled names.<br>>><br>>> llvm_prof_data is 32byte and it is per-function.<br>>><br>>> ave negligible effect<br>>> > on the size as a whole.<br>>> ><br>>> >><br>>> >><br>>> >> >The approach of using name hashes is also not an incremental step<br>>> >> > towards<br>>> >> > the long-term solution. So thinking about it a bit more, I'm on the<br>>> >> > fence<br>>> >> > about whether a quick solution like this one even makes sense to do.<br>>> >><br>>> >> I am on the fence too, but leaning towards having this one as the<br>>> >> longer term solution -- it is cleaner and less disruptive. It does not<br>>> >> require bumping up the version numbers. (On the other hand, my<br>>> >> original patch does require version change)<br>>> >><br>>> >> > I am very apprehensive of adding complexity to make this approach<br>>> >> > work,<br>>> >> > in particular threading information through many layers, as this is<br>>> >> > likely<br>>> >> > to get in the way of future work towards the long-term solution.<br>>> >><br>>> >> I don't see the complexity actually. Other than some refactoring, the<br>>> >> code here is very non-intrusive. I don't think it will get in the way<br>>> >> of 'longer term' solution if it ever became a need to do.<br>>> ><br>>> ><br>>> > My complexity-detector must be tuned differently than yours.<br>>> ><br>>> > For example, -instr-prof-with-namehash introduces a weird coupling<br>>> > between<br>>> > the frontend and the middle-end. It seems to just be used for setting<br>>> > IncludeNamesInProfile, which seems to be completely orthogonal to<br>>> > hashing?<br>>><br>>> This option is only useful for testing purpose (with opt).<br>>><br>>> Anyway, I am fine with going forward with the most aggressive size<br>>> reduction approach that every one can benefit consistently.<br>>><br>>> David<br>>><br>>><br>>><br>>> ><br>>> >><br>>> >><br>>> >> ><br>>> >> > My takeaways from the RFC thread (maybe I am misunderstanding) is<br>>> >> > that<br>>> >> > we are still using the same format just with one string vs. another.<br>>> >><br>>> >> yes that is true -- that is why version number is not bumped up.<br>>> >><br>>> >> > We shouldn't need to rev the version and there shouldn't need to be<br>>> >> > any<br>>> >> > changes in compiler-rt or llvm.<br>>> >><br>>> >> Note that version is not changed. It is just the version field is now<br>>> >> partitioned to allow more information to be passed around. This is a<br>>> >> very useful feature to have.<br>>> >><br>>> >><br>>> >> >Clang can easily diagnose (as an isolated piece of logic) if the user<br>>> >> > did<br>>> >> > the pgo training run with md5 mode enabled but the md5 mode flag was<br>>> >> > not<br>>> >> > passed or vice versa (it is easy to tell if input function names are<br>>> >> > md5<br>>> >> > sums or not as a diagnostic heuristic).<br>>> >><br>>> >> If we want to enforce matching options in profile-gen and profile-use,<br>>> >> the mismatched option and profile data should result in hard errors --<br>>> >> current implementation allows it to be caught as a hard error (with<br>>> >> encoding in the profile data).<br>>> ><br>>> ><br>>> > I don't see in the patch where this error is issued (nor where that<br>>> > error is<br>>> > tested).<br>>> ><br>>> >><br>>> >><br>>> >> >It is also inconsistent for llvm-profdata to be able to be passed<br>>> >> > `-function=foo` for an md5 profile; the user should have to pass<br>>> >> > (truncated)<br>>> >> > md5 of the function name.<br>>> >><br>>> >> This is usability and convenience -- not inconsistency.<br>>> ><br>>> ><br>>> > Fundamentally, trying to paper over the fact that what is in the file is<br>>> > md5<br>>> > sums is wrong. There are many ways that it sneaks through with this<br>>> > approach, and I don't see a fundamental reason to try to add a special<br>>> > case<br>>> > for this one option. If this is needed for convenience, a separate patch<br>>> > can<br>>> > add an option `-md5-function=` which automatically takes the md5 hash of<br>>> > the<br>>> > provided argument to save the trouble.<br>>> ><br>>> > -- Sean Silva<br>>> ><br>>> >><br>>> >><br>>> >> ><br>>> >> ><br>>> >> > ================<br>>> >> > Comment at:<br>>> >> > test/Instrumentation/InstrProfiling/profiling_md5hash_key.ll:5-6<br>>> >> > @@ +4,4 @@<br>>> >> > +<br>>> >> > +@__llvm_profile_name_foo = hidden constant [16 x i8]<br>>> >> > c"5cf8c24cdb18bdac"<br>>> >> > +; CHECK: @__llvm_profile_name_foo = hidden constant [16 x i8]<br>>> >> > c"5cf8c24cdb18bdac", section "__DATA,__llvm_prf_names", align 1<br>>> >> > +@__llvm_profile_name_bar = hidden constant [17 x i8]<br>>> >> > c"e413754a191db537\00"<br>>> >> > ----------------<br>>> >> > This is backward from a standard md5 sum in hex. We should maintain<br>>> >> > the<br>>> >> > invariant that the resulting symbol name is a substring of a standard<br>>> >> > md5<br>>> >> > sum. Otherwise it is very difficult to explain how to get the hash<br>>> >> > for a<br>>> >> > function.<br>>> >> ><br>>> >> > E.g. we should be able to say "the symbol name is effectively<br>>> >> > replaced<br>>> >> > by the first half of its md5 sum in hex":<br>>> >><br>>> >><br>>> >> Good catch. That can be fixed.<br>>> >><br>>> >> thanks,<br>>> >><br>>> >> David<br>>> >><br>>> >> ><br>>> >> > ```<br>>> >> > Sean:~ % echo -n 'foo' | md5<br>>> >> > acbd18db4cc2f85cedef654fccc4a4d8<br>>> >> > Sean:~ % echo -n 'foo' | md5 | head -c 16<br>>> >> > acbd18db4cc2f85c<br>>> >> > ```<br>>> >> ><br>>> >> > The symbol name string should be `acbd18db4cc2f85c`.<br>>> >> ><br>>> >> ><br>>> >> > <a href="http://reviews.llvm.org/D12715">http://reviews.llvm.org/D12715</a><br>>> >> ><br>>> >> ><br>>> >> ><br>>> ><br>>> ><br>><br>><br><br></div></div></div></div>