[PATCH] D29512: [PGO] Directory name stripping in global identifier for static functions

Sean Silva via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 7 19:25:25 PST 2017


On Tue, Feb 7, 2017 at 7:19 PM, Xinliang David Li <davidxl at google.com>
wrote:

>
>
> On Tue, Feb 7, 2017 at 6:57 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Mon, Feb 6, 2017 at 9:54 AM, Rong Xu <xur at google.com> wrote:
>>
>>> When I wrote IRPGO for PGOFuncName, I intentionally chose to use the
>>> source path with directory name. This is different from the Clang PGO where
>>> the base name is used. I believe I mentioned this in the comment or commit
>>> message. This is to avoid the name clashing like this.
>>>
>>> So the change in D22028 was a regression.
>>>
>>> Surely, this will not solve all the name clashing: if the user could
>>> have used the same source name string (with directory) in the build. In
>>> this case, there still could be name clashing. But I don't think it is
>>> common. I haven't encountered one in the real applications. On the other
>>> hand, using the base name only breaks many of out applications.  In
>>> addition, no matter how complicate the scheme is, like Hash, there is
>>> still a chance for the naming clash (unless, of course, using global
>>> module ID in thin-lto or lto).
>>>
>>
>> So from what you're saying it seems like we already have a global module
>> ID that could be used to disambiguate static functions? Why don't we use
>> that?
>>
>
> It is using the same scheme as PGO.
>

I thought ThinLTO hashed the exported symbols? (sorry, I guess I'm out of
the loop)


>
>> Figuring out how many paths to strip from the source dir path seems like
>> a pain that we shouldn't push on users if we already have a solution. We
>> want PGO to be as easy to use as possible.
>>
>
> The idea is that most users (who do not generate source into different
> dirs or use relative source paths) never need to use/worry about it.
>

Okay. I'll take your word on that for now.

-- Sean Silva

>
> David
>
>>
>> -- Sean Silva
>>
>>
>>> I still think using a path name is a good choice.
>>>
>>> Second, I don't think D22028's use case is common. The typical use case
>>> for FDO is to have the same command line options and just replacing
>>> -fprofile-generate with -fprofile-use. D22028 has different source string,
>>> I believe the user has a better position to deal with this than the
>>> compiler.
>>>
>>> -Rong
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Feb 5, 2017 at 12:11 AM, Xinliang David Li <davidxl at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Feb 3, 2017 at 11:28 PM, Sean Silva <chisophugis at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 3, 2017 at 10:18 PM, Xinliang David Li <davidxl at google.com
>>>>> > wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 8:31 PM, Sean Silva via Phabricator <
>>>>>> reviews at reviews.llvm.org> wrote:
>>>>>>
>>>>>>> silvas added a comment.
>>>>>>>
>>>>>>> This change does two things (as you mention in the description):
>>>>>>>
>>>>>>> 1. Adding -static-func-strip-dirname-prefix which provides a way to
>>>>>>> have more control when `-static-func-full-module-prefix=true` is
>>>>>>> specified.
>>>>>>>
>>>>>>
>>>>>> This is actually a more general form of -static-func-full-mdoule-prefi
>>>>>> x.
>>>>>>
>>>>>>
>>>>>>> 2. Changing the default of -static-func-full-module-prefix to true.
>>>>>>>
>>>>>>> IIRC, -static-func-full-module-prefix defaults to false because it
>>>>>>> caused issues when set to true (in fact, it was introduced to avoid these
>>>>>>> issues). The default value of -static-func-strip-dirname-prefix
>>>>>>> introduced in this patch (i.e. 0) is effectively a no-op; so ignore 1. for
>>>>>>> now. This means that the net effect of this patch is that compilation will,
>>>>>>> by default, have a regression on the issue fixed by r275193 /
>>>>>>> http://reviews.llvm.org/D22028, which is not a good idea. I think
>>>>>>> that the default behavior (which is user-visible) should not be changed in
>>>>>>> this patch.
>>>>>>>
>>>>>>
>>>>>> I disagree. The original default behavior was to preserve the full
>>>>>> path which was also user visible :)
>>>>>>
>>>>>
>>>>> And yet we found a compelling-enough use case to change it.
>>>>>
>>>>
>>>> It was probably better to introduce the option but not flipping the
>>>> default the first time. The stripping-path-fully mode had not been widely
>>>> tested at that time.
>>>>
>>>> I am a little curious about the use case for D22028. The pgo name of
>>>> static function is only affected by source module path. Why would that be
>>>> different for pgo-gen/use builds? In most common setup I saw, the source
>>>> paths should remain the same.
>>>>
>>>>
>>>>> We may need to revisit that decision, but clearly the current default
>>>>> is intentional and part of changing away from that is explaining why we no
>>>>> longer care about that use case (or care about it less than some other
>>>>> thing).
>>>>>
>>>>
>>>> We care about all use cases, which is why the more general form of
>>>> option is introduced -- it makes sure the use case in D22028 can also be
>>>> handled but more safely (stripping all prefix will bound to cause problems).
>>>>
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> The whole rationale for changing the current default is that it is
>>>>>> generally not safe -- mainly problem #1 because of counter variables for
>>>>>> static functions can not guaranteed to be unique when full path is stripped.
>>>>>>
>>>>>
>>>>> Can the counter variables be static to match the static nature of the
>>>>> functions they describe? (there would still be collisions when indexing the
>>>>> profile data though; the function CFG hash could be included in the "name"
>>>>> to avoid this)
>>>>>
>>>>
>>>> For simple functions, cfg hash collision is also very likely, so the
>>>> first line of defense is always the name key.
>>>>
>>>>
>>>>>
>>>>>
>>>>>> The ThinLTO issue is secondary (probably irrelevant here because of
>>>>>> other bugs).
>>>>>>
>>>>>
>>>>>>
>>>>>> The issue addressed in D22028 is actually not common -- the source
>>>>>> module paths should generally match in profile-gen and profile-use phases,
>>>>>> so using internal option for that use case seems more reasonable to me.
>>>>>>
>>>>>
>>>>> This is subjective, but I think it is quite reasonable to assume that
>>>>> each build will use a different output directory. Hence any build that
>>>>> generates .cpp files into the output directory (which seems reasonable too)
>>>>> is susceptible.
>>>>>
>>>>
>>>> so the case is for generated source files? Should they be accessed with
>>>> relative paths?
>>>>
>>>>
>>>>>
>>>>> Overall, requiring a user to use a compiler-internal option for
>>>>> something that seems to happen in practice (e.g. back when I was a
>>>>> PlayStation we actually ran into it and spent time fixing it) is a pretty
>>>>> poor experience. I think we should aim to do better (though we might settle
>>>>> for less if that proves challenging).
>>>>>
>>>>
>>>> I agree in general. However I think it is reasonable for a user to use
>>>> an internal option for corner use cases.  Another choice is to introduce an
>>>> external option for this which user can rely on.
>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Overall, it sounds like this approach of relying on users to tweak
>>>>>>> internal compiler options (-mllvm) to get correct behavior in their
>>>>>>> environment is not the kind of user experience we want to deliver (or the
>>>>>>> kind of implementation that we want to maintain). IIRC, when we added
>>>>>>> -static-func-full-module-prefix, it was with the understanding that
>>>>>>> it was a simple hack for working around the larger issue of relying on the
>>>>>>> module name which we knew was not very robust. The further addition of the
>>>>>>> "InLTO" complicates things even further. It seems like a code smell that we
>>>>>>> do not have a Single Point Of Truth.
>>>>>>>
>>>>>>> I proposed a solution at one point https://groups.google.com/d/ms
>>>>>>> g/llvm-dev/s_VZbFTWbVs/d0b4Zh80CgAJ though it may no longer be
>>>>>>> applicable. It seems like ThinLTO already has to solve a problem of finding
>>>>>>> unique identifiers for all functions (even static), so we may want to
>>>>>>> piggy-back on that mechanism (this is just a high-level thought; haven't
>>>>>>> looked into the details).
>>>>>>>
>>>>>>>
>>>>>> For LTO/ThinLTO,  we solved the issue by using meta data which uses
>>>>>> getPGOFuncName as singe source of truth.
>>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>> So:
>>>>>>>
>>>>>>> - I specifically object to changing user-visible defaults in this
>>>>>>> patch. Those changes should be isolated, and I don't think we have
>>>>>>> justification to change those defaults anyway.
>>>>>>>
>>>>>>
>>>>>> See my reply about the safety issue of keeping the current default.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> - I'm slightly opposed to adding the -static-func-strip-dirname-prefix
>>>>>>> flag, since it seems like a workaround (among others that have already
>>>>>>> piled up) for a more fundamental issue. This is a frog-in-boiling-water
>>>>>>> situation; if solving the fundamental issue would be a huge amount of work,
>>>>>>> then adding the new flag is probably fine for now, but we need to keep in
>>>>>>> mind the larger situation. IIUC, defaulting `-static-func-strip-dirname-prefix=-1`
>>>>>>> would emulate the current default behavior, so
>>>>>>> -static-func-full-module-prefix could just be removed in the same
>>>>>>> patch.
>>>>>>>
>>>>>>
>>>>>> The current -static-func-full-module-path=false is simply a special
>>>>>> case of the new option. For users who rely on this option may hit the
>>>>>> correctness issue, they won't have any fallback without the new option.
>>>>>>
>>>>>>
>>>>>>> - I would encourage brainstorming/discussion of alternative
>>>>>>> solutions that solve the fundamental problem (which seems to be more about
>>>>>>> having a stable globally unique identifier than being specifically about
>>>>>>> preserving/mangling the "name" per se).
>>>>>>>
>>>>>>
>>>>>> The problem itself is simple: handle name conflicts between
>>>>>>
>>>>>> /a/b/c/foo.c:static_func
>>>>>> /e/f/g/foo.c:static_func
>>>>>>
>>>>>> Path info is a natural choice. Note that FE instrumentation also uses
>>>>>> module path to uniquely identify static_func as well.
>>>>>>
>>>>>
>>>>> Yes, the problem was inherited from FE instrumentation. I remember
>>>>> that when I explained to Justin the issue, he said that it was clearly
>>>>> buggy and not intentional (an oversight when implementing FEPGO).
>>>>>
>>>>
>>>> It works very well in practice -- though it is not guaranteed to be
>>>> 100% free of conflict. I won't label it as buggy.
>>>>
>>>>
>>>>>
>>>>> It seems that the fundamental issue is coming up with a unique
>>>>> identifier for the current TU that is stable across compiler invocations.
>>>>> How do other compilers handle this?
>>>>>
>>>>> GCC does not suffer from the problem because it does not dump profile
>>>> into one file but one profile file per module. The profile data file tree
>>>> structure mirrors the build output file structure so there will be problem
>>>> if profile-gen and use do not share the same structure. Runtime options are
>>>> provided to strip prefixes from output directories. Compiler time option is
>>>> also provided to relocate profile data (e.g. pointing to different root).
>>>>
>>>>
>>>>
>>>>> For example, path names are not enough. E.g. a user may build
>>>>> /a/b/c/foo.c with two different sets of compiler options, yet static
>>>>> functions of the same name must still be treated as separate. A file like:
>>>>>
>>>>> foo.c:
>>>>>
>>>>> static void PreprocessHelper(/* something */) {
>>>>> #ifdef USE_AVX
>>>>>   // something
>>>>> #else
>>>>>   // something else
>>>>> #endif
>>>>> }
>>>>>
>>>>> #ifdef USE_AVX
>>>>> void foo_avx(/* something */) {
>>>>> #else
>>>>> void foo_noavx(/* something */) {
>>>>> #endif
>>>>>   PreprocessHelper(/* something */);
>>>>>   // something
>>>>> }
>>>>>
>>>>>
>>>>> IIRC, one option (suggested by pcc if I remember correctly) is to use
>>>>> a hash of the TU's exported symbols (or something like that) to uniquely
>>>>> identify the TU.
>>>>>
>>>>
>>>> Should be limited to user defined public symbols.
>>>>
>>>>
>>>>> That seems more robust than a path name.
>>>>>
>>>>
>>>> I believe so.
>>>>
>>>> Content based ID has its advantage but has disadvantages too. For
>>>> instance more expensive to compute, less readable names. Using path based
>>>> naming, we can immediately identify where the static function is defined.
>>>> Perhaps we can use base name plus content hash.
>>>>
>>>> Teresa, Rong, do you see a situation when  module ID needs to be
>>>> identified but it is difficult or  too expensive to access the module's
>>>> content?
>>>>
>>>> David
>>>>
>>>>
>>>>> -- Sean Silva
>>>>>
>>>>>
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> https://reviews.llvm.org/D29512
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20170207/4a9384d0/attachment.html>


More information about the llvm-commits mailing list