[cfe-dev] How to generate Unique Module identifier

Hubert Tong via cfe-dev cfe-dev at lists.llvm.org
Wed Jun 3 15:55:50 PDT 2020


On Wed, Jun 3, 2020 at 3:10 PM Sriraman Tallam <tmsriram at google.com> wrote:

>
>
> On Wed, Jun 3, 2020 at 8:30 AM Xiangling Liao <xiangxdh at gmail.com> wrote:
>
>> ping.
>>
>> ---------- Forwarded message ---------
>> From: Xiangling Liao <xiangxdh at gmail.com>
>> Date: Fri, May 29, 2020 at 3:15 PM
>> Subject: [cfe-dev] How to generate Unique Module identifier
>> To: <cfe-dev at lists.llvm.org>
>> Cc: <hubert.reinterpretcast at gmail.com>
>>
>>
>> Hi All,
>>
>> There have been recent discussions about how to generate unique module
>> identifiers which can be embedded in AIX static init function names.
>>
>> On AIX, static init functions are sinit/sterm pairs looking like this:
>>
>>
>> __sinit<priority #>_<unique module identifier>
>> __sterm<priority #>_<unique module identifier>
>>
>> There is one sinit/sterm pair per priority number for each module.
>>
>> The AIX linker collects static init functions simply based on their name.
>> So we need to guarantee that each module has its own unique sinit/sterm
>> pairs. To achieve that, we need a unique module identifier which will be
>> used as a part of static init function name as suffix.
>>
>> Our several thoughts about this so far are as follows:
>>
>> *1. `getUniqueModuleId` function to generate unique module identifier*
>> https://llvm.org/doxygen/ModuleUtils_8cpp_source.html#l00255
>>
>> *“Produce unique identifier for a module by taking the MD5 sum of the
>> names of the module's strong external symbols. However, if the module has
>> no strong external symbols (such a module may still have a semantic effect
>> if it performs global initialization), we cannot produce a unique
>> identifier for this module, so we return the empty string.”*
>>
>> Issues with this `getUniqueModuleId` function are:
>> (1) This function considers neither `Internal linkage` nor `WeakODR
>> linkage` global variables, so it cannot return a string for the
>> following cases:
>> 1)
>>
>>     class test {
>>     public:
>>       test();
>>       ~test();
>>     };
>>     static test t;  // Internal linkage
>>
>> 2)
>>
>>     extern "C" int puts(const char *);
>>     template <typename = void>
>>     struct A {
>>       A() { puts("hello\n"); }
>>       ~A() { puts("bye\n"); }
>>       static A instance;
>>     };
>>     template <typename T> A<T> A<T>::instance;
>>     template A<> A<>::instance;  // WeakODR linkage
>>
>> (2) Even if we add our own version of `getUniqueModuleId` that handles the
>> above linkage types, the biggest issue is that content-based hashing won't
>> work when two modules have identical internal-linkage content.
>>
>>
>> *2. Source filename string as the module identifier*
>> The `source filename` string is set to the original module identifier,
>> which will be the name of the compiled source file when compiling from
>> source through the clang front end.
>> [https://releases.llvm.org/10.0.0/docs/LangRef.html#source-filename]
>>
>> That means if we have multiple objects compiled with the same
>> command-line source file path, they get the same module identifier, so the
>> static init function names are not guaranteed to be unique.
>>
>> Also, there's the "Unique Names for Functions with Internal Linkage" patch
>> <https://reviews.llvm.org/D73307?vs=on&id=262801&whitespace=ignore-most#change-Qs7eGduOQs42>,
>> whose solution does not guarantee uniqueness either.
>>
>
> I just have a few thoughts.  I worked on the unique names patch for
> internal linkage functions.
>
> 1)  Low probability of collisions:  I was only interested in reducing the
> probability of internal linkage functions getting the same names.  In the
> context of PGO/FDO, this is useful because the profile information can be
> attributed to the right instance of the function.  While the Unique names
> solution does not guarantee uniqueness, it makes the probability of a
> collision really small in practice.
>
We would need more guaranteed uniqueness than this. It would not be good
for packaged static libraries to have symbols that collide by accident with
the user program or with other static libraries.
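
To make the collision concern concrete, here is a minimal sketch of the
naming pattern quoted above. The helper names (`makeSinitName`,
`makeStermName`, `collides`) are hypothetical, and printing the priority as a
plain decimal number is an assumption; the real encoding may differ. The point
is only that two modules with the same priority and the same identifier end up
with the same symbol name.

```cpp
#include <string>

// Hypothetical helpers (not LLVM APIs): build names following the
// __sinit<priority #>_<unique module identifier> pattern.
// Assumption: the priority is rendered as a plain decimal number.
std::string makeSinitName(unsigned priority, const std::string &moduleId) {
  return "__sinit" + std::to_string(priority) + "_" + moduleId;
}

std::string makeStermName(unsigned priority, const std::string &moduleId) {
  return "__sterm" + std::to_string(priority) + "_" + moduleId;
}

// Two modules would be indistinguishable to a linker that collects static
// init functions purely by name if their generated names are equal.
bool collides(unsigned priorityA, const std::string &idA,
              unsigned priorityB, const std::string &idB) {
  return makeSinitName(priorityA, idA) == makeSinitName(priorityB, idB);
}
```

For example, `collides(65535, "util.cpp", 65535, "util.cpp")` is true, which
is the situation a static library and a user program could hit if both were
built from a file with the same identifier.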


>
> 2) Name stability:  We do not want the symbol names to constantly change
> either and there should be some amount of stability.  This is because we
> generate profiles with one version of the source and use it to optimize a
> later version.  Name changes across versions could make the profiles for
> those functions useless.  In your case, how important is stability?
>
Stability is important for keeping the relative ordering of C++
initialization/destruction for non-locals reasonably the same between
builds.


>
> 3) Using the file system's attributes where possible:  Just spitballing
> here, how about using, say, the inode number in the hash for the symbol on
> Linux, and similar attributes on other file systems?  It looks like this
> could be kept stable and would handle the problem of identical source names.
>
I believe we would want stability to extend to having the original source
tree moved to another directory and a different source tree placed where
the original source tree was in the directory structure.


>
>
> Thanks
> Sri
>
>>
>> *3. Using the information around the compilation process itself*
>> Using information about the compilation process itself (PID, timestamp)
>> can give us unique module identifiers, but it could be problematic for
>> reproducibility.
>>
>> *4. Source file full path + OutputFile name following the -o option*
>> Another promising option is to use the source file full path plus the
>> OutputFile name following the -o option as something to hash on or as a
>> suffix for static init functions on AIX.
>>
>> We haven’t found any precedent in LLVM for doing so. It would also require
>> us to pass the OutputFile name given to -o from `FrontendOpts` to
>> `llvm::Module`, just as we pass each `Input` from `FrontendOpts.Inputs` to
>> `llvm::Module` as SourceFileName.
>> https://llvm.org/doxygen/Module_8cpp_source.html#l00073
>>
>>
>>
>> Any thoughts about what to hash on or encode into the unique ID we need?
>>
>> Please let me know if there are any questions as well. Your feedback is
>> appreciated.
>>
>> Regards,
>>
>> Xiangling Liao
>>
>
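
As a rough illustration of option 4 in the quoted message, the sketch below
hashes the full source path together with the -o output name. Everything here
is an assumption for illustration: `makeModuleId` is a hypothetical helper,
and FNV-1a is used only as a small self-contained stand-in hash (it is not the
MD5 that `getUniqueModuleId` uses, and nothing like this exists in LLVM as
far as this thread establishes).

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a, 64-bit. A stand-in hash chosen for brevity, NOT the MD5 used by
// getUniqueModuleId.
std::uint64_t fnv1a64(const std::string &data) {
  std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
  for (unsigned char c : data) {
    hash ^= c;
    hash *= 1099511628211ULL; // FNV prime
  }
  return hash;
}

// Hypothetical: derive a module identifier from the full source path and the
// output file name given to -o. Only fixed inputs are used (no PID or
// timestamp), so rebuilding with the same command gives the same identifier.
std::string makeModuleId(const std::string &sourcePath,
                         const std::string &outputFile) {
  std::string key = sourcePath;
  key.push_back('\0'); // separator between the two components
  key += outputFile;
  char buf[17];
  std::snprintf(buf, sizeof buf, "%016llx",
                static_cast<unsigned long long>(fnv1a64(key)));
  return buf; // 16 lowercase hex digits
}
```

Two caveats follow from the discussion above. A hash only lowers the chance of
a collision, it does not guarantee uniqueness. And because the full path is
part of the input, moving the source tree to another directory would change
the identifier.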