[cfe-dev] How to generate Unique Module identifier

Sriraman Tallam via cfe-dev cfe-dev at lists.llvm.org
Wed Jun 3 12:09:48 PDT 2020


On Wed, Jun 3, 2020 at 8:30 AM Xiangling Liao <xiangxdh at gmail.com> wrote:

> ping.
>
> ---------- Forwarded message ---------
> From: Xiangling Liao <xiangxdh at gmail.com>
> Date: Fri, May 29, 2020 at 3:15 PM
> Subject: [cfe-dev] How to generate Unique Module identifier
> To: <cfe-dev at lists.llvm.org>
> Cc: <hubert.reinterpretcast at gmail.com>
>
>
> Hi All,
>
> There have been recent discussions about how to generate unique module
> identifiers which can be embedded in AIX static init function names.
>
> On AIX, static init functions are sinit/sterm pairs looking like this:
>
> __sinit<priority #>_<unique module identifier>
> __sterm<priority #>_<unique module identifier>
>
> There is one sinit/sterm pair per priority number for each module.
>
> The AIX linker collects static init functions based solely on their names,
> so we need to guarantee that each module has its own unique sinit/sterm
> pairs. To achieve that, we need a unique module identifier to use as the
> suffix of each static init function name.
>
> Our thoughts on this so far are as follows:
>
> *1. `getUniqueModuleId` function to generate unique module identifier*
> <https://llvm.org/doxygen/ModuleUtils_8cpp_source.html#l00255>
>
> *“Produce unique identifier for a module by taking the MD5 sum of the
> names of the module's strong external symbols. However, if the module has
> no strong external symbols (such a module may still have a semantic effect
> if it performs global initialization), we cannot produce a unique
> identifier for this module, so we return the empty string.”*
>
> Issues with this `getUniqueModuleId` function are:
> (1) Because this function considers neither `Internal linkage` nor
> `WeakODR linkage` global variables, it cannot return a non-empty string
> for the following cases:
>
> 1)
>
> class test {
> public:
>   test();
>   ~test();
> };
> static test t;  // Internal linkage
>
> 2)
>
> extern "C" int puts(const char *);
> template <typename = void>
> struct A {
>   A() { puts("hello\n"); }
>   ~A() { puts("bye\n"); }
>   static A instance;
> };
> template <typename T> A<T> A<T>::instance;
> template A<> A<>::instance;  // WeakODR linkage
>
> (2) Even if we add our own version of `getUniqueModuleId` that handles the
> above linkage types, the biggest issue is that content-based hashing won't
> work when two modules have identical internal-linkage content.
>
>
> *2. Source filename string as the module identifier*
> The `source filename` string is set to the original module identifier,
> which will be the name of the compiled source file when compiling from
> source through the clang front end.
> [https://releases.llvm.org/10.0.0/docs/LangRef.html#source-filename]
>
> That means if multiple objects are compiled with the same command-line
> source file path, they get the same module identifier, so the static init
> functions are not guaranteed to be unique.
>
> Also, there is the *Unique Names for Functions with Internal Linkage* patch
> <https://reviews.llvm.org/D73307?vs=on&id=262801&whitespace=ignore-most#change-Qs7eGduOQs42>,
> whose solution does not guarantee uniqueness either.
>

I just have a few thoughts.  I worked on the unique names patch for
internal linkage functions.

1)  Low probability of collisions:  I was only interested in reducing the
probability of internal linkage functions getting the same names.  In the
context of PGO/FDO, this is useful because the profile information can be
attributed to the right instance of the function.  While the unique names
solution does not guarantee uniqueness, it makes the probability of a
collision really small in practice.

2) Name stability:  We do not want the symbol names to constantly change
either and there should be some amount of stability.  This is because we
generate profiles with one version of the source and use it to optimize a
later version.  Name changes across versions could make the profiles for
those functions useless.  In your case, how important is stability?

3) Using the file system's attributes where possible:  Just spitballing
here, but how about using, say, the inode number in the hash for the symbol
on Linux, and similar attributes on other file systems?  It looks like this
could be kept stable and would handle the problem of identical source names.
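
A minimal sketch of idea 3, assuming a POSIX system where `stat` exposes
`st_dev` and `st_ino`.  The helper names (`fnv1a`, `hashModuleIdentity`,
`moduleIdFromFile`) and the choice of FNV-1a are assumptions made purely for
illustration; none of this is taken from LLVM or from the unique names patch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <sys/stat.h>  // POSIX only

// 64-bit FNV-1a: a simple non-cryptographic hash, used here only as a stand-in.
inline std::uint64_t fnv1a(const void *Data, std::size_t Len,
                           std::uint64_t Hash = 0xcbf29ce484222325ULL) {
  const unsigned char *Bytes = static_cast<const unsigned char *>(Data);
  for (std::size_t I = 0; I < Len; ++I) {
    Hash ^= Bytes[I];
    Hash *= 0x100000001b3ULL;
  }
  return Hash;
}

// Hypothetical helper: mixes the source path with a (device, inode) pair and
// returns the result as 16 hex digits.
inline std::string hashModuleIdentity(const std::string &SourcePath,
                                      std::uint64_t Device,
                                      std::uint64_t Inode) {
  assert(!SourcePath.empty() && "expected a source path");
  std::uint64_t Hash = fnv1a(SourcePath.data(), SourcePath.size());
  Hash = fnv1a(&Device, sizeof(Device), Hash);
  Hash = fnv1a(&Inode, sizeof(Inode), Hash);
  char Buffer[17];
  std::snprintf(Buffer, sizeof(Buffer), "%016llx",
                static_cast<unsigned long long>(Hash));
  return Buffer;
}

// Hypothetical helper: looks up the file's device and inode numbers via stat.
// Returns an empty string if the file cannot be examined.
inline std::string moduleIdFromFile(const std::string &SourcePath) {
  struct stat St;
  if (::stat(SourcePath.c_str(), &St) != 0)
    return std::string();
  return hashModuleIdentity(SourcePath, static_cast<std::uint64_t>(St.st_dev),
                            static_cast<std::uint64_t>(St.st_ino));
}
```

Two files with the same command-line path but different inodes hash to
different identifiers here.  Whether inode numbers stay stable across fresh
checkouts or copies is a separate question that this sketch does not settle.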


Thanks
Sri

>
> *3. Using the information around the compilation process itself*
> Using information from the compilation process itself (PID, timestamp)
> can give us unique module identifiers, but it could be problematic for
> reproducibility.
>
> *4. Source file full path + OutputFile name following the -o option*
> Another promising option is to use the source file full path plus the
> OutputFile name following the -o option as something to hash on or as a
> suffix for static init functions on AIX.
>
> We have not found any precedent in LLVM for doing so. It would also
> require us to pass the -o option's OutputFile name from `FrontendOpts` to
> `llvm::Module`, just as we pass each `Input` from `FrontendOpts.Inputs` to
> `llvm::Module` as SourceFileName.
> <https://llvm.org/doxygen/Module_8cpp_source.html#l00073>
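
Option 4 above could be sketched roughly as follows.  Everything here is an
assumption for illustration: the helper names `makeModuleId` and
`makeStaticInitName` are hypothetical, FNV-1a stands in for whatever hash
would really be used, and the priority is printed as plain decimal because
the exact encoding is not specified in this thread:

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: hashes the source file's full path together with the
// output file name given after -o (64-bit FNV-1a, chosen only for brevity).
inline std::string makeModuleId(const std::string &SourceFullPath,
                                const std::string &OutputFile) {
  assert(!SourceFullPath.empty() && !OutputFile.empty());
  std::uint64_t Hash = 0xcbf29ce484222325ULL;
  auto Mix = [&Hash](const std::string &Text) {
    for (unsigned char Byte : Text) {
      Hash ^= Byte;
      Hash *= 0x100000001b3ULL;
    }
    // Mix in a NUL separator (XOR with 0 is a no-op, so only the multiply
    // remains) so that ("ab", "c") and ("a", "bc") feed different bytes.
    Hash *= 0x100000001b3ULL;
  };
  Mix(SourceFullPath);
  Mix(OutputFile);
  char Buffer[17];
  std::snprintf(Buffer, sizeof(Buffer), "%016llx",
                static_cast<unsigned long long>(Hash));
  return Buffer;
}

// Hypothetical helper: builds "__sinit<priority>_<id>" or
// "__sterm<priority>_<id>" from a priority number and a module identifier.
inline std::string makeStaticInitName(bool IsInit, unsigned Priority,
                                      const std::string &ModuleId) {
  return std::string(IsInit ? "__sinit" : "__sterm") +
         std::to_string(Priority) + "_" + ModuleId;
}
```

With this scheme the same source compiled to `a.o` and to `b.o` gets two
different suffixes, and the result is reproducible across runs.  It still
collides when both the source path and the output name are identical.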
>
>
>
> Any thoughts about what to hash on or encode into the unique ID we need?
>
> Please let me know if there are any questions as well. Your feedback is
> appreciated.
>
> Regards,
>
> Xiangling Liao
>