[LLVMdev] debugloc metadata variation

Duncan P. N. Exon Smith dexonsmith at apple.com
Thu Oct 23 09:43:00 PDT 2014

> On 2014-Oct-23, at 09:19, David Blaikie <dblaikie at gmail.com> wrote:
> (sorry for the duplicate Fred, I failed at reply-all the first time)
> On Wed, Oct 22, 2014 at 6:33 PM, Frédéric Riss <friss at apple.com> wrote:
> > On Oct 22, 2014, at 4:57 PM, David Blaikie <dblaikie at gmail.com> wrote:
> >
> > Just working on some of the gmlt+fission debug info stuff and I came across a comment that might be relevant to reducing the number of distinct debugloc metadata nodes:
> >
> > "or some sub-optimal metadata that
> >   // isn't structurally identical (see: file path/name info from clang, which
> >   // includes the directory of the cpp file being built, even when the file name
> >   // is absolute (such as an <> lookup header)))"
> >
> > Seems that the file path/name isn't well canonicalized so as to allow metadata level merging when linking. Might be helpful to figure out that issue at some point.
> Incidentally, I worked on an issue last week where the line table would get entries representing the same file, but where the file/dir split wasn’t done at the same place. I have a patch that remerges them at emission, but I was planning on investigating the source of the duplication further before I submit anything.
> The cases I’ve seen have one duplicated entry though, nothing that could have a visible impact on memory consumption.
> So the particular case where I think this arises in a way that might be measurable is if you have a build system that changes directories to build subprojects (like our make build system, if I understand correctly - but not our cmake build system, again, if I understand correctly):
> imagine a simple directory layout:
>   include/
>     foo.h
>   lib/
>     a/
>       a.cpp // includes foo.h and calls one inline function from it (or uses a type, etc) from some external function a()
>     b/
>       b.cpp // does the same thing as a.cpp, but with its own external function, b()
> if you run "clang++ -emit-llvm -S -Iinclude -c lib/a/a.cpp lib/b/b.cpp -g" you get two .ll files both with the obvious:
> !9 = metadata !{metadata !"include/foo.h", metadata !"/tmp/dbginfo/pathtest"}
> But if you do this instead: "cd lib/a; clang++ -emit-llvm -S -I../../include -c a.cpp -g; cd ../../lib/b; clang++ -emit-llvm -S -I../../include -c b.cpp -g" you get two different nodes:
> !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/b"}
> !9 = metadata !{metadata !"../../include/foo.h", metadata !"/tmp/dbginfo/pathtest/lib/a"}
> and now you're left with a situation in which almost all the metadata is different and any place you were relying on the standard metadata uniquing you won't get it :(

This might be fixed by making `MDFile` (or `DIFile`) first-class.  We
just need to canonicalize on creation.

    class MDFile {
    public:
      // Split the path at the right place.
      static MDFile *get(LLVMContext &C, StringRef Path);

      // Convenience for callers, but the path gets canonicalized.
      static MDFile *get(LLVMContext &C, StringRef File, StringRef Dir);

      StringRef getFilename() const;
      StringRef getDirectory() const;
    };

Note that whether we continue to use `MDString` under the hood is an
implementation detail.

However, path canonicalization (in particular, eating "..") requires
a `stat()` to do correctly on *NIX, since the component before the
".." may be a symlink.  The implementation would therefore have to
cache lookups.  Doesn't seem hard though.
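
For illustration only, here is a rough sketch of what cached
canonicalization could look like.  Everything in it is an assumption,
not existing LLVM code: `PathCanonicalizer` is a hypothetical helper,
it uses C++17 `std::filesystem` rather than LLVM's own path utilities,
and it splits the result at the last path component, which may or may
not be the "right place" for debug info.

```cpp
#include <cstddef>
#include <filesystem>
#include <string>
#include <system_error>
#include <unordered_map>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical helper (not an LLVM API): canonicalizes (Dir, File) pairs and
// caches the result so each distinct input touches the filesystem only once.
class PathCanonicalizer {
  using DirAndFile = std::pair<std::string, std::string>;
  std::unordered_map<std::string, DirAndFile> Cache;

public:
  // Returns {directory, filename} for the canonical form of Dir/File.
  const DirAndFile &canonicalize(const std::string &Dir,
                                 const std::string &File) {
    // operator/ keeps File as-is when it is already absolute.
    fs::path Joined = fs::path(Dir) / File;
    std::string Key = Joined.string();

    auto It = Cache.find(Key);
    if (It != Cache.end())
      return It->second;

    // weakly_canonical resolves symlinks for the leading part of the path
    // that exists and normalizes the remainder lexically.
    std::error_code EC;
    fs::path Canon = fs::weakly_canonical(Joined, EC);
    if (EC)
      Canon = Joined.lexically_normal(); // purely lexical fallback

    // Assumed split point: everything up to the last component is the
    // directory, the last component is the file name.
    DirAndFile Split(Canon.parent_path().string(), Canon.filename().string());
    return Cache.emplace(std::move(Key), std::move(Split)).first->second;
  }

  std::size_t cacheSize() const { return Cache.size(); }
};
```

With something along these lines, the two `foo.h` nodes from David's
example (directories ending in `lib/a` and `lib/b`, both with file name
`../../include/foo.h`) would canonicalize to the same directory/file
pair, so ordinary metadata uniquing could merge them.  Note the cache
is keyed on the joined input path, so it only saves repeated lookups
of the same spelling.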
