[PATCH] D17028: [ThinLTO] Use MD5 hash in function index.

Teresa Johnson via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 9 21:37:15 PST 2016


tejohnson added inline comments.

================
Comment at: include/llvm/IR/FunctionInfo.h:151
@@ -151,1 +150,3 @@
+/// Map from function GUID to corresponding function info structures.
+typedef DenseMap<uint64_t, FunctionInfoList> FunctionInfoMapTy;
 
----------------
davidxl wrote:
> I am a little concerned with using the DenseMap data structure here.
> 
> Both DenseMap and StringMap are implemented as open hash tables with quadratic probing. The load factor of both is guaranteed to be < 0.75 -- which means there are lots of empty buckets in a large table. The main differences are:
> 
> 1) StringMap uses indirection -- each bucket contains a pointer to the key-value pair object, so the bucket size is small and copying is cheap when rehashing happens. In DenseMap, however, the key-value pair is embedded inside the bucket, so if the value type is large it wastes a lot of memory, and copying a DenseMap can also incur large overhead.
> 2) When resizing, DenseMap rounds the new size up to the next power of 2, which further increases the memory overhead.
> 
> Another bad side effect is that if elements are inserted into the map one by one, it incurs lots of reallocation operations (just like a vector) unless the size of the map is known beforehand and the map is properly sized at the start.
> 
> I suggest using std::map if the size of the map is not known a priori. If it is known, and if map lookups only happen after the map is fully built, it might be better to use a vector (push_back) followed by a sort once it is populated.
Good point, I hadn't compared the relative overheads of these data structures.

Unfortunately, we don't know the size of this map ahead of time, e.g. when merging multiple indexes to create the combined index. We also do lookups while adding entries (e.g. when creating the combined index we may encounter multiple comdats with the same name/GUID).

So it sounds like the best bet for now is to use std::map, and we can investigate alternatives that might involve restructuring the accesses if this turns out not to be efficient enough.
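For illustration, here is a minimal, self-contained sketch of the std::map-based variant. The FunctionInfo/FunctionInfoList definitions below are placeholders for the sake of a compilable example, not the actual types from FunctionInfo.h:

#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Placeholder standing in for llvm::FunctionInfoList (the per-GUID list of
// function info structures in the real patch); not the actual definition.
struct FunctionInfo { std::string ModulePath; };
typedef std::vector<FunctionInfo> FunctionInfoList;

/// Map from function GUID to corresponding function info structures.
/// std::map is node-based: inserting elements one at a time never rehashes
/// or copies existing entries and wastes no bucket space, at the cost of
/// O(log n) lookups.
typedef std::map<uint64_t, FunctionInfoList> FunctionInfoMapTy;

int main() {
  FunctionInfoMapTy Index;
  // Lookup-while-inserting pattern used when building the combined index:
  // operator[] creates an empty list on first use; later hits append to it
  // (e.g. comdat copies of the same function coming from different modules).
  Index[0x1234abcdULL].push_back({"a.o"});
  Index[0x1234abcdULL].push_back({"b.o"});
  return (Index.size() == 1 && Index[0x1234abcdULL].size() == 2) ? 0 : 1;
}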

================
Comment at: include/llvm/IR/Module.h:200
@@ -197,1 +199,3 @@
 
+  /// Get the module's original source file name, either the same as the
+  /// ModuleID when compiling from source, or from name recorded in bitcode.
----------------
davidxl wrote:
> This comment is not quite clear in its meaning.
The point I'm trying to make is that when compiling from bitcode, the source file name comes from the new bitcode record; otherwise it is the name of the input file (which in that case is the same as the ModuleID). E.g.:

clang -c foo.c -flto=thin // ModuleID = SourceFileName = foo.c
(foo.o is bitcode with new MODULE_SOURCE_FILENAME record = "foo.c")
clang foo.o -flto=thin // ModuleID = foo.o, SourceFileName = foo.c
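To make that distinction concrete, here is a small standalone sketch. The class below is only a stand-in for the relevant corner of llvm::Module; treat its constructor and members as assumptions rather than the real interface:

#include <string>
#include <utility>

// Hypothetical stand-in for llvm::Module, for illustration only.
class Module {
  std::string ModuleID;        // name of the input file passed to clang
  std::string SourceFileName;  // original source file, recorded in bitcode
public:
  Module(std::string ID, std::string SrcName)
      : ModuleID(std::move(ID)), SourceFileName(std::move(SrcName)) {}

  const std::string &getModuleIdentifier() const { return ModuleID; }

  /// The module's original source file name: the same as the ModuleID when
  /// compiling from source, read back from the MODULE_SOURCE_FILENAME
  /// record when compiling from bitcode.
  const std::string &getSourceFileName() const { return SourceFileName; }
};

int main() {
  Module FromSource("foo.c", "foo.c");  // clang -c foo.c -flto=thin
  Module FromBitcode("foo.o", "foo.c"); // clang foo.o -flto=thin
  // In both steps the source file name stays "foo.c" even though the
  // ModuleID differs.
  return FromSource.getSourceFileName() == FromBitcode.getSourceFileName()
             ? 0 : 1;
}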

================
Comment at: test/tools/gold/X86/thinlto.ll:27
@@ -26,3 +26,3 @@
 ; COMBINED-NEXT: <VALUE_SYMTAB
-; COMBINED-NEXT: <COMBINED_FNENTRY {{.*}} record string = '{{f|g}}'
-; COMBINED-NEXT: <COMBINED_FNENTRY {{.*}} record string = '{{f|g}}'
+; COMBINED-NEXT: <COMBINED_FNENTRY {{.*}}
+; COMBINED-NEXT: <COMBINED_FNENTRY {{.*}}
----------------
davidxl wrote:
> test [offset, guid] pattern here?
I'm afraid testing the specific offsets would be brittle, unless I just check for any integer in that field. I could test for the GUIDs, but I would have to handle either ordering, since the order in the combined index is not currently guaranteed. Let me go ahead and check for the GUIDs, as that maintains the same level of checking we had with the function names.
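For reference, here is a sketch of how a function name maps to a 64-bit GUID via MD5 (written against the present-day llvm/Support/MD5.h interface, which may differ in detail from the helper this patch adds):

#include "llvm/ADT/StringRef.h"
#include "llvm/Support/MD5.h"
#include "llvm/Support/raw_ostream.h"
#include <cstdint>

// Take the low 64 bits of the MD5 digest of the (global) name as the GUID.
static uint64_t computeGUID(llvm::StringRef Name) {
  llvm::MD5 Hash;
  Hash.update(Name);
  llvm::MD5::MD5Result Result;
  Hash.final(Result);
  return Result.low();
}

int main() {
  // 'f' and 'g' from the test get stable numeric GUIDs, which the CHECK
  // lines can match (in either order) instead of matching the names.
  llvm::outs() << "f -> " << computeGUID("f") << "\n";
  llvm::outs() << "g -> " << computeGUID("g") << "\n";
  return 0;
}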


http://reviews.llvm.org/D17028