[llvm-branch-commits] [hmaptool] Implement simple string deduplication (PR #102677)
via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Fri Aug 9 13:32:20 PDT 2024
llvmbot wrote:
@llvm/pr-subscribers-clang
Author: Shoaib Meenai (smeenai)
<details>
<summary>Changes</summary>
This deduplicates identical strings in the header map's string table,
which reduces the size of the generated header maps significantly (35%
measured internally). Further savings are possible through tail
deduplication, but the additional complexity isn't worth the gain IMO.
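
For illustration only, here is a minimal sketch of what tail deduplication could look like; it is not part of this PR. Because every entry in the string table is NUL-terminated, a string that is a suffix of an already stored string can point into the middle of that entry instead of being stored again. The class name `TailDedupStringTable` and its linear `str.find` scan are assumptions made for this example, not hmaptool code.

```python
class TailDedupStringTable:
    """Hypothetical variant of a string table that reuses entry tails."""

    def __init__(self):
        # Offset 0 is reserved to mean "empty bucket", as in hmaptool.
        self.table = "\0"

    def add(self, string):
        # Any occurrence of `string + "\0"` in the table is a valid
        # NUL-terminated copy of `string`, even in the middle of an entry.
        needle = string + "\0"
        # Search from index 1 so that offset 0 is never handed out.
        offset = self.table.find(needle, 1)
        if offset != -1:
            return offset

        offset = len(self.table)
        self.table += needle
        return offset


table = TailDedupStringTable()
full = table.add("foo/bar.h")
tail = table.add("bar.h")  # points into the tail of "foo/bar.h"
print(full, tail, len(table.table))
```

This sketch only finds a shared tail when the longer string was added first, and the scan makes each `add` linear in the table size. A real implementation would need to handle both, which is the kind of extra complexity the description refers to.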
---
Full diff: https://github.com/llvm/llvm-project/pull/102677.diff
1 File Affected:
- (modified) clang/utils/hmaptool/hmaptool (+23-8)
``````````diff
diff --git a/clang/utils/hmaptool/hmaptool b/clang/utils/hmaptool/hmaptool
index aa400e3dd64e9..2ca769a549bed 100755
--- a/clang/utils/hmaptool/hmaptool
+++ b/clang/utils/hmaptool/hmaptool
@@ -110,6 +110,24 @@ class HeaderMap(object):
yield (self.get_string(key_idx),
self.get_string(prefix_idx) + self.get_string(suffix_idx))
+class StringTable:
+    def __init__(self):
+        # A string table offset of 0 is interpreted as an empty bucket, so it's
+        # important we don't assign an actual string to that offset.
+        self.table = "\0"
+        # For the same reason we don't want the empty string having a 0 offset.
+        self.offsets = {}
+
+    def add(self, string):
+        offset = self.offsets.get(string)
+        if offset:
+            return offset
+
+        offset = len(self.table)
+        self.table += string + "\0"
+        self.offsets[string] = offset
+        return offset
+
###
def action_dump(name, args):
@@ -182,7 +200,7 @@ def action_write(name, args):
table = [(0, 0, 0)
for i in range(num_buckets)]
max_value_len = 0
- strtable = "\0"
+ strtable = StringTable()
for key,value in mappings.items():
if not isinstance(key, str):
key = key.decode('utf-8')
@@ -190,17 +208,14 @@ def action_write(name, args):
value = value.decode('utf-8')
max_value_len = max(max_value_len, len(value))
- key_idx = len(strtable)
- strtable += key + '\0'
+ key_idx = strtable.add(key)
prefix, suffix = os.path.split(value)
# This guarantees that prefix + suffix == value in all cases, including when
# prefix is empty or contains a trailing slash or suffix is empty (hence the use
# of `len(value) - len(suffix)` instead of just `-len(suffix)`.
prefix += value[len(prefix) : len(value) - len(suffix)]
- prefix_idx = len(strtable)
- strtable += prefix + '\0'
- suffix_idx = len(strtable)
- strtable += suffix + '\0'
+ prefix_idx = strtable.add(prefix)
+ suffix_idx = strtable.add(suffix)
hash = hmap_hash(key)
for i in range(num_buckets):
@@ -228,7 +243,7 @@ def action_write(name, args):
f.write(struct.pack(header_fmt, *header))
for bucket in table:
f.write(struct.pack(bucket_fmt, *bucket))
- f.write(strtable.encode())
+ f.write(strtable.table.encode())
def action_tovfs(name, args):
"convert a headermap to a VFS layout"
``````````
</details>
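
To see the effect of the change on a small input, the following sketch reproduces the `StringTable` class from the diff and compares it with the previous append-only behaviour. The helpers `split_value`, `naive_table` and `deduped_table` and the sample mappings are made up for this example; only `StringTable` and the prefix/suffix split come from the patch.

```python
import os


class StringTable:
    # Same logic as the class added in the diff above.
    def __init__(self):
        # Offset 0 means "empty bucket", so the table starts with a NUL.
        self.table = "\0"
        self.offsets = {}

    def add(self, string):
        offset = self.offsets.get(string)
        if offset:
            return offset

        offset = len(self.table)
        self.table += string + "\0"
        self.offsets[string] = offset
        return offset


def split_value(value):
    # Mirrors the split in action_write: prefix + suffix == value.
    prefix, suffix = os.path.split(value)
    prefix += value[len(prefix) : len(value) - len(suffix)]
    return prefix, suffix


def naive_table(mappings):
    # Old behaviour: every key, prefix and suffix is appended unconditionally.
    table = "\0"
    for key, value in mappings.items():
        for s in (key, *split_value(value)):
            table += s + "\0"
    return table


def deduped_table(mappings):
    # New behaviour: identical strings share one table entry.
    strtable = StringTable()
    for key, value in mappings.items():
        for s in (key, *split_value(value)):
            strtable.add(s)
    return strtable.table


# Hypothetical sample input; real header maps are much larger.
mappings = {
    "Foo.h": "/src/include/Foo.h",
    "Bar.h": "/src/include/Bar.h",
}
print(len(naive_table(mappings)), len(deduped_table(mappings)))
```

In this input each key equals the suffix of its value and both values share a prefix, which is the typical shape of header map entries and the reason exact-match deduplication helps.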
https://github.com/llvm/llvm-project/pull/102677