[cfe-dev] [RFC] UBSan check data size reductions

Wed Apr 27 12:15:07 PDT 2016

Hi Richard,

On Wed, Apr 27, 2016 at 1:22 AM, Richard Smith <richard at metafoo.co.uk> wrote:
> I think the 0003 patch makes a lot of sense. I wish I'd done that in the
> first place! (You'll need a corresponding change to compiler-rt of course.)
>
> For the 0002 patch, could we use a different representation instead, to
> deduplicate the common suffixes? Something like:
>
>   struct Path { Path *parent; const char *content; };
>
> Split each path on each '/' (or some smarter heuristic), and produce:
>
>   Path p0 { nullptr, "/path/to/build/" };
>   Path p1 { &p0, "foo/" };
>   Path p2 { &p1, "bar/" };
>   Path foo_bar_baz_h { &p2, "baz.h" };
>   Path foo_bar_quux_h { &p2, "quux.h" };
>
> The downside of this is increased relocation size, which is already a
> significant problem for ubsan binaries. I'm not sure how much of that we can
> fix without teaching the linker about ubsan data.

I'm not sure it's worth it to add so much complexity when the maximum
savings of my filename patch (keeping only the last path component) is
between 7% and 20% of the rodata section.
It's also not possible to figure out a proper path component split
from inside one of clang's invocations, since we wouldn't know what
paths the program would have. It would have to be dealt with by having
clang tag the strings + an llvm pass to optimize them in a way we knew
UBSan understood (implying LTO is needed, btw).

If we were to do this per translation unit, the savings would most
likely be even smaller (or be worse than just having the whole file
names. e.g: Support/APFloat.cpp, Transforms/IPO/ConstantMerge.cpp
(both in llvm), and CGExpr.cpp (random files I tried this on) only
have checks in themselves, not in any header. So the only path emitted
is the one file's path).

Each filename is emitted once per translation unit anyway (or gets
merged to one instance per TU later), hence the not-so-big (but still
nice to have!) savings with the 0002-* patch.
Any asserts you might have might already be emitting the full pathname
anyway, too.

A quick check on clang (this was with a quick script, I might have
messed up) tells me that the Path complexity could probably halve (at
most. I got ~54% space usage) the string data for this (a pointer pair
per "split" in patch components (common prefix, different suffix)),
while stripping every path component but the last would make the
string data for filenames use ~27% of the space needed for the full
file names.

This is assuming the full path is only used by UBSan. If it's not, it
gets merged on the current solution, but gets added again on the Path
structure solution.

> The 0001 patch makes me wonder if we can move (most of) the ubsan data into
> a section that's not mapped into the program's address space on startup, and
> somehow lazily read it from the binary if a check fails. This will need some
> work to cope with our duplicate suppression mechanism (which writes back to
> the ubsan data), but if you're having address space issues, this could help
> substantially.
I'm unsure that would be a usable case for people that have these limitations.
In some systems, it might be impossible (for UBSan alone) to get the
path to the executable, either because that information isn't
available by default, or because we don't have the image in a
filesystem.
We'd also be mapping/unmapping file areas a lot (in systems that can),
which would be very noticeable.

I think I will go ahead with the 0003 patch and upload a proper
version of it soon (and the corresponding compiler-rt patch. That one
is done already :-) ).
As for the 0002 patch, I think I'll upload it too and we can keep the
"alternate version" discussion either here or in the other patch.

I'll probably only upload them tomorrow so feel free to comment here
instead of waiting for me to upload them if you have any comments.

Thank you,

  Filipe

> On Tue, Apr 26, 2016 at 9:49 AM, Filipe Cabecinhas via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
>>
>> Hi all,
>>
>> I’m proposing two ways to make ubsan instrumented code (really data…)
>> smaller.
>>
>> Why?
>> Our platform has limited resources and game teams have to play within
>> those to create their games. During debugging, they might have some
>> more relaxed limits, but they still have a budget they need to stick
>> to. And they tend to use everything they can!
>> The size of code+data counts towards that budget, and ubsan is very
>> happy to make code+data a lot bigger when enabled. This can end up
>> posing some problems and making ubsan harder to use than we would
>> like.
>> This might not be as useful for the usual operating systems people
>> use, but it might help other people with space constraints enable
>> ubsan more easily.
>>
>>
>> Biggest space “wasters”:
>> A quick objdump + grep found that not all check handlers are equal
>> (checks inserting >10k calls to handler):
>>
>> clang (OS X):
>> 793539 ___ubsan_handle_type_mismatch_abort
>> 24204 ___ubsan_handle_load_invalid_value_abort
>> (4127 ___ubsan_handle_builtin_unreachable)
>> …
>>
>> Game:
>> 837785 __ubsan_handle_type_mismatch
>> 20391 __ubsan_handle_load_invalid_value
>> 16037 __ubsan_handle_out_of_bounds
>> 11357 __ubsan_handle_add_overflow
>> …
>>
>>
>> Savings:
>> First, I made clang emit all the ubsan check data to a different
>> section (and the type_mismatch data, specifically, to another), to
>> ease accounting (patch attached: 0001-*.patch, provided so anyone can
>> check it out. I don’t think it’s useful for it to be in clang).
>>
>> With an estimate of 48B per type_mismatch (SourceLocation (char* + 2x
>> u32) + TypeDescriptor ref (uptr) + Alignment (uptr) + unsigned char ==
>> 8+2*4+8+8+1 == 33, which will be upped to 48 (!!) with padding (on
>> x86-64 we’re aligning all structures to 16B)), we end up with (for
>> type_mismatch checks’ static data):
>>
>> Before:
>> clang: ~47.73 MiB
>> game:  ~59.24 MiB
>>
>> After 0003-*.patch:
>> clang: ~31.82 MiB (~67%)
>> game:  ~39.49 MiB (~67%)
>>
>>
>> (These numbers don’t match 793539*48 bytes because we end up eliding
>> some checks. The relative savings in the data section will be the
>> same, though)
>>
>>
>> I also have a patch for minimizing ubsan source location data
>> (parametrizable), which would apply to all checks, but with string
>> merging, we end up not saving that much, in the size of cstring/rodata
>> sections. This patch allows the user to drop the first N path
>> components or only keep the last N patch components from the source
>> locations emitted by clang.
>>
>> Before:
>> clang: ~1.28 MiB
>> game:  ~10.75 MiB
>>
>> After 0002-*.patch:
>> clang: ~1.19 MiB (~93%)
>> game:  ~8.46 MiB (~79%)
>>
>>
>>
>> Summing up:
>> I’m proposing two patches, one saves a lot more than the other in the
>> cases I’ve seen:
>>   - Make static check data for type_mismatch 7 bytes smaller. This
>> check is *by far* the most emitted one. And its static data (per check
>> handler call) is 48B (with padding). Minimised version is 32B (with
>> padding). (0003-*.patch)
>>   - Add -fstrip-path-components-from-checks=N (do bike-shed…), which
>> tells ubsan to strip the first N path components when emitting
>> filename information (or only emit the last N, if N is negative).
>> (0002-*.patch)
>>
>> The data minimizing patch might not be implementable while keeping
>> library compatibility (as in: being able to use objects compiled with
>> an older version of clang on a more recent ubsan library. We might be
>> able to come up with a heuristic, though), but yields 1/3 savings in
>> type_mismatch static data (not accounting for file names nor type
>> descriptors).
>> The file name minimizing patch is simple enough, and wouldn’t imply
>> any changes to the sanitizer libraries.
>>
>> I know these patches are missing tests, this email is an RFC, to know
>> if there is interest in having this upstream. Proper patches (updated
>> with comments and tests) will follow depending on the response.
>>
>> Thank you,
>>
>>   Filipe
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>