[llvm] [LLVM] add LZMA for compression/decompression (PR #83297)

David Blaikie via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 5 12:11:52 PST 2024


dwblaikie wrote:

> > Excuse the outlandish suggestion, but given:
> > > I do get the part that multiple GPU variants give us a lot of redundancy in the data to compress away.
> > 
> > 
> > Is there any chance of some sort of domain-specific compression,
> 
> The key domain-specific quirk we can exploit here is that we produce N very similar blobs (same code, with minor differences due to GPU-specific intrinsics, etc.) There's nothing particularly interesting about the individual blobs.
> 
> > especially that would be more resilient to the size of the kernels? (seems like increasing the compression level increases the compression window size, which has some cliff/break points for kernels of certain sizes, which seems unfortunately non-general - like it'd be nice to not have to push the compression algorithm so hard for smaller kernels, and it'd be nice if larger kernels could still be deduplicated)
> 
> One way to achieve that would be to interleave GPU blobs. Instead of `AAAAABBBBBCCCCC`, pack them as `ABCABCABCABC`. This way the compression window requirement will be reduced to cover only a slice, not the whole blob.

I was thinking something even more domain specific (like an actual domain specific compression scheme - not that it wouldn't be able to be further compressed by something generic - but encoding the data with less duplication to start with) - but I don't know enough about the structure/contents of these kernels to know what that'd look like. If I were speculating rampantly - maybe some kind of macro scheme to describe the architectural differences, that could be quickly stripped out when the arch specific version was needed on-device. (wonder if it'd be feasible to even compile for multiple targets simultaneously - keeping these differences in conditional blocks - rather than redundantly generating all the kernels then trying to figure out their commonalities/merge them again)

But I realize this is all quite out of my depth and you folks who work on this stuff probably already know what's feasible or not here.

> Increasing compression window while keeping the rest of parameters at a lower compression level may work, too. At least on my experiments `zstd -9 --zstd=wlog=25` does not seem to affect compression time much. It still works much faster than `zstd -20`.

That sounds pretty promising (though perhaps still interesting to know how much the window size helps/hurts compared to the distribution of kernel sizes? Like do we have population data about kernel sizes? Does wlog=25 cover the 90% case? Is the population widely distributed, or fairly tightly clustered? Is it increasing over time, such that today wlog=25 is 90%, but in a year or two it'll be only 50%?)
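For illustration, the interleaving idea quoted above can be sketched in a few lines of Python. This uses zlib's fixed 32 KB window as a stand-in for a window-limited compressor (the blob contents, the mutation pattern simulating per-GPU differences, and the 4 KB chunk size are all made-up assumptions, not anything from the actual offload tooling):

```python
import os
import zlib

CHUNK = 4096  # interleaving granularity (arbitrary choice)

def mutate(data: bytes, step: int) -> bytes:
    """Simulate a GPU-specific variant: the same blob with sparse byte changes."""
    buf = bytearray(data)
    for i in range(0, len(buf), step):
        buf[i] ^= 0xFF
    return bytes(buf)

def interleave(blobs, chunk=CHUNK):
    """Pack AAAAABBBBBCCCCC as ABCABCABC..., chunk by chunk."""
    out = bytearray()
    for off in range(0, max(len(b) for b in blobs), chunk):
        for b in blobs:
            out += b[off:off + chunk]
    return bytes(out)

base = os.urandom(100_000)  # stand-in for one compiled kernel blob
blobs = [base, mutate(base, 997), mutate(base, 1009)]  # N similar GPU blobs

concat = b"".join(blobs)
inter = interleave(blobs)

# With the concatenated layout, the matching data from the previous blob is
# ~100 KB away, outside zlib's 32 KB window, so it cannot be referenced.
# After interleaving, the matching chunk is only a few KB away.
c_concat = zlib.compress(concat, 9)
c_inter = zlib.compress(inter, 9)
print(len(c_concat), len(c_inter))  # interleaved should compress far better
```

The same effect is what raising zstd's wlog achieves without reordering: it grows the window until the previous blob is reachable, whereas interleaving shrinks the distance so even a small window suffices.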

https://github.com/llvm/llvm-project/pull/83297
