[llvm] [Support] Add decodeULEB128AndInc/decodeSLEB128AndInc (PR #85739)

Thu Mar 28 11:53:14 PDT 2024

dwblaikie wrote:

> I agree that some LEB128 uses do pay less attention about potential buffer overrun.
> 
> Like the one mentioned in https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=SECURITY.txt
> 
> > Compiling untrusted sources can result in arbitrary code execution and unconstrained resource consumption in the compiler. As a result, compilation of such code should be done inside a sandboxed environment to ensure that it does not compromise the host environment.
> 
> Many of LLVM tools might prioritize efficiency over bounds checking (E.g. I believe that in a lot of llvm/lib/Object/ places do not check the bounds.) 

I don't think that's what they're doing - my understanding from the Apple folks (& part of the reason @lhames invented the `llvm::Error` stuff) is that they're quite interested in having, and have invested in it quite a bit, libObject being robust to corrupted/arbitrary inputs - for better investigation into problematic object files.

If they lack bounds checking it's because they were written pretty casually to begin with - which is often the case, and unfortunate (I can say the LLVM debug info APIs are similarly pretty casual). Not out of a performance concern (performance without correctness is... not usually what anyone wants) but just because folks weren't thinking about/prioritizing bad input robustness.

> We might not have the bandwidth to address every potential LEB128 decode call site with explicit `end` arguments or by switching entirely to `DataExtractor`.

I'm not suggesting we can/will/should prioritize paying down all the places we would crash on untrusted input - I think we might've done something differently if we'd thought about it from day 1 (adding fuzzers, etc) - much easier/cheaper to do on new code than to justify the cost of doing the cleanup after the fact.

But that's why I'm concerned about creating APIs that are hard to use safely. The previous/existing APIs that provide optional bounds checking are already pretty easy to misuse (they don't really push folks to being careful with these things). I think it's important to lower the cost of writing bounds safe code - and having APIs that don't provide a way to do it safely seems higher cost, because someone might reach for it & see no error handling and decide it's not worth it.

(this API feels close to `gets` (which, admittedly, is maximally unsafe - you can't control its input - but that's /usually/ the case with most of these uleb readings - except these sort of special cases where the input is built into the code/file, like tblgen))

https://github.com/llvm/llvm-project/pull/85739