[compiler-rt] [scudo] Update header without read-modify-write operation (PR #66955)

Wed Sep 20 21:20:42 PDT 2023

ChiaHungDuan wrote:

> This is risky.
> 
> This is a race detection mechanism where we detect if 2 chunks are modified concurrently. There is a window of time where 2 threads can free the same chunk, both headers will be read, pass the check and written as a free chunk header. If we do not detect this, then the free chunk will be added to 2 caches, and further badness will ensue.
> 
> I agree this is an expensive check, and I initially considered both options (cmpxchg vs read-store). If you look for "race on chunk header at address" in Google, you'll find stuff like [Instabug/Instabug-Android#435](https://github.com/Instabug/Instabug-Android/issues/435). So those races do happen in the wild, and while it would be complicated to exploit, I wouldn't think it to be impossible.
> 
> If you do want to remove this - at the cost of security - you might want to remove reportHeaderRace as well though.

Thanks, we really want to get your input.

We have measured cmpxchg, xchg and atomic-store. The former two show the same performance overhead (especially on platforms with more cores). We also simulated the behavior and the success rate is about 1 success from 2 millions attempts on x86_64 machine with 96 cores. Like you said, it's not impossible and seems not long to hit the case. 

I tried to study some double-free attacks and they usually try to override the metadata in the freelist. In Scudo, the way it stores the freelist is quite different from those cases. That gives me the thought that maybe the trade-off is acceptable. Besides, attacking on this may have lower value than compromise the checksum. When the checksum is compromised, then forfeit a double-free case will be easier. Even though, I'm still uncertain.

Another thought is to slightly harden the double-free protection. We can check the header status at allocation (still with atomic-load/atomic-store) as well. Even though we may only be able to check if it's allocated, but the benefit of this is that we rely on the mutex while accessing primary allocator and don't need to introduce additional synchronization. We still miss the case of accessing cache only but I guess it's a light enhancement. 

What do you think?

https://github.com/llvm/llvm-project/pull/66955