[PATCH] D22742: [WIP] Fix `-jobs=<N>` where <N> > 1 on macOS.

Tue Jul 26 11:35:28 PDT 2016

delcypher added a comment.

@kcc

It's been pointed out to me that my post might not have been particularly helpful so here is my second attempt at explaining the problem and possible solutions that I'm aware of.

The Problem
===========

On macOS the `-jobs=<N>` flag does not work as intended. Currently only one job will run at a time even if `<N>` is greater than 1 and the number of workers is great than 1. Why is it important to fix this? It's useful to take advantage of the multiple cores most machines have for fuzzing.

The root cause of this is calling `system()` from multiple threads. On macOS the implementation of `system()` contains a mutex that guards against multiple copies of `system()` running concurrently. This causes all worker threads apart from one to block when calling `ExecuteCommand()` because it calls `system()`.

You can see the implementation at https://github.com/unofficial-opensource-apple/Libc/blob/master/stdlib/FreeBSD/system.c which hopefully will make it clear what the problem is.

On Linux where Glibc is usually used, `system()`'s implementation is different. I was incorrect before when I said it was racey on Linux. I presumed that no lock was being used but provided Glibc is built with `_LIBC_REENTRANT` on (I'm not sure how to check if glibc was compiled with this. Any ideas?) then it looks like it is safe to call from multiple threads. You can see the implementation at

https://github.com/bminor/glibc/blob/2a69f853c03034c2e383e0f9c35b5402ce8b5473/sysdeps/posix/system.c

Possible solutions
==================

Here are a few ideas I have for solutions

- Provide a version of `ExecuteCommand()` that can be safely called and can be used concurrently for macOS that has the same semantics as `system()`. Basically this would involve implementing something that behaves similarly to glibc's `system()` for macOS. I think this would probably be the cleanest solution.
- Use the current patch I have now which reimplements `system()` but doesn't try to modify signal handlers but instead requires callers to change and restore the signal handlers. Although this works it leads to a very messy implementation so I don't think we should go with this approach.
- Don't use worker threads and `ExecuteCommand()` at all. Instead in the main thread just do fork and exec for the number workers we want and then do waitpid. When a child finishes if there are more jobs left we'd do another fork and exec. This would work but it means we leak implementation details (i.e. using fork, exec and waitpid) into other parts of LibFuzzer. This also happens to work because worker threads don't do anything complicated right now. If they were doing something more complicated then this single threaded strategy for spawning copies of LibFuzzer might not work.

What do you think? Do you have any ideas about how the problem should be fixed?

How to test for the bug
=======================

I've been thinking about ways to test for this bug. One way would be to start libfuzzer, sleep for a second or so and then check that the log files for the expected number of worker threads exist. This is fragile due to the sleep but I don't really know of a better way to test for this from lit. Does this seem reasonable to you?

https://reviews.llvm.org/D22742