[PATCH] D54889: Fiber support for thread sanitizer

Mon Dec 10 04:48:10 PST 2018

yuri marked 3 inline comments as done.
yuri added a comment.

Hi Dmitry,
I will try to answer your questions

In D54889#1316889 <https://reviews.llvm.org/D54889#1316889>, @dvyukov wrote:

> Some high-level comments:
>
> 1. Do we want to synchronize fibers on switch? If fibers use TLS data, or cooperative scheduling (e.g. mutex unlock directly schedules the next critical section owner), or pass some data from prev fiber to next fiber, in all these cases not synchronizing fibers on switch will lead to false positives. As far as I remember in my old experiments I concluded that it's almost impossible to get rid of false positives it tasks/fibers are not synchronized on switch.

In case of single-threaded scheduler, it is worth to synchronize fibers on each switch. Currently I do it in client code using __tsan_release()/__tsan_acquire(), but it is possible to add flag for __tsan_switch_to_fiber() that will do it.

In case of multithreaded scheduler, client probably has own sychronization primitives (for example, custom mutex), and it is client's responsibility to call corresponding TSAN functions.

> 2. This introduces slowdown into the hottest runtime functions (branch and indirection in cur_thread). I think check_analyze.sh test will fail. Kinda hard to say, because it's broken already at the moment, but this perturbs runtime functions enough:

I have checked parts of the patch separately and found:

- Part of patch related to HACKY_CALL does not affect code of critical functions and benchmark results
- Part of patch related to proc() does not affect code of critical functions and benchmark results
- The only part that affects critical functions is implementation cur_thread(). I tried to minimize its effect in new version of a patch.

> 3. I think instead of __tsan_get_current_fiber we could use pointer 0. It prevents things like switching to context of another thread, which won't end well. And just simplifies API surface.

For us possibility to use context of another thread is required feature. We use it to call into library which use fibers from external finber-unaware code.
It is on-par with Windows API for fibers - ConvertThreadToFiber().

> 4. What if a fiber calls pthread_exit?

I think, it's enough to add check and abort until someone will really need support for it

> 5. Can we combine __tsan_set_fiber_name with __tsan_create_fiber?

It will not solve our requirements. It is convenient to set fiber name from the fiber itself and even change it sometimes, analogously to pthread_setname_np().

> 6. I think it's better to add flags argument to __tsan_create_fiber/__tsan_switch_to_fiber to not end up with __tsan_create_fiber2 and __tsan_create_fiber3.

OK

> 7. Do we want to support nested fibers/tasks? Does GCD have nested tasks? I think TBB/OpenMP does. Won't begin/end fiber/task be a more generic API for this? E.g. fiber switching can be expressed as stop running the prev fiber (which switches back to the physical thread context), then start running new fiber.

GCD has no nested task. GCD tasks can start other tasks, but can't switch or wait.

For me current API looks more generic then what you suggest.
Also, switch to physical thread context may be problematic in some case (see answer to question 3).

> 8. How does this play with setjmp/longjmp? In particular with fiber switching based on setjmp/longjmp? jmp_bufs is part of ThreadState, won't we always try to do longjmp from a wrong ThreadState in such case?

My idea is when setjmp/longjmp is used for fiber switch, client should disable interceptors before call and enable them after it. This procedure can be simplified by flag of __tsan_switch_to_fiber().

> 9. Looking at 1, 2, 3, 7 and 8, it seems that we may want to switch some smaller part (subset) of ThreadState. E.g. introduce TaskContext that contains only the part that we need to switch on fiber switches and then have: ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (current fiber context) or: ThreadState -> TaskContext (of physical thread, always linked) -> TaskContext (of a task) -> TaskContext (of a nested task) E.g. clock can stay in ThreadState (if we want to synchronize them) which will solve the performance problem, jmp_bufs can stay which will solve part of setjmp problem. This may resolve the hacky_call problem as well.

I prefer to do it:

- There is already similar separation ThreadState/Process. If something is not needed in ThreadState, it can be moved to Process.
- In case of single-threaded scheduler the main thing to switch is shadow stack. In case of multithreaded scheduler nearly everything should be switched. I do not think it is good idea to do two different implementations for these cases.
- I tried to be on-par with Go sanitizer, which uses ThreadState for each goroutine and single Process per physical thread

================
Comment at: lib/tsan/rtl/tsan_rtl_thread.cc:62
   // Can't increment epoch w/o writing to the trace as well.
-  TraceAddEvent(args->thr, args->thr->fast_state, EventTypeMop, 0);
+  TraceAddEvent(args->thr, args->thr->fast_state, EventTypeMop, 0, false);
   ReleaseImpl(args->thr, 0, &sync);
----------------
dvyukov wrote:
> args->thr is parent (current) thread (?), so why can't we use hacky call?
You are right, only TraceAddEvent() call inside OnStarted() is a problem

================
Comment at: lib/tsan/rtl/tsan_rtl_thread.cc:115
   // TraceAddEvent will reset stack0/mset0 in the new part for us.
-  TraceAddEvent(thr, thr->fast_state, EventTypeMop, 0);
+  TraceAddEvent(thr, thr->fast_state, EventTypeMop, 0, false);

----------------
dvyukov wrote:
> I think we should execute start on the fiber itself. This is what we do for normal threads. This should resolve the hacky call problem (right?) and also prevent fibers from breaking in future.
> We can do this either by briefly switching to the fiber context and back when we create a new fiber. Or, start it lazily on first switch.
Start indeed can be executed on the first call to __tsan_switch_to_fiber(), but this solution is not suitable for destroying fiber, because it is normal for fiber to cycle in infinite loop and be destroyed externally.
Do you think it is safe to temporary switch to fiber context for calling ThreadStart and/or ThreadFinish without actually switch to fiber stack?

================
Comment at: lib/tsan/rtl/tsan_rtl_thread.cc:138
     // Can't increment epoch w/o writing to the trace as well.
-    TraceAddEvent(thr, thr->fast_state, EventTypeMop, 0);
+    TraceAddEvent(thr, thr->fast_state, EventTypeMop, 0, false);
     ReleaseImpl(thr, 0, &sync);
----------------
dvyukov wrote:
> Fibers created detached. How does it happen that fibers affect this?
OK, I missed this

Repository:
  rCRT Compiler Runtime

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D54889/new/

https://reviews.llvm.org/D54889