[Openmp-commits] [PATCH] D108046: tsan: Add new annotation functions to promote all accesses to atomic

Sat Aug 14 03:52:15 PDT 2021

protze.joachim added a comment.

In D108046#2944952 <https://reviews.llvm.org/D108046#2944952>, @dvyukov wrote:

>> Semantically I could also annotate Mutex semantics. But, from my understanding this introduces HappensBefore edges into the analysis, where I don't want to have them. They would hide even more data races than what I try to solve.
>
> I see. This makes sense.
>
> Re performance, what matters more now is performance of the new runtime that is slowly being upstreamed in pieces:
> https://github.com/dvyukov/llvm-project/pull/3

This is interesting. I'm curious to see whether the new runtime is similarly sensitive to contentious shared read accesses by the application.
With the current runtime I observed 100-1000x runtime overhead, if all threads on a node concurrently read the same data (like a vector in vector-matrix multiplication).

> This change itself can be ported to the new runtime, but it still adds instructions to fast paths. And more important it turns what's being compile-time constant into a non-compile time constant (we heavily rely on inlining so kAtomic is known to compiler to be 0 statically, so any branches are eliminates and any derived values are known). I think it will have more negative impact on the new runtime we also trace atomic flag:
>
>   ev->isAtomic = !!(typ & kAccessAtomic);
>
> and use it to create vector consts:
>
>   const m128 access_read_atomic = _mm_set1_epi32((typ & (kAccessRead | kAccessAtomic)) << 30);
>
> if kAccessAtomic is not a const, compiler will need to create several globals and load one of the other. Also for non-atomic writes this is 0, so clearing the register.
> I also afraid of long-term effect. Say if we add "if (kAtomic)" (some slow path for atomics only, which does not affect normal accesses at all), with this change it becomes problematic (not only it adds a branch, it also affects register allocation, code layout, etc).
>
> So I would like to consider any alternative as much as possible first.

I see. So, I'll do some further experiments. Casting and treating the memory accesses like atomics came just up as an idea yesterday.
In fact we are looking for a solution, that would work with gcc compiled code in the same way to cover Fortran codes as well. So, we prefer to not modify the compiler.

> I see that OpenMP reductions are always tied to a single variable (?):
> https://www.openmp.org/spec-html/5.0/openmpsu107.html

The OpenMP reduction clause can take a list of variables ("one or more list items") and always reduces the thread-local copies of the variables into a single copy of the variables. Besides the pre-defined reduction operations, the application can also define own reduction operations.
A user-defined reduction could be maxloc(struct{double,size_t}) to find the maximum value and it's index. Each thread would first find the local maximum and the index, and the reduction just combines the results of all threads. The OpenMP runtime takes care of the synchronization between the reduction steps, but again, this synchronization should not impact the analysis of the application.

> If at least the modified variables are known, then I think it's possible to do the following: ignore all reduction accesses as it is done now + emit additional (fake) tsan atomic store annotations for the target variables/memory locations.

For a portable implementation, libarcher performs the HB annotations and ignoreWrite annotations in OpenMP tool callbacks defined in the OpenMP standard. At this point, we don't know the memory range of the reduction variables.
The semantic of the reduction callbacks is "all reduction-related memory accesses are performed between reduction-begin/end callbacks". These callbacks do not even reflect the synchronization semantic and can be outside of the critical path, i.e., outside of the locked region.

> I think it should have the same effect w/o long-term tsan runtime tax. Unfortunately tsan atomic hooks do the actual atomic operation as well, but for starters you can try __atomic_fetch_add(0).
> Will this work? Do I miss anything?

I agree that it should not impact the critical path for the majority of cases.
Another possible use case that I see for this kind of annotation is an heuristic "lockset" analysis approach: Assume that no pair of memory accesses under lock has a race, but don't derive HB from the locking. This would allow to find data races currently hidden by the HB introduced from locked regions.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D108046/new/

https://reviews.llvm.org/D108046