[llvm-bugs] [Bug 27986] New: Address Sanitizer deadlocks when used by SCHED_FIFO threads on x86 (not x86-64) when affined to a single CPU

via llvm-bugs llvm-bugs at lists.llvm.org
Thu Jun 2 20:35:59 PDT 2016


https://llvm.org/bugs/show_bug.cgi?id=27986

            Bug ID: 27986
           Summary: Address Sanitizer deadlocks when used by SCHED_FIFO
                    threads on x86 (not x86-64) when affined to a single CPU
           Product: compiler-rt
           Version: 3.8
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: compiler-rt
          Assignee: unassignedbugs at nondot.org
          Reporter: nat1192 at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Created attachment 16457
  --> https://llvm.org/bugs/attachment.cgi?id=16457&action=edit
Simple example to reproduce the issue

Using Address Sanitizer can cause the program to deadlock on allocations when
the following conditions are met:

1. Run the application on 32-bit x86 Linux (I don't know whether a multilib
build compiled with -m32 on an x86-64 system would also reproduce this).
2. Have the threads in the application use the SCHED_FIFO scheduling policy.
3. Vary the priority of the threads.
4. Force all the threads in the application onto the same CPU. (A sketch of a
setup meeting these conditions follows.)
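For concreteness, here is a minimal sketch of a setup meeting those
conditions. This is written from the description above, not the attached
asan_fifo.cpp; the thread count, priorities, and loop body are assumptions:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for pthread_attr_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cstdio>

// Several SCHED_FIFO threads of differing priority, all pinned to CPU 0,
// allocating in a loop. Build with
// 'clang++ -m32 repro.cpp -o repro -fsanitize=address -pthread' and run
// as root.
static void *alloc_loop(void *arg) {
  long id = (long)arg;
  for (unsigned long i = 0;; ++i) {
    delete[] new int[64];          // exercise the ASan allocator
    if (i % 100000 == 0) {
      std::printf("Alive: thread %ld\n", id);
      usleep(1000);                // timed sleep, as in the scenario below
    }
  }
  return nullptr;
}

int main() {
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(0, &cpus);               // condition 4: everyone on one CPU

  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
  pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
  pthread_attr_setschedpolicy(&attr, SCHED_FIFO);  // condition 2

  pthread_t threads[4];
  for (long i = 0; i < 4; ++i) {
    sched_param sp;
    sp.sched_priority = 10 + (int)i;  // condition 3: varied priorities
    pthread_attr_setschedparam(&attr, &sp);
    pthread_create(&threads[i], &attr, alloc_loop, (void *)i);
  }
  for (pthread_t t : threads)
    pthread_join(t, nullptr);
}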

This appears to happen because SizeClassAllocator32 uses a spin lock to guard
some internal data. Spin locks interact badly with SCHED_FIFO threads,
especially when those threads cannot migrate to another CPU.
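For reference, the slow path of the sanitizers' spin lock behaves roughly like
the following paraphrase (a sketch in portable C++, not the actual compiler-rt
source; the spin threshold is illustrative):

#include <atomic>
#include <sched.h>

// Rough paraphrase of __sanitizer::StaticSpinMutex: spin briefly, then
// repeatedly sched_yield() until the owner releases the lock. Under
// SCHED_FIFO, sched_yield() only yields to threads of equal or higher
// priority, so a high-priority spinner never lets a lower-priority lock
// holder run.
class SpinMutex {
  std::atomic<unsigned char> state_{0};

 public:
  void Lock() {
    if (state_.exchange(1, std::memory_order_acquire) == 0)
      return;                     // fast path: lock was free
    LockSlow();
  }
  void Unlock() { state_.store(0, std::memory_order_release); }

 private:
  void LockSlow() {
    for (int i = 0;; ++i) {
      if (i >= 10)                // brief busy-spin first
        sched_yield();            // futile if the owner has lower priority
      if (state_.load(std::memory_order_relaxed) == 0 &&
          state_.exchange(1, std::memory_order_acquire) == 0)
        return;
    }
  }
};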

Take this hypothetical example:
1. Thread 1 has high priority, thread 2 has low priority.
2. Thread 1 goes to sleep on a timed sleep.
3. Thread 2 decides to allocate, so it takes the spin lock.
4. The kernel preempts thread 2 in order to run some SCHED_OTHER processes.
Thread 2 still holds the spin lock, as it was preempted before it finished.
5. While those other processes were running, thread 1 finished its timed sleep
(so it gets scheduled).
6. Thread 1 is running now and decides to allocate. It tries to take the spin
lock, but thread 2 still owns it.
7. Thread 1 tries sched_yield() after a while (as that's how the sanitizers'
spin lock is implemented). However, thread 1 still has higher priority than
thread 2, so the kernel immediately schedules it to run again.
8. "Deadlock" (really a livelock) has occurred: thread 1 keeps spinning on the
lock, and thread 2 can never run because it has lower priority than thread 1.

This can be seen a bit more clearly in a stack trace of the provided example
application. Once the program stops printing the "Alive" messages, you can
interrupt it in GDB and see these two threads (or something similar):

Thread 6 (Thread 0xab4feb40 (LWP 1520)):
#0  0xb7fdad91 in __kernel_vsyscall ()
#1  0xb7cdf217 in syscall () from /usr/lib/libc.so.6
#2  0x08118679 in __sanitizer::internal_sched_yield() ()
#3  0x0806466b in __sanitizer::StaticSpinMutex::LockSlow() ()
#4  0x08064834 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul,
__sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>,
__asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*,
__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul,
4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >*, unsigned
long) ()
#5  0x08064c32 in
__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul,
4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>
>::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul,
__sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>*, unsigned
long) ()
#6  0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long,
__sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#7  0x0806378b in __asan::asan_memalign(unsigned long, unsigned long,
__sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#8  0x0812f543 in operator new(unsigned int) ()
#9  0x08132205 in dumb_thread (arg=0xbffffa60) at asan_fifo.cpp:26
#10 0x0806e7bf in asan_thread_start(void*) ()
#11 0xb7de42f1 in start_thread () from /usr/lib/libpthread.so.0
#12 0xb7ce37ce in clone () from /usr/lib/libc.so.6

Thread 5 (Thread 0xabcffb40 (LWP 1519)):
#0  0x08064863 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul,
__sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>,
__asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*,
__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul,
4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >*, unsigned
long) ()
#1  0x08064c32 in
__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul,
4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>
>::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul,
__sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul,
__sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>*, unsigned
long) ()
#2  0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long,
__sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#3  0x0806378b in __asan::asan_memalign(unsigned long, unsigned long,
__sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#4  0x0812f543 in operator new(unsigned int) ()
#5  0x08132205 in dumb_thread (arg=0xbffffa5c) at asan_fifo.cpp:26
#6  0x0806e7bf in asan_thread_start(void*) ()
#7  0xb7de42f1 in start_thread () from /usr/lib/libpthread.so.0

Even if I allow the program to resume and then interrupt it again, these
threads don't appear to make any forward progress.

The fix (or at least one fix I can think of) is to not use spin locks, or at
the very least to have the spin lock fall back to a blocking lock after a
certain number of tries.
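As a hedged sketch of that second option (the class name and spin limit are
made up for illustration; this is not how compiler-rt is actually structured):

#include <pthread.h>

// Spin-then-block mutex: try to take the lock without blocking for a
// bounded number of iterations, then give up and block in the kernel.
// A blocked waiter is descheduled, so the kernel can finally run the
// lower-priority thread that holds the lock, breaking the SCHED_FIFO
// livelock described above.
class HybridMutex {
  pthread_mutex_t mu_ = PTHREAD_MUTEX_INITIALIZER;
  static const int kSpinLimit = 1000;  // illustrative threshold

 public:
  void Lock() {
    for (int i = 0; i < kSpinLimit; ++i)
      if (pthread_mutex_trylock(&mu_) == 0)
        return;                    // got the lock while spinning
    pthread_mutex_lock(&mu_);      // fall back to a blocking wait
  }
  void Unlock() { pthread_mutex_unlock(&mu_); }
};

A real fix could instead (or additionally) use a priority-inheritance mutex
(PTHREAD_PRIO_INHERIT), which temporarily boosts the lock holder's priority so
it can finish its critical section.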

Note that you need to run the provided example as root (to have permission to
create SCHED_FIFO threads), and running it will likely slow one CPU on your
system to a crawl, so I recommend running it in a VM. You might also have to
tweak some of the numbers to reproduce it on your system. After running for a
few seconds to a minute, you should see the 'Alive' messages stop. I compiled
and tested this in a 32-bit Arch Linux VM with both Clang 3.8 and GCC 6.1.1,
compiling with 'clang++ asan_fifo.cpp -o test -fsanitize=address -pthread'.

Also, I understand that the example is a bit convoluted. It's a slimmed-down
version of a much larger real-world application, in which this bug normally
takes several days of constant running to manifest.
