[llvm-bugs] [Bug 46825] New: clang DirectoryWatcherTest unittest hangs on some filesystems, but not others

via llvm-bugs llvm-bugs at lists.llvm.org
Thu Jul 23 09:58:35 PDT 2020


https://bugs.llvm.org/show_bug.cgi?id=46825

            Bug ID: 46825
           Summary: clang DirectoryWatcherTest unittest hangs on some
                    filesystems, but not others
           Product: Test Suite
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: lit
          Assignee: unassignedbugs at nondot.org
          Reporter: bfriesen at lbl.gov
                CC: daniel at zuster.org, llvm-bugs at lists.llvm.org

Greetings,

I am very close to having a working x86 flang buildbot - everything about the
bot's workflow succeeds except for a single lit test, 'DirectoryWatcherTest'.
This test reads and writes some temporary files/directories in TMPDIR.

On the system where this buildbot is running (a Cray XC40), I have observed the
following behaviors (all results tested on commit 724bf4ee23a of llvm-project):

On "login nodes", the test runs without any issues. TMPDIR is set to `/tmp`,
which is a ramdisk:

```
cori10:DirectoryWatcher> ./DirectoryWatcherTests 
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (7 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (8 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (8 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (29 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
[       OK ] DirectoryWatcherTest.DeleteWatchedDir (0 ms)
[ RUN      ] DirectoryWatcherTest.InvalidatedWatcher
[       OK ] DirectoryWatcherTest.InvalidatedWatcher (7 ms)
[ RUN      ] DirectoryWatcherTest.InvalidatedWatcherAsync
[       OK ] DirectoryWatcherTest.InvalidatedWatcherAsync (16 ms)
[----------] 8 tests from DirectoryWatcherTest (91 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test case ran. (91 ms total)
[  PASSED  ] 8 tests.
cori10:DirectoryWatcher> strace -f -e open ./DirectoryWatcherTests  2>&1|grep open
[pid 49990] open("/tmp/dirwatcher-adcb34/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-adcb34", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-adcb34/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49995] open("/tmp/dirwatcher-0cf5ec/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
open("/tmp/dirwatcher-0cf5ec", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-0cf5ec/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-ab6ab0/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-ab6ab0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-ab6ab0/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-f06c15/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-f06c15", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-f06c15/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-3204e2/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-3204e2", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-3204e2/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-390f43/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 49990] open("/tmp/dirwatcher-390f43/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
open("/tmp/dirwatcher-390f43", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 49990] open("/tmp/dirwatcher-34f29d/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 49990] open("/tmp/dirwatcher-34f29d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-34f29d/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 50018] open("/tmp/dirwatcher-f385d3/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
[pid 49990] open("/tmp/dirwatcher-f385d3", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-f385d3/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
```

However, the same test on an XC "compute node" hangs:

```
nid00049:DirectoryWatcher> ./DirectoryWatcherTests  
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (5 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (8 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (8 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (0 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:247: Failure
Value of: WaitForExpectedStateResult.wait_for(std::chrono::seconds(3)) == std::future_status::ready
  Actual: false
Expected: true
The expected result state wasn't reached before the time-out.
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:250: Failure
Value of: TestConsumer.result().hasValue()
  Actual: false
Expected: true
Expected initial events: 
Expected non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Expected but not seen non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
```

The program hangs on that last line and never returns. lldb shows the
following:

```
nid00049:DirectoryWatcher> lldb ./DirectoryWatcherTests 
(lldb) target create "./DirectoryWatcherTests"
Current executable set to './DirectoryWatcherTests' (x86_64).
(lldb) run
Process 19658 launched: '/path/to/build/tools/clang/unittests/DirectoryWatcher/DirectoryWatcherTests' (x86_64)
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (8 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (16 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (16 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:247: Failure
Value of: WaitForExpectedStateResult.wait_for(std::chrono::seconds(3)) == std::future_status::ready
  Actual: false
Expected: true
The expected result state wasn't reached before the time-out.
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:250: Failure
Value of: TestConsumer.result().hasValue()
  Actual: false
Expected: true
Expected initial events: 
Expected non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Expected but not seen non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Process 19658 stopped
* thread #1, name = 'DirectoryWatche', stop reason = signal SIGSTOP
    frame #0: 0x00002aaaaace1339 libpthread.so.0`__pthread_cond_destroy + 105
libpthread.so.0`__pthread_cond_destroy:
->  0x2aaaaace1339 <+105>: cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x2aaaaace133f <+111>: jbe    0x2aaaaace1320            ; <+80>
    0x2aaaaace1341 <+113>: leal   0xb(%rax), %ecx
    0x2aaaaace1344 <+116>: cmpl   $0xb, %ecx
(lldb) bt
error: DirectoryWatcherTests :: Class '_Alloc_hider' has a base class 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::allocator_type' which does not have a complete definition.
error: DirectoryWatcherTests :: Try compiling the source file with -fstandalone-debug.
* thread #1, name = 'DirectoryWatche', stop reason = signal SIGSTOP
  * frame #0: 0x00002aaaaace1339 libpthread.so.0`__pthread_cond_destroy + 105
    frame #1: 0x000000000040b666 DirectoryWatcherTests`(anonymous namespace)::VerifyingConsumer::~VerifyingConsumer(this=0x00007fffffff5f78) at DirectoryWatcherTest.cpp:100:8
    frame #2: 0x000000000040d62a DirectoryWatcherTests`DirectoryWatcherTest_DeleteWatchedDir_Test::TestBody(this=0x000000000082f8e0) at DirectoryWatcherTest.cpp:430:1
    frame #3: 0x000000000054a634 DirectoryWatcherTests`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=0x000000000082f8e0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2402:10
    frame #4: 0x0000000000537a82 DirectoryWatcherTests`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x000000000082f8e0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2455:12
    frame #5: 0x0000000000525a26 DirectoryWatcherTests`testing::Test::Run(this=0x000000000082f8e0) at gtest.cc:2474:5
    frame #6: 0x00000000005262bb DirectoryWatcherTests`testing::TestInfo::Run(this=0x000000000082dfc0) at gtest.cc:2656:11
    frame #7: 0x0000000000526854 DirectoryWatcherTests`testing::TestCase::Run(this=0x000000000082d660) at gtest.cc:2774:28
    frame #8: 0x000000000052be25 DirectoryWatcherTests`testing::internal::UnitTestImpl::RunAllTests(this=0x000000000082d1e0) at gtest.cc:4649:43
    frame #9: 0x000000000054dc34 DirectoryWatcherTests`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000000082d1e0, method=50 bb 52 00 00 00 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2402:10
    frame #10: 0x00000000005396e2 DirectoryWatcherTests`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000000082d1e0, method=50 bb 52 00 00 00 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2455:12
    frame #11: 0x000000000052bb2a DirectoryWatcherTests`testing::UnitTest::Run(this=0x000000000081a218) at gtest.cc:4257:10
    frame #12: 0x000000000051df31 DirectoryWatcherTests`RUN_ALL_TESTS() at gtest.h:2233:46
    frame #13: 0x000000000051df1a DirectoryWatcherTests`main(argc=1, argv=0x00007fffffff6608) at TestMain.cpp:50:10
    frame #14: 0x00002aaaabc5334a libc.so.6`__libc_start_main + 234
    frame #15: 0x000000000040a54a DirectoryWatcherTests`_start at start.S:120
```
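
Frames #0 and #1 place the hang in `__pthread_cond_destroy`, called from `~VerifyingConsumer`. The C++ standard requires that no thread be blocked on a `std::condition_variable` when it is destroyed, so one possible reading (an assumption, not checked against the test source) is that some thread is still waiting on the consumer's condition variable after the 3-second timeout has been reported. The following is a minimal sketch of a waiter that bounds the wait on the condition variable itself, so the waiting thread always returns before the object can go out of scope. `ResultWaiter`, `publish`, and `waitFor` are hypothetical names and are not taken from DirectoryWatcherTest.cpp.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a test consumer; this is not the class used
// by DirectoryWatcherTest.cpp.
class ResultWaiter {
public:
  // Called by the producer (for example a watcher callback) once the
  // expected state has been reached.
  void publish() {
    {
      std::lock_guard<std::mutex> Lock(Mtx);
      Ready = true;
    }
    Cond.notify_all();
  }

  // Blocks for at most Timeout. Returns true if publish() was called in
  // time and false on timeout. Either way the caller leaves the wait, so
  // it is no longer blocked on Cond when this object is destroyed.
  bool waitFor(std::chrono::milliseconds Timeout) {
    assert(Timeout.count() >= 0 && "timeout must not be negative");
    std::unique_lock<std::mutex> Lock(Mtx);
    return Cond.wait_for(Lock, Timeout, [this] { return Ready; });
  }

private:
  std::mutex Mtx;
  std::condition_variable Cond;
  bool Ready = false;
};
```

With this shape a missing event shows up as `waitFor(...)` returning false, which a test can report as an ordinary failure instead of blocking in the destructor.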

I have been told that the /tmp ramdisk on "compute nodes" is configured
slightly differently from the /tmp ramdisk on "login nodes", which may explain
the difference in behavior, but I don't know what the difference in
configuration actually is.
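
One way to start narrowing down that difference could be to ask the kernel which file system backs TMPDIR on each node type. The sketch below uses the Linux-specific `statfs(2)` call and the `TMPFS_MAGIC` and `RAMFS_MAGIC` constants from `<linux/magic.h>`; `describeFileSystem` is a hypothetical helper name. It only reports the file system type, not mount options, so it may not capture the whole configuration difference.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>
#include <linux/magic.h> // TMPFS_MAGIC, RAMFS_MAGIC
#include <sys/vfs.h>     // statfs(2), Linux-specific

// Returns a short description of the file system that holds Path, or a
// string starting with "error: " if statfs(2) fails. Only tmpfs and ramfs
// are named; any other file system is reported by its raw magic number.
std::string describeFileSystem(const std::string &Path) {
  struct statfs Info;
  if (statfs(Path.c_str(), &Info) != 0)
    return std::string("error: ") + std::strerror(errno);

  const unsigned long Magic = static_cast<unsigned long>(Info.f_type);
  if (Magic == TMPFS_MAGIC)
    return "tmpfs";
  if (Magic == RAMFS_MAGIC)
    return "ramfs";

  char Buf[32];
  std::snprintf(Buf, sizeof(Buf), "magic 0x%lx", Magic);
  return Buf;
}
```

Calling `describeFileSystem("/tmp")` on a login node and on a compute node and comparing the two strings would show whether both ramdisks are the same kind of file system.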

I also tried setting TMPDIR to different file systems, including Lustre and
GPFS. When TMPDIR is on a GPFS file system, the test succeeds on both kinds of
node (login and compute). When TMPDIR is on a Lustre file system, the test
hangs on both login nodes and compute nodes, in the same way as it does with
the /tmp ramdisk on compute nodes.

So to summarize:

/tmp ramdisk on XC login nodes: PASS
Lustre on XC login nodes: (hangs)
GPFS on XC login nodes: PASS

/tmp ramdisk on XC compute nodes: (hangs)
Lustre on XC compute nodes: (hangs)
GPFS on XC compute nodes: PASS
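
The matrix above could also be reproduced outside the unit test with a small probe. As far as I understand, the Linux DirectoryWatcher is built on inotify, and the sketch below assumes that the missing `WatchedDirRemoved` event corresponds to an `IN_DELETE_SELF` notification that never arrives; that mapping is an assumption. `probeDeleteSelf` and `ProbeResult` are hypothetical names.

```cpp
#include <cstring>
#include <string>
#include <vector>
#include <poll.h>
#include <stdlib.h> // mkdtemp
#include <sys/inotify.h>
#include <unistd.h>

enum class ProbeResult { Delivered, NotSeen, Error };

// Creates a scratch directory under BaseDir, watches it with inotify,
// removes it, and reports whether IN_DELETE_SELF shows up in the first
// batch of events read within TimeoutMs milliseconds.
ProbeResult probeDeleteSelf(const std::string &BaseDir, int TimeoutMs) {
  const std::string Template = BaseDir + "/inotify-probe-XXXXXX";
  // mkdtemp rewrites its argument in place, so copy the template into a
  // mutable, null-terminated buffer.
  std::vector<char> Dir(Template.begin(), Template.end());
  Dir.push_back('\0');
  if (mkdtemp(Dir.data()) == nullptr)
    return ProbeResult::Error;

  const int Fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
  if (Fd < 0) {
    rmdir(Dir.data());
    return ProbeResult::Error;
  }
  if (inotify_add_watch(Fd, Dir.data(), IN_DELETE_SELF | IN_ONLYDIR) < 0 ||
      rmdir(Dir.data()) != 0) {
    close(Fd);
    rmdir(Dir.data()); // harmless if the directory is already gone
    return ProbeResult::Error;
  }

  ProbeResult Result = ProbeResult::NotSeen;
  struct pollfd Pfd = {Fd, POLLIN, 0};
  alignas(struct inotify_event) char Events[4096];
  if (poll(&Pfd, 1, TimeoutMs) > 0) {
    const ssize_t Len = read(Fd, Events, sizeof(Events));
    const ssize_t Header = static_cast<ssize_t>(sizeof(struct inotify_event));
    for (ssize_t Off = 0; Off + Header <= Len;) {
      struct inotify_event Ev;
      std::memcpy(&Ev, Events + Off, sizeof(Ev));
      if (Ev.mask & IN_DELETE_SELF)
        Result = ProbeResult::Delivered;
      Off += Header + static_cast<ssize_t>(Ev.len);
    }
  }
  close(Fd);
  return Result;
}
```

Running `probeDeleteSelf` with the base directory set to each TMPDIR candidate (ramdisk, Lustre, GPFS) on both node types would show whether the PASS and hang entries line up with `Delivered` and `NotSeen`.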

Any ideas how to make this test less sensitive to the kind of file system it's
running on?

Thanks.
