[Lldb-commits] [lldb] [lldb] Multithreading lldb-server works on Windows now; fixed gdb port mapping (PR #100670)

Mon Jul 29 02:09:19 PDT 2024

================
@@ -324,6 +324,18 @@ Status PipePosix::ReadWithTimeout(void *buf, size_t size,
         bytes_read += result;
         if (bytes_read == size || result == 0)
           break;
+
+        // This is the workaround for the following bug in Linux multithreading
+        // select() https://bugzilla.kernel.org/show_bug.cgi?id=546
+        // ReadWithTimeout() with a non-zero timeout is used only to
+        // read the port number from the gdbserver pipe
+        // in GDBRemoteCommunication::StartDebugserverProcess().
+        // The port number may be "1024\0".."65535\0".
+        if (timeout.count() > 0 && size == 6 && bytes_read == 5 &&
+            static_cast<char *>(buf)[4] == '\0') {
+          break;
+        }
----------------
labath wrote:

I'm sorry, but I find this extremely hard to believe. Pipes and the ability to read them until EOF are as old as UNIX. We're not doing anything special here. It's going to take a lot more than a reference to a 20 year old `REJECTED INVALID` bug to convince me this is a kernel issue.

Also note that the situation mentioned in that bug is different from what (I think) you're describing here. In their case, the pipe is NOT closed on the *other* side. The pipe is closed on the side that's doing the `select`ing:

```
Consider a multithreaded program (pthreads). One thread reads from
a file descriptor (typically a TCP or UDP socket). At some time,
another thread decides to close(2) the file descriptor while the
first thread is blocked, waiting for input data.
```

In other words, this is your basic race in the application code, and it's the applications (ours) responsibility to fix it. While I'm not a kernel maintainer, I think I have a pretty good idea why they said the application is buggy, and why they didn't want to fix it -- it's because fixing it probably will not make the application correct.

The problem there is that the application has calls (`select` (or friends) and `close`) on two threads with no synchronization between them. Now if select happens to run first, then it's not completely unreasonable to expect that `close` will terminate that `select`, and the kernel could in theory make sure it does that (apparently, some operating systems do just that). The problem is what happens if `select` does not run first. What if `close` does ? In this case, `select` will return an error (as the bug reporter expects), but ***only*** if the FD hasn't been reused in the mean time. And since linux always assigns the lowest FD available (I think POSIX mandates that), it's very likely that the very next operation (perhaps on a third thread) which creates an fd will get the same FD as we've just closed. If that happens, then the select will NOT return an error (the kernel has no way to know that it's referring to the old fd) and will happily start listening on the new fd. Since you usually aren't able to control all operations that could possibly create a new FD, this kind of pattern would be buggy except in extremely limited circumstances.

https://github.com/llvm/llvm-project/pull/100670