[llvm] [llvm][Support][Memory] Add memfd based fallback for strict W^X Linux systems (PR #98538)

Fri Oct 11 07:48:09 PDT 2024

minipli-oss wrote:

> > [...]
> 
> I'm suggesting use of a `SimpleSegmentAlloc` for users who need executable code and are currently using `sys::Memory` directly.
> 
> Support for the memfd approach should be restricted to a `JITLinkMemoryManager` or `Mapper`.

I'll comment on that separately, need to hack some more code to convince myself either way.

> > > > [...]
> > 
> > There are several other issues with dual mappings. The probably obvious one is that one has to keep them in sync (what gets written to one mapping needs to be read back when executed from the other). The easiest way to achieve this is to created `MAP_SHARED` file-based mappings. The file can also be a memfd which simply gets mapped twice. However, using `MAP_SHARED` has the unfortunate side-effect that these mappings will not only be kept in-sync for the creating process but also for forked child processes which is actually bad as a child can tamper with mappings in the parent (`MFD_CLOEXEC` for the memfd won't help, as the mapping has a reference to the underlying file that won't be invalidated).
> 
> What do you mean by "tamper with mappings in the parent"? If you just mean the writes to shared memory in the child will be visible in the parent process I would consider that desirable behavior. The existing SharedMemoryMapper already works this way.

No, it'll be a security hole, no less. Take the below example which implements the dual-mapping scheme. It populates the mapping with some _shellcode_ which simply calls a function. The code gets initialized to call the function `good()`. Afterwards the program forks a child process that drops privileges and modifies _its_ mapping to call `bad()` instead. But as the mapping is shared with the parent process, it'll call `bad()` now too.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <err.h>

static pid_t parent_pid;

#define msg(fmt,...) \
    printf("%6s (uid: %d): " fmt, getpid() == parent_pid ? "parent" : "child", \
           geteuid(), ## __VA_ARGS__)

static void good(void) {
    msg("all good...\n");
}

static void bad(void) {
    msg("!!! PWNED !!!\n");
}

static void call_code(const void *code) {
    msg("calling func@%p...\n", code);
    ((void (*)(void))code)();
}

static int pwn(void *code, long *fptr) {
    call_code(code);

    if (mprotect(code, 0x1000, PROT_READ | PROT_WRITE))
        err(1, "mprotect(rw-)");

    msg("patching code at %p...\n", fptr);
    *fptr = (long)bad;

    if (mprotect(code, 0x1000, PROT_READ | PROT_EXEC))
        err(1, "mprotect(r-x)");

    call_code(code);

    return 0;
}

int main(int argc, char *argv[]) {
    static char shellcode[] = {
#if defined(__x86_64__)
        // movabs $0, %rax
        0x48, 0xb8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        // jmp *%rax
        0xff, 0xe0
#elif defined(__i386__)
        // mov $0, %eax
        0xb8, 0x00, 0x00, 0x00, 0x00,
        // jmp *%eax
        0xff, 0xe0
#else
#error Unsupported architecture!
#endif
    };
    static const int sc_len = sizeof(shellcode);
    unsigned int fptr_offset = sizeof(long) == 8 ? 2 : 1;
    long *fptr = (long *)(shellcode + fptr_offset);
    enum { DUAL_MAP, DIRECT } mode = DIRECT;
    void *addr;
    int flags;
    int fd;

    // make shellcode[] call good()
    *fptr = (long)good;

    parent_pid = getpid();
    if (argc > 1 && strcmp(argv[1], "-d") == 0)
        mode = DUAL_MAP;

    fd = memfd_create("code", MFD_CLOEXEC);
    if (fd < 0)
        err(1, "memfd_create()");

    if (mode == DUAL_MAP) {
        // need to size file in advance
        if (ftruncate(fd, sizeof(shellcode)) < 0)
            err(1, "ftruncate()");

        flags = MAP_SHARED;
        addr = mmap(NULL, sc_len, PROT_READ | PROT_WRITE, flags, fd, 0);
        if (addr == MAP_FAILED)
            err(1, "mmap(rw-)");
        msg("created writable mapping at %p\n", addr);

        memcpy(addr, shellcode, sc_len);
    } else {
        flags = MAP_PRIVATE | MAP_POPULATE;

        if (write(fd, shellcode, sc_len) != sc_len)
            err(1, "write(shellcode)");
    }

    addr = mmap(NULL, sc_len, PROT_READ | PROT_EXEC, flags, fd, 0);
    if (addr == MAP_FAILED)
        err(1, "mmap(r-x)");
    msg("created %s executable mapping at %p\n",
        (flags & MAP_SHARED) ? "shared": "private", addr);

    call_code(addr);

    switch (fork()) {
    case -1: err(1, "fork()");
    case  0: setuid(getuid()); exit(pwn(addr, addr + fptr_offset));
    default: wait(NULL);
    }

    call_code(addr);

    return 0;
}
```

Compiling the code and making sure it has the SUID-bit set:

```sh-session
minipli at x1:~/tmp$ gcc -O2 -o shared_exec shared_exec.c
minipli at x1:~/tmp$ sudo chown 0:0 shared_exec
minipli at x1:~/tmp$ sudo chmod u+s shared_exec
minipli at x1:~/tmp$ ls -l shared_exec
-rwsr-xr-x 1 root root 16848 Okt 11 12:00 shared_exec
```

Calling the program with `-d` makes it use the dual-mapping scheme which **requires** using shared mappings to keep them in-sync. However, it also has the unfortunate side effect of sharing these among child processes, so we get the following behavior:

```sh-session
minipli at x1:~/tmp$ ./shared_exec -d
parent (uid: 0): created writable mapping at 0x7c0a706f8000
parent (uid: 0): created shared executable mapping at 0x7c0a706f7000
parent (uid: 0): calling func at 0x7c0a706f7000...
parent (uid: 0): all good...
 child (uid: 1000): calling func at 0x7c0a706f7000...
 child (uid: 1000): all good...
 child (uid: 1000): patching code at 0x7c0a706f7002...
 child (uid: 1000): calling func at 0x7c0a706f7000...
 child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func at 0x7c0a706f7000...
parent (uid: 0): !!! PWNED !!!
```

The parent process initially calls `good()` through the code mapping. The forked child process modifies the code to call `bad()` instead, which can be seen by the last message printed by the (unprivileged) child process. However, as it is a shared mapping, the parent process will now call `bad()` too. A clear breach of privilege boundaries.

Running the program without extra arguments makes it use private mappings instead -- like I'm proposing:

```sh-session
minipli at x1:~/tmp$ ./shared_exec 
parent (uid: 0): created private executable mapping at 0x76bbc26ad000
parent (uid: 0): calling func at 0x76bbc26ad000...
parent (uid: 0): all good...
 child (uid: 1000): calling func at 0x76bbc26ad000...
 child (uid: 1000): all good...
 child (uid: 1000): patching code at 0x76bbc26ad002...
 child (uid: 1000): calling func at 0x76bbc26ad000...
 child (uid: 1000): !!! PWNED !!!
parent (uid: 0): calling func at 0x76bbc26ad000...
parent (uid: 0): all good...
```

Even though the code mapping was modified in the child process to call `bad()`, the parent's mapping was left as-is and is still calling `good()`. And this should very much be the _by-default_ behavior to avoid the otherwise security hole potential.

> 
> > [...] There's only a need to have a `RW-` → `R-X` transition -- no need to modify once generated and _finalized_ code later on. That simply vanishes the need to have two mappings as only one is used at a given time, allowing to simply replace the writable mapping with an executable one _in place_.
> 
> Being able to rewrite code opens up the possibility of improving performance. E.g. rewriting jumps to bypass stubs and go directly to function bodies once the function has been complied.

As long as all of these rewrites happen before the code gets used, i.e. before `finalize()`, that should be fine. The lazy evaluation of symbols should help in this regard, as an executable mapping, strictly speaking, is only needed on first use.
This also mirrors what I've observed during my tests -- executable mappings only created on the first use of a JIT'ed function.

> 
> ORC's memory manager interface was designed to be flexible enough to accommodate all of these options. Keeping dual RW- / R-X mappings as an option would be a really good outcome here.

No, not for the security hazard it creates. I can already hear all the CVEs rattling that this will cause. We can't do that -- not without users of that API explicitly opting-in for such a behavior.

> 
> > Using a dual-mapping approach also requires more syscalls, slowing down JIT code generation further. As the mapping has to be file-based, the size needs to be set in advance (unlike in my current version where it's implicitly updated by simply calling `write()` on the memfd).
> 
> I'm suggesting that it should be reasonably easy to make the dual-mapping optional, and that some clients will be willing to pay that cost for the combination of security and runtime flexibility that it offers.
> 

Given the security implications, I'm not sure we should provide that option at all -- given, that the status quo doesn't have it either.

> > A dual-mapping scheme therefore provides no benefits over the replacement-mapping approach, which would even better mirror the _visible semantics_ of the current `mprotect()` based implementation -- simply change the protection bits for a given virtual address range.
> 
> I'd say that the current semantics of Memory aren't especially helpful, given that LLVM's JIT aims to work transparently out-of-process: The working assumption is that the address range being written isn't the same as the one being executed. That's why implementing this as a mapper makes it so much more useful.
> 
> > I therefore would vote against a dual-mapping approach and keep the replacement-mapping scheme, reusing the same VA range.
> 
> It's possible that there's something that I'm missing, but so far I haven't heard a compelling reason not to do this with a `JITLinkMemoryManager` or `Mapper`, and there are several benefits to doing it at that level.

I need to refresh my memory here. I've tried to go that route before but reverted back to the much simpler solution which will cover all use cases. There were some road blocks but I can't remember them right now. I'll reevaluate and comment back on that part once I've all the details back.

https://github.com/llvm/llvm-project/pull/98538