Seccomp filter incorrectly tries to act on strings, allowing sandbox bypass
I am filing this on behalf of cypherpunks who cannot use our trac because of its javascript requirements for posting some stuff.
Currently, the Tor seccomp sandbox attempts to whitelist internal strings. This is a bug, because seccomp cannot whitelist memory contents, only register contents (i.e. the pointer itself). That is, when you call open(path, mode), the first argument is actually an address, such as 0x368a227dec3. When you use the seccomp filter to whitelist this, you are not whitelisting the string, only the pointer to the string. The only reason it works is that compilers deduplicate identical strings, so two strings that are the same (e.g. the string given to seccomp for whitelisting, and the string being given to a syscall) point to the same location in .rodata.
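To make the pointer-versus-contents distinction concrete, here is a minimal sketch in plain C that models the comparison an argument rule performs. The function names and the paths are hypothetical stand-ins, not part of Tor or libseccomp; the only assumption is that the filter sees the raw register value and nothing else.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of an argument rule such as
 * SCMP_A0(SCMP_CMP_EQ, (scmp_datum_t)allowed): the filter is handed the
 * raw register value, so all it can compare is the address. */
int filter_matches(const char *allowed, const char *arg)
{
    return (uintptr_t)allowed == (uintptr_t)arg;
}

/* Returns 1 if a heap copy with identical bytes is rejected by the model. */
int identical_copy_is_rejected(void)
{
    const char *allowed = "/etc/tor/torrc";   /* illustrative path */
    size_t len = strlen(allowed) + 1;
    char *copy = malloc(len);
    assert(copy != NULL);
    memcpy(copy, allowed, len);

    /* Same contents, different address: the model says "no match". */
    int rejected = strcmp(allowed, copy) == 0 && !filter_matches(allowed, copy);
    free(copy);
    return rejected;
}

/* Returns 1 if a buffer still matches after its contents are replaced. */
int rewritten_buffer_is_accepted(void)
{
    char buf[32] = "/etc/tor/torrc";
    const char *allowed = buf;       /* address captured when the rule is built */
    strcpy(buf, "/etc/shadow");      /* contents change, address does not */
    return strcmp(buf, "/etc/shadow") == 0 && filter_matches(allowed, buf);
}
```

Both helpers return 1 under this model: a byte-identical string at another address is refused, while a different string at the whitelisted address is let through.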
A demonstration showing deduplication:
$ cat hello.c
#include <stdio.h>
int main(int argc, char *argv[])
{
char *s1 = "Hello, world!";
char *s2 = "Hello, world!";
char *s3 = argv[1];
printf("%p\t%s\n%p\t%s\n%p\t%s\n", s1, s1, s2, s2, s3, s3);
}
$ gcc hello.c
$ ./a.out "Hello, world!"
0x64a4b457b4 Hello, world!
0x64a4b457b4 Hello, world!
0x3eb7d22b37d Hello, world!
$ readelf -x .rodata a.out
Hex dump of section '.rodata':
0x000007b0 01000200 48656c6c 6f2c2077 6f726c64 ....Hello, world
0x000007c0 21002570 0925730a 25700925 730a2570 !.%p.%s.%p.%s.%p
0x000007d0 0925730a 00 .%s..
Despite the strings being identical in all cases, only the first two, which were defined internally, point to the same data. The third, which is supplied at runtime, lives at a different address. This means that hardcoding a string in a seccomp filter may seem to work but provides no protection: all an attacker has to do is write their own string to the whitelisted address (changing the permissions of .rodata first if it is stored there).
A demonstration which shows how a whitelisted string can be bypassed:
$ cat seccomp.c
#include <seccomp.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
int main(void)
{
char *s = malloc(14);
memcpy(s, "Hello, world!\n", 14);
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_TRAP);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
SCMP_A1(SCMP_CMP_EQ, (scmp_datum_t)s));
seccomp_load(ctx);
write(1, s, 14);
memcpy(s, "Hallo, world!\n", 14);
write(1, s, 14);
}
$ gcc seccomp.c -lseccomp
$ ./a.out
Hello, world!
Hallo, world!
See "Why does seccomp-filter is not enough?" [sic] from https://lwn.net/Articles/698226/, which explains why an LSM (currently in development) is required to provide seccomp with the ability to filter paths and other memory-resident values:
A seccomp filter can access to raw syscall arguments which means that it is not possible to filter according to pointed data as a file path. As demonstrated the first version of this patch series, filtering at the syscall level is complicated (e.g. need to take care of race conditions). This is mainly because the access control checkpoints of the kernel are not at this high-level but more underneath, at LSM hooks level. The LSM hooks are designed to handle this kind of checks. This series use this approach to leverage the ability of unprivileged users to limit themselves.
It may seem like it would be enough to ensure the buffers are read-only and disallow mprotect() on that address. Unfortunately, other syscalls like brk() can bypass this. In the end, it's extremely difficult to ensure that a memory region cannot be written to by malicious code without disabling virtually all memory-related syscalls.
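As a rough illustration of why marking the buffer read-only is not enough on its own, the sketch below protects a page and then simply asks for write access back. The function name and the two paths are hypothetical; it covers only the mprotect() case and assumes no filter is blocking that call.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a buffer that was made read-only could be made writable
 * again and overwritten in place, 0 if not, -1 on setup failure. */
int readonly_can_be_undone(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    strcpy(buf, "/allowed/path");            /* illustrative whitelisted string */

    /* "Protect" the whitelisted string. */
    if (mprotect(buf, len, PROT_READ) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* Code running inside the process can request write access again
     * unless the filter also blocks mprotect() for this address. */
    if (mprotect(buf, len, PROT_READ | PROT_WRITE) != 0) {
        munmap(buf, len);
        return 0;
    }
    strcpy(buf, "/other/path");              /* same address, new contents */

    int changed = strcmp(buf, "/other/path") == 0;
    munmap(buf, len);
    return changed;
}
```

The address never changes, so a pointer-based rule built before the second mprotect() would still match afterwards.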
The entirety of src/common/sandbox.c:sandbox_intern_string is poorly thought out and relies on incorrect and dangerous assumptions. As a result, there are so far six syscalls which devs believe to be filtered but whose filters can in fact be bypassed:
- chown - pathname argument
- chmod - pathname argument
- open - pathname argument
- openat - pathname argument
- rename - oldpath and newpath arguments
- stat64 - pathname argument
The solution is to use the syscalls before enabling the sandbox, or to whitelist them (excluding arguments in memory) and eventually blacklist them in a stage 2 sandbox.
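A minimal sketch of the first option, opening the file before the filter is loaded, might look like the following. To avoid a library dependency it installs a raw BPF filter through prctl() instead of using libseccomp as the demonstrations above do. The names enter_sandbox and sandboxed_read are hypothetical, and the filter is deliberately tiny (read, write, exit_group only, no architecture check), so treat it as an illustration rather than a proposed patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Allow only read, write and exit_group; kill on anything else. No rule
 * inspects a pointer argument. A real filter should also check
 * seccomp_data.arch. */
static int enter_sandbox(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Forks a child that opens `path` either before (open_first = 1) or after
 * (open_first = 0) entering the sandbox, then copies the data to a pipe.
 * Returns the bytes received, or -1 if the child did not exit cleanly. */
int sandboxed_read(const char *path, int open_first, char *out, size_t cap)
{
    int pfd[2];
    if (pipe(pfd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    if (pid == 0) {
        char buf[256];
        int fd = -1;
        close(pfd[0]);                        /* done before the filter is live */
        if (open_first)
            fd = open(path, O_RDONLY);        /* path is used while still trusted */
        if (enter_sandbox() != 0)
            _exit(2);
        if (!open_first)
            fd = open(path, O_RDONLY);        /* not whitelisted: child is killed */
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0 && write(pfd[1], buf, (size_t)n) != n)
            _exit(1);
        _exit(n < 0 ? 1 : 0);
    }

    close(pfd[1]);
    ssize_t got = read(pfd[0], out, cap);
    close(pfd[0]);

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    assert(WIFEXITED(status) || WIFSIGNALED(status));
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return (int)got;
}
```

With open_first set, the sandboxed code only ever handles a file descriptor, so there is no path string for the filter to check or for an attacker to rewrite. With open_first cleared, the late open() is not on the whitelist and the child is terminated.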