Or you can use the open + O_PATH + *at syscall family which can be used to get a handle on a resolved directory relative to which you can manipulate with no re-traversal happening for different operations on that handle.
This combo exists exactly to avoid these kinds of issues.
Another way would be temporarily joining the container's mount namespace to obtain the source handle. But that can't really be done in Go, since goroutines are multiplexed across OS threads and don't play nicely with per-thread operations like setns(2).
Edit: After looking through the Go standard library it seems that there is an impedance mismatch. Go just does not expose the necessary pieces to do this properly. A dedicated docker-cp tool in C or Rust could probably handle this better. I could be wrong though; maybe it's just not part of the stdlib.
Just using O_PATH won't completely save you. O_PATH can definitely help, especially if you actually check that the path is what you think through readlink(/proc/self/fd/$n), but doing it safely is quite a bit harder than just using *at(2). As mentioned in TFA, I am working on some kernel patches (which will probably result in an openat2(2)) which allow for restriction of path resolution so you can block symlink crossings entirely or force symlinks to resolve to the dirfd you passed.
Yeah, I have read about the precursors of those patches. But this has been simmering for a long time.
The point is that one could do this correctly today with the right rituals. openat2 wouldn't save you if you were still doing a plain realpath+open across a security boundary even though openat+procfs or setns are available.
The new syscall implementation would still need a fallback impl. for older kernels after all.
> The new syscall implementation would still need a fallback impl. for older kernels after all.
Yup, this is why I'm planning on getting a sane API available in <https://github.com/cyphar/filepath-securejoin> which projects can use so that the correct thing is done with both old and new kernels. Right now it has a (slightly) improved version of the code Docker has, but I'm rewriting it.
It should be noted that there are lots of examples of interfaces which are incredibly hard to make safe without openat2 -- such as O_CREAT.
It would probably require a redesign of how docker interacts with the host filesystem though, and obviously relies on something that isn't yet in the mainline Linux.
It's certainly possible to call nsenter in Go, but even so, unlike other namespaces, joining a mount namespace affects the entire process rather than a single thread, and the process must not have spawned any other threads yet at the time setns(2) is called (the kernel refuses it for multithreaded processes, since threads share their filesystem attributes).
That said, docker does support `docker exec` which gives you a shell inside the container's namespaces and cgroups. I'd imagine they could do something similar and just not call exec once they've entered the container. This would be similar to calling `docker exec $containerid cat /path/to/file`
The problem is that doing this for every file operation becomes ludicrously expensive -- in Go you cannot use vfork and the only (safe) fork primitive is fork+exec. You could have a surrogate process in the container but then the code is significantly more complex (all of the operations need to be fully implemented in the surrogate process and you'll need to figure out a safe way of doing IPC without the container being able to attack it).
You also definitely wouldn't want to be running container runtime code inside all of the namespaces in the container -- this is hard to do safely and should be avoided (there are at least two CVEs related to the joining phase of just runc found in the past few years).
So, `docker exec <container id> cat /path/to/file` is definitely something that works, is as safe as `docker exec`, and doesn't involve running any runtime code inside the container. The expense is that you have cat buffering the data and then the runtime reading the data from a pipe and re-buffering that data. Is that really ludicrously expensive? It doesn't seem any worse than SCP, which is just SSH executing the SCP binary as the command and communicating over the stdin/stdout channels.
Doing it with cat is just an example, since cat may not be present inside the container. Instead of cat, and in the absence of fork-without-exec in go, you could execute a small C program injected similarly to `/dev/init` (or via memfd_create) that uses sendfd over a unix socket created by socketpair to pass back a file descriptor to the runtime so that the data is only buffered once.
A privileged user in the container could ptrace(2) the process and start messing with its output. If you have an IPC protocol (like the sendfd you bring up later) then you've now opened windows for container runtime attacks. Yes, you could double (and triple) check the fd is not malicious but so many programs already don't do this properly -- depending on it being done properly is not an ideal solution.
So, you don't want to join all the namespaces, only the mount namespace. But if you join the mount namespace then MS_MOVE might start messing with your resolution. So really, what you want to do is to just pivot_root(2) into the container's rootfs -- this is what "docker cp" already does today, and it has a bug because it pivot_root(2)s into an attacker-controlled path. And all of these solutions will require a fork+exec with Go.
> Is that really ludicrously expensive?
If you're doing it once for every single filesystem syscall (every stat(2), readlink(2), open(2), getdents(2), and so on) then yes, it gets very expensive. As I mentioned, the other solution is to run entire operations inside a pivot_root(2), but then your runtime code gets more complicated (Docker already does this and it's already hard enough to understand how things like chrootarchive actually work at the end of the day). Not to mention that (as this bug shows) even if you run inside a pivot_root(2) you might make a mistake that entirely negates the security benefits.
To be clear, docker exec is not currently vulnerable to this attack, right?
Currently, docker exec (somehow safely) fully enters the container's namespaces and cgroups, then exec's a command inside the container. My suggestion was basically to have a statically compiled C binary that executes in the fully untrusted context, which things can ptrace and manipulate all they want. The thought was that the C binary would open the file descriptor from inside the untrusted context so that it is incapable of doing anything privileged and then send the file descriptor back over the inherited unix socket via sendfd. I'd imagine the only way this could be vulnerable is if sendfd is vulnerable somehow since this means 100% of the path resolution happens from a fully isolated context.
The performance argument makes plenty of sense, but it sounds like it'd be solvable by just doing a classic tar pipe where tar (or similar) is running in the fully untrusted context and writing its output to a pipe (with no unix sockets involved). You'd just need to get that statically compiled tar binary into the container, similar to how `/dev/init` is done. Would this be unreasonable? `kubectl cp` is already doing an actual tar pipe via docker exec; the missing bit is that it fails if tar does not exist inside the container, so you'd need to inject it in. This would fully remove the complexity of chrootarchive and any path checking, and you'd be able to rely entirely on the security constraints of docker exec.
My point wasn't that docker exec is vulnerable; it's that if you were to write a script like:
% docker exec ctr tar cf - some/path | tar xf -
it would be vulnerable to attack, because the container process could ptrace(2) the tar process and then change its output to cause the host-side tar to start overwriting host files.
My point is that you have to be careful if you're designing a system where the security depends on running a process inside the container and trusting its output -- ideally you wouldn't run any process inside the container at all so that an attacking process can have no impact on it. And that's true even if we assume that the "tar" is not a binary that the container has direct control of.
This concern also applies if you aren't using tar and are instead using some other technique. Hell, even if you don't have an IPC system you could still probably attack a process that is just trying to do a single open(2) call -- you just need to make it pass a different fd.
My argument is that the kernel gives us namespaces, seccomp, selinux, apparmor, etc for isolation and attempting to implement all of the path resolution and permission checking from a privileged context outside of the container defeats all of that and requires reimplementing all of those guards from userspace, which feels futile. By using tar, you're left with serialized path strings and file contents rather than file descriptors, and it should be far easier to sanitize those strings than deal with the linux filesystem API.
I definitely recognize that the container process could ptrace the tar process, and with kubectl cp, it's even directly using whatever tar binary is in the container, so tar could easily be malicious from the start. But what it can never do is break out of the container onto the node, so long as the tar stream is not extracted onto the node with the docker daemon's privileges, which is extremely important for multi-tenant environments.
If you executed your example command as root on the node, then yes, a vulnerability in the node's tar implementation could allow a malicious tar file to take over the node at extraction time, but tar does guard against this by default, as do standard POSIX user permissions: the tar extraction can happen in a completely unprivileged context.
I do view tar's extraction as a valid attack surface since modern tar implementations are complex, however, that would require a tar CVE and there's no reason that `docker cp`'s output target handling is any less vulnerable to the same problems. I really think the most important thing to guard against is at input time.
"kubectl cp" has had security bugs in the past[1] that are very in-line with what I just outlined (I didn't know this beforehand -- but I would've guessed it was vulnerable if they hadn't seen this issue before). In fact the fix in [1] doesn't look entirely complete to me -- it seems to me you could further mess with the output.
I agree that we should use security in depth (and all of those kernel facilities are great), but actually joining the container itself is not a good idea -- you need to treat it as the enemy. I am not in favour of implementing them all in userspace, this is why I'm working on new kernel facilities to restrict path resolution.
You don't necessarily need IPC. You only have to spawn a single-threaded child process that can call setns to switch back and forth between mount namespaces. You can also obtain file descriptors in batches to reduce the overhead.
I wonder how an O_PATH handle from a different namespace behaves once you switch back. If *at lookups are performed under its original namespace you only have to obtain it once.
O_PATH operations are performed under the original namespace, and you don't even need to join the mount namespace -- you could just do an O_PATH of /proc/$pid/root which pipes you directly to the root of the mount namespace.
We came up with this idea before on LKML and I'm trying to remember what the issue with this solution was. It's definitely better than the current method by Docker (and actually, I might be able to get this to work within Docker more simply). It wouldn't work with rootless containers, but you could fix that somewhat trivially by switching to the userns of the container's pid1. Since /proc/self doesn't exist, obvious attacks through that such as /proc/self/root won't work.
There is a somewhat esoteric problem, which is that you could trick the process into opening an O_PATH stashed by a bad container process -- but right now there is a pretty big flaw in how the kernel handles this problem anyway that I'm also trying to solve upstream.
Sorry, I only just remembered what the issue with this solution was. Absolute symlink components in the path will "escape" the root. So it actually doesn't help overall -- you still need to verify after you've done the open. My kernel patchset for RESOLVE_IN_ROOT will scope absolute paths too -- so you definitely could combine it with /proc/$pid/root but that's not enough by itself.
Ok, so getting files from the root O_PATH without verification is not enough. But that still leaves joining the mount namespace to obtain individual fds, which can be done safely and in bulk.
But that starting point has to be initially checked somehow. Say we are user "bob" and we open "/foo/bar/xyzzy" (for later use with openat); if, say, "bar" is writable by "mallory", then that isn't a path we can trust in the first place.
O_PATH is Linux-specific (added back in kernel 2.6.39); can't find it in POSIX. Won't compile on non-Linux POSIX operating systems or very old kernels. And it's more than an optimization: an O_PATH descriptor can't be used for reads or writes at all and can be obtained without read permission on the target, whereas omitting it just gets a "fatter" file descriptor that takes more cycles to set up and carries more authority.
You can perform the verification after having obtained the dirfd. E.g. by doing a reverse lookup through procfs to see where it actually points. Or by walking the .., making sure you never leave the container root.
It'll certainly be easier to get right with the newly proposed syscalls. But you can also get it right with the current ones.
> O_PATH is a Linux Kernel 4.x thing; can't find it in POSIX.
I don't think that is relevant to docker, which relies on many Linux-specific APIs anyway.
I think another way might be to have a set of file descriptors you know correspond to safe paths before doing anything, as you say (they could be dirfds which let you traverse downwards). Then when the `docker cp` command is executed, it does not open any new files, it simply looks up the corresponding fd from a table mapping paths to fds. If the file no longer exists, or it doesn't have a fd matching that path, then error out. The construction could be based on where you're copying to/from on the host system.
That way, even if the symlink somehow gets resolved to a bad path, it will refuse to read it because it does not exist from the point of view of the `docker cp` command. I.e. make the command use a "principle of least authority" where it cannot even see files outside of a certain set of paths, let alone be tricked into opening them by a symlink.