
For anyone who hasn't seen this before: there is a good gVisor Architecture Guide that explains how this works via a few diagrams [1]. There's lots more info on these pages too [2, 3].

> gVisor intercepts application system calls and acts as the guest kernel, without the need for translation through virtualized hardware. gVisor may be thought of as either a merged guest kernel and VMM, or as seccomp on steroids. This architecture allows it to provide a flexible resource footprint (i.e. one based on threads and memory mappings, not fixed guest physical resources) while also lowering the fixed costs of virtualization. However, this comes at the price of reduced application compatibility and higher per-system call overhead.

From what I understand, it's basically a user-space program that wraps your container and intercepts all system calls. You can then allow/deny/re-wire them (based on a config), so you have pretty much complete control over what your apps can do.

This, for me, is sort of the key takeaway from the blog post too: "because we use gVisor to increase the security of Google's own internal workloads, it continuously benefits from our expertise and experience running containers at scale in a security-first environment". So, Google's using something like this internally too for their own workloads, which should be a pretty good sign this works in real life.

[1] https://gvisor.dev/docs/architecture_guide/

[2] https://github.com/google/gvisor

[3] https://gvisor.dev/



> From what I understand, basically a user-space program that wraps your container and intercepts all system calls. You can then allow/deny/re-wire them (based on a config).

gVisor actually intercepts and implements the system calls in the user-space kernel. Two specific goals of gVisor are that (1) system calls are never simply allowed and passed through to the host kernel, and (2) you don't need to write a policy configuration for your application; just put your application inside gVisor and go. These are significant differences over simply using something like seccomp on its own (what the architecture guide calls "Rule-based execution").
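
To make the interception idea concrete, here's a toy syscall tracer in Go using ptrace (Linux/amd64 only). This is not gVisor's code, and gVisor's ptrace platform is far more involved; the sketch only shows the point at which a user-space kernel sees each guest syscall, where a real sandbox would emulate the call instead of letting it through:

    // Toy syscall tracer (Linux/amd64). Not gVisor's code; it just shows
    // where a ptrace-based sandbox sees each guest syscall.
    // Run as: go run trace.go /bin/true
    package main

    import (
        "fmt"
        "os"
        "os/exec"
        "runtime"
        "syscall"
    )

    func main() {
        // All ptrace calls must come from the same OS thread.
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()

        cmd := exec.Command(os.Args[1], os.Args[2:]...)
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        pid := cmd.Process.Pid

        // The child stops with SIGTRAP at its first instruction after execve.
        var ws syscall.WaitStatus
        if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
            panic(err)
        }

        var regs syscall.PtraceRegs
        for {
            // Resume the child until its next syscall entry or exit.
            if err := syscall.PtraceSyscall(pid, 0); err != nil {
                panic(err)
            }
            if _, err := syscall.Wait4(pid, &ws, 0, nil); err != nil {
                panic(err)
            }
            if ws.Exited() {
                return
            }
            if err := syscall.PtraceGetRegs(pid, &regs); err != nil {
                panic(err)
            }
            // A real sandbox would emulate the call here instead of letting
            // it proceed to the host kernel. (This toy stops at both syscall
            // entry and exit, so each call prints twice.)
            fmt.Printf("syscall %d\n", regs.Orig_rax)
        }
    }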

Some of this is covered in our security model: https://gvisor.dev/docs/architecture_guide/security/#princip...


Reimplementing system calls is non-trivial, especially ones that have complex interactions with others (for example, the system calls related to process management). How do you prevent errors when translating this, and how do you implement features that ostensibly require calls to the OS anyways?


For sure, implementing Linux is no easy task, and there is no magic bullet. For compatibility testing, we have extensive system call unit tests [1] and also run many open source test suites. Language runtime tests (e.g., Python, Go, etc.) are particularly useful. We also perform continuous fuzzing with Syzkaller [2].

> how do you implement features that ostensibly require calls to the OS anyways?

gVisor's kernel is a user-space program, so it can and does make system calls to the host OS. Some examples:

* An application blocks trying to read(2) from a pipe. gVisor ultimately implements blocking by waiting on a Go channel. The Go runtime will ultimately implement this with a futex(2) call to the host OS. (There's a sketch of this pattern after the list.)

* An application reads from a file that is ultimately backed by a file on the host (provided by the Gofer [3]). This will result in a pread(2) system call to the host.
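
For the first example, here's a stripped-down sketch of that blocking pattern (my code, not gVisor's): the reader parks on a channel until a writer delivers data, and on Linux the Go runtime implements that parking with futex(2).

    // Sketch of blocking-by-channel (not gVisor's actual code). A reader
    // parks on a channel like read(2) parks on an empty pipe; the Go
    // runtime implements the wait with futex(2) on Linux.
    package main

    import "fmt"

    type pipe struct {
        data chan []byte // buffered queue of pending writes
    }

    func newPipe() *pipe { return &pipe{data: make(chan []byte, 16)} }

    // Read blocks until a writer sends something. The goroutine is simply
    // descheduled; no host thread busy-waits.
    func (p *pipe) Read() []byte { return <-p.data }

    func (p *pipe) Write(b []byte) { p.data <- b }

    func main() {
        p := newPipe()
        go p.Write([]byte("hello"))
        fmt.Printf("%s\n", p.Read())
    }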

The purpose here isn't to avoid the host completely (that's not possible), but to limit exposure to the host. gVisor can implement all the parts of Linux it does on a much smaller subset of host system calls. Anything we don't use is blocked by a second-level seccomp sandbox around the kernel; e.g., the kernel cannot make obscure system calls, or even open files or create sockets on the host (those operations are controlled by an external agent).
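
For flavor, here is a minimal sketch of what such a second-level allowlist can look like, written against the third-party libseccomp-golang bindings (gVisor ships its own seccomp package, and its real allowlist is much longer; a full Go program needs more runtime syscalls than shown here).

    // Minimal seccomp allowlist sketch using libseccomp-golang (a
    // third-party binding; gVisor uses its own seccomp package).
    // Anything not on the list fails with EPERM instead of killing the
    // process, so the effect is observable.
    package main

    import (
        "fmt"
        "syscall"

        seccomp "github.com/seccomp/libseccomp-golang"
    )

    func main() {
        filter, err := seccomp.NewFilter(
            seccomp.ActErrno.SetReturnCode(int16(syscall.EPERM)))
        if err != nil {
            panic(err)
        }
        // Allow only a handful of calls. NOTE: a real Go program needs a
        // longer list (clone, sigaltstack, etc.); this is illustrative and
        // may crash the runtime if it strays off this list.
        for _, name := range []string{
            "read", "write", "futex", "mmap", "munmap",
            "rt_sigreturn", "sched_yield", "exit", "exit_group",
        } {
            sc, err := seccomp.GetSyscallFromName(name)
            if err != nil {
                panic(err)
            }
            if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
                panic(err)
            }
        }
        if err := filter.Load(); err != nil {
            panic(err)
        }

        // The open(2)/openat(2) family is not on the list, so this fails
        // with EPERM: the process can no longer open host files.
        _, err = syscall.Open("/etc/hostname", syscall.O_RDONLY, 0)
        fmt.Println("open after filter:", err)
    }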

[1] https://github.com/google/gvisor/tree/master/test/syscalls/l...

[2] https://github.com/google/syzkaller

[3] https://gvisor.dev/docs/architecture_guide/overview/


How is this different from a nicer UI over a seccomp filter for your container?


Awesome, thanks. I need to dig into this a little and just run a few demos / labs. This makes sense though. I really like your comments on this thread too (https://news.ycombinator.com/item?id=16976392).



It's basically the same thing as Wine - Wine provides the Windows API and implements it using Linux syscalls. gVisor implements the Linux API using Linux syscalls, but with an extra authorization layer. I think people are just so gung ho about VMs that they forgot this was possible and easy (I did).

This is also similar to what Microsoft is doing in Windows with the WSL. This is another example of how we're really just in a big technology cycle. Dynamically typed -> statically typed -> dynamically typed; bare metal -> API wrappers -> VMs -> containers -> API wrappers. Soon we'll probably be back to bare metal.


There are web hosts that offer Raspberry Pis, but they tend to be more expensive than VMs. I'm guessing colocation costs are dominant.


> So, Google's using something like this internally too for their own workloads

A public example of this is Cloud Run [1, 2].

[1] https://news.ycombinator.com/item?id=19616832

[2] https://cloud.google.com/run/docs/reference/container-contra...


Ah, cool, thanks. I didn't know they were running that under the hood. Yeah, I've checked out Cloud Run via a screencast I did on it a few weeks back [1]. I really like the concept and am looking forward to seeing the evolution of it!

[1] https://sysadmincasts.com/episodes/69-cloud-run-with-knative


The newest generation of AppEngine runs on this as well. In fact Cloud Run and 2nd Gen GAE are exactly the same under the hood, afaik. It allowed Google to ditch the custom APIs and toolchains they forced apps to use in order to keep their infra secure. Fun fact: Cloud Run and GAE both run code in Google's main search clusters, rather than their separate Google Cloud infra.


> Cloud Run and GAE both run code in Google's main search clusters, rather than their separate Google Cloud infra

What's the reasoning behind this?


Cloud Run/App Engine PM

Run and GAE run directly on Borg (which is the shared infrastructure that underpins all Google services, including Cloud products), rather than on VMs.

Search/Ads/Maps/etc. run on Borg as well, but there's significant isolation between all those products.


That's the "what", but what's the "why"? Why run these in the main Borg cluster, rather than running them in the (separate, if I'm understanding you) Borg cluster that GCP uses as its substrate?

Is it that the GCP Borg cluster is just big enough for GCP's control-plane, and then the rest of GCP is all Borg-less VM hypervisor boxes (running ESXi or what-have-you), so these gVisor-on-Borg workloads wouldn't have anywhere to "live" in the GCP cluster?

If that is the issue, then I would have (naively) expected the solution to that to be adding a second, GCP-scale data-plane Borg cluster per zone, just for client workloads; rather than inviting these client workloads to co-mingle with Google's own workloads in the non-GCP part of the DC.


Isolation is often done in software. Google has invested a lot of effort in making sure the distinct services they run on the same machine, e.g. YouTube transcoding alongside search, don't interfere with each other, whether through CPU constraints or other priority levels. These are features of Borg.

https://scholar.google.com/scholar?lr&ie=UTF-8&oe=UTF-8&q=La...


I know nothing about the decisions behind where Cloud Run and GAE run, but even customer GCE VMs run on top of Borg, not just the control plane. GAE predates most or all of GCP, and there weren't separate GCP clusters when it got launched.

(Used to work for Google including the GCP team, but haven't worked for them for over 4 years and I'm not speaking for them now. I'm reasonably sure this is all already public info.)


Likely just due to the age of AppEngine vs. all other GCP products.


Used to work on 2nd gen AppEngine.

I helped ship the runtimes!

Yes, this is due to age. GAE and Run depend on pieces of infrastructure going back a long time.


My introduction to gVisor was the talk by Emma Haruka Iwao at InfoQ NY 2018: https://www.youtube.com/watch?v=Ur0hbW_K66s

I learned of the talk because Bryan Cantrill referenced it during his very deep dive into operating systems, C, and Rust given later at the same event: https://www.youtube.com/watch?v=HgtRAbE1nBM


Isn't this the point of seccomp on Linux and pledge on OpenBSD (and others, I'm sure; these are just the two I'm more familiar with), but without this much overhead? Also, based on this quote in the post: "There’s a saying among security experts: containers do not contain", I'd be interested to know how Solaris/illumos Zones and FreeBSD Jails compare.


> re: seccomp

This thread has a good answer: https://news.ycombinator.com/item?id=16976392

> "containers do not contain"

Is sort of troll bait. They do contain. That is why everyone is using them. Sure, there will be exploits to break out of them, just like with VMs, and even CPU bugs now.

Here is a good example of someone who broke out of a container on the play-with-docker.com site using a custom kernel module [1]. This allowed a container escape but you could say this was a bug since that wasn't the intent. So, you'd patch it. So, I get the joke in that people are extremely creative and will find ways around everything.

[1] https://www.cyberark.com/threat-research-blog/how-i-hacked-p...


That's fair, but at the same time: If the end-state is "containers should contain, they're secure, any insecurities are bugs" then why do we see so many defense-in-depth strategies like gVisor pop up which provide legitimate value to consumers?

At what point are we just reinventing the VM hypervisor, but worse because every single one of these systems already has a VM hypervisor running somewhere? It seems likely to me that in the not-so-distant future the "Container" terminology won't actually mean anything because we'll figure out the engineering difficulty behind merging the best parts of VMs with the best parts of Containers, and managed systems like Fargate or even GKE don't really need both a VM hypervisor and a Container hypervisor when they're so similar.


gVisor is a special kind of hypervisor, basically - it has a production-ready KVM backend.

The main difficulties with VM-backend containers are storage passthrough and memory overcommit.


Memory overcommit is addressed by virtio memory ballooning (https://www.linux-kvm.org/page/Projects/auto-ballooning). Even OpenBSD supports this as both guest and host.

For storage, there's already virtio block devices, not to mention PCI passthrough. But if you mean direct file system access, virtio-fs (https://virtio-fs.gitlab.io/) is just about ready to roll.

There's still the issue that you're running an entire extra kernel. Not sure that's much slower than using Go; it's probably faster if what was described about bouncing on futexes elsethread is true.

gVisor sounds like the kind of solution that makes sense for Google but not something that would survive in the wider community. The concept sounds great, but using Go sounds horrible, though I'm sure Go made prototyping the concept super simple--specifically goroutines reify execution flow in a nice way, but so would stackful coroutines in C or even Rust, which is easy to implement if you don't need to worry about deep recursion.


One big problem with KVM based VMs that gVisor fixes is that KVM is a (complex) piece of host kernel software. There have been many security incidents in the past related to KVM and there will be more for sure. With gVisor the "virtualization logic" runs purely in user space (and may itself be further isolated, like any other regular user space process, within the host environment). This means that any bugs in gVisor will, at most, impact the isolation unit where it runs in the host space, as opposed to KVM where bugs in KVM would impact the entire machine (including other customer workloads on that machine).

The non-security issues you listed, specialized interfaces that let I/O bypass the generic hardware virtualization layer, are IMO hacks (even the name "para-virtualization" given to such mechanisms should be a tell). Because it would be too inefficient to push the I/O we actually need to be fast (network and storage) through the overall machine virtualization interface, we poke specialized holes in that interface that carry requests and replies between guest and host more directly/efficiently. As a software engineer, that seems like a hack to me. When something like gVisor comes along, which provides much better security for the host environment and handles syscall-level I/O quickly by design, I much prefer that approach over a VM.

The drawback of gVisor is similar to Wine's: having to be bug-for-bug compatible with the ABI it supports (Linux x64 in this case). Unlike Wine, though, the Linux ABI surface is extremely small compared to what Wine has to reimplement to run even the simplest Windows applications, and, most of all, with gVisor there's direct access to the source code of the ABI it needs to implement, which makes development much easier than something like Wine.


There's a blurb on "why Go?" on the website if you're interested.

https://gvisor.dev/docs/architecture_guide/

> gVisor is written in Go in order to avoid security pitfalls that can plague kernels. With Go, there are strong types, built-in bounds checks, no uninitialized variables, no use-after-free, no stack overflow, and a built-in race detector. (The use of Go has its challenges too, and isn’t free.)

Using a memory-safe language was a conscious design decision. https://twitter.com/LazyFishBarrel/status/112900096574140416...


I kind of agree with you, but actually like that gVisor exists and is written in Go.

As it kind of proves a point about systems level software being written in Go.

I would rather see VM/unikernels take off instead.


I can't find a link to the thread (in 5 minutes of Googling), but IIRC in a thread discussing the 2nd iteration [1] of a patch to fix the recent runc container-breakout exploit this year, one of the developers responsible for the patch flat out stated that Go was a poor choice for runc and has resulted in too much pain and ugly hacks. For example, because namespaces are per-thread in Linux and you can't control how Go threads (goroutines, cgo stacks) migrate across kernel threads, the most basic task of simply creating and entering a namespace is complicated. (Neither Go nor Linux is amenable to providing a mechanism to alleviate the issue.) And then there's the issue of memory management: too much bloat and lack of fine-grained control compared to a manually managed memory environment. These things don't usually matter, but when they do matter they really matter. They can become the primary source of complexity.

[1] The original fix was to exec runc from a memfd-backed copy so writing to /proc/self/exe in the container didn't poison the binary outside the container. But the change in memory usage broke some existing workloads in the wild which had low memory resource limits. I think the second iteration used O_TMPFILE on tmpfs, or at least that was what was under discussion.
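
To illustrate the namespace point above, here's a sketch of the dance Go code has to do to enter, say, a network namespace (using golang.org/x/sys/unix; the path and the enterNetNS helper are made up for illustration, and this needs CAP_SYS_ADMIN):

    // Sketch of entering a namespace from Go. setns(2) affects only the
    // calling OS thread, so the goroutine must be pinned to one thread,
    // and that thread is then deliberately thrown away.
    package main

    import (
        "runtime"

        "golang.org/x/sys/unix"
    )

    func enterNetNS(nsPath string, fn func() error) error {
        errCh := make(chan error, 1)
        go func() {
            // Pin this goroutine to its OS thread, and never unlock: when a
            // goroutine exits while locked, the runtime discards the
            // (namespace-tainted) thread rather than reusing it.
            runtime.LockOSThread()
            fd, err := unix.Open(nsPath, unix.O_RDONLY|unix.O_CLOEXEC, 0)
            if err != nil {
                errCh <- err
                return
            }
            defer unix.Close(fd)
            if err := unix.Setns(fd, unix.CLONE_NEWNET); err != nil {
                errCh <- err
                return
            }
            errCh <- fn() // runs inside the target network namespace
        }()
        return <-errCh
    }

    func main() {
        // Hypothetical usage: run fn inside another process's net namespace.
        _ = enterNetNS("/proc/1234/ns/net", func() error { return nil })
    }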


> Here is a good example of someone who broke out of a container on the play-with-docker.com site using a custom kernel module [1]. This allowed a container escape but you could say this was a bug since that wasn't the intent. So, you'd patch it. So, I get the joke in that people are extremely creative and will find ways around everything.

That one was an extremely obvious misconfiguration - running with --privileged=true. There are dozens of ways to abuse that, probably much easier than using a custom kernel module.

Yes, containers do contain, but the attack surface is MUCH larger than a virtual machine or something like gVisor. Just look at the constant stream of Linux local privilege escalations.


Rather than the contain/don't-contain dichotomy, what's more important is gVisor's design principle that it always keeps 2 layers of isolation from the host, so no single bug in the Linux kernel, the Sentry, or elsewhere is enough to break out of the sandbox. This leaves you less exposed to 0-day attacks and lags in patching kernels.

You can't get that from normal Linux containers due to their fundamental design.


Well, it happened, and on a pretty popular site too. So if they got it wrong, how many other people do too? This is a core reason folks should check out gVisor. Not sure why the downvotes, as this is a pretty good example use case.


gVisor has unsafe modes of operation, too. What I'm saying is that this is not a good example of "Container breakout", as it was just a misconfiguration, not an exploit.

"people are extremely creative and will find ways around everything" is not an excuse - it's a matter of risk management and threat modelling.

Escaping from a VM or gVisor is much, much harder than escaping from Linux namespaces ("containers") due to the MUCH smaller attack surface/amount of exposed code. Using Linux containers in an untrusted multi-tenant environment is very dangerous, especially if you're a high-profile cloud provider, which is why all of these projects exist.


So, Play with Docker is trying to do something very niche that almost no one else would try in production: running Docker inside Docker, which they do in order to provide the very cool service they offer.

Their breach isn't really a good indicator as I can't think of any/many reasons that most companies would try and do that...


> Their breach isn't really a good indicator as I can't think of any/many reasons that most companies would try and do that...

There are a bunch of legitimate reasons to run Docker in Docker. The most obvious is in a build pipeline. For example Jenkins does Docker builds in containers all the time.



