It's an absolutely wonderful article - perhaps one of the best I've seen on latency tuning for a Linux stack - but I don't understand why so little was said about the final step, "Since we have a Solarflare network card handy, we can use the OpenOnload kernel bypass technology to skip the kernel network stack all together:". That was the major punch line of the entire article, and it got a minor "oh, by the way" sort of treatment.
Great article regardless. I'm wondering what the latency would have been if they had started with that step.
Kernel bypass for networking and disk I/O (common in good databases) with userspace implementations has one large downside: it requires an enormous amount of sophisticated software engineering to realize the integer-factor performance gains that are possible.
You have to write a lot of code to reimplement the kernel functionality you need, and it takes a software engineer of atypical skill to make the result robust and performant. In other words, it is expensive and requires a skill set that is relatively rare. However, if you have significant server CapEx and OpEx, it can totally be worth it to reduce your required server footprint.
BTW, you can do this with many common Ethernet chipsets, including the ubiquitous Intel ones, but they are all a bit different to work with.
Having used it, I can say it will generally consume all available CPU cycles on the core(s) on which it runs. So presumably it is doing a lot of busy-waiting. Which, if low latency is your top priority, may be just fine.
OpenOnload's spinning behavior is highly tunable and disabled by default. The default profile might appear to take more CPU cycles, but really they're the same CPU cycles the kernel would otherwise be using to operate the network stack (just across the kernel/userspace barrier, and possibly on another core/socket/NUMA node, etc., introducing latency).
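For anyone who hasn't used it: spinning is controlled by environment variables read when the app is launched under the onload wrapper. A minimal sketch (the app name is a placeholder, and the spin value is illustrative, not a recommendation):

```shell
# EF_POLL_USEC sets how long (in microseconds) OpenOnload busy-waits
# on the network before blocking; by default spinning is off.
EF_POLL_USEC=100000 onload ./my_app    # ./my_app is hypothetical

# Or use the bundled low-latency profile, which enables spinning
# along with other tunables:
onload --profile=latency ./my_app
```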
It's a great article, but I'm surprised that the author introduces spinning and taskset pinning without mentioning CPU isolation (either with cgroups/cpusets or isolcpus). Skipping that step can introduce problems of its own.
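For anyone unfamiliar, a minimal sketch of the isolation I mean (core numbers and the app name are illustrative):

```shell
# Reserve cores 2-3 from the general scheduler at boot via the
# kernel command line (set in your bootloader config):
#   isolcpus=2,3
# Optionally steer device IRQs away from those cores as well,
# via /proc/irq/*/smp_affinity.

# Then pin the spinning process onto an isolated core:
taskset -c 2 ./my_latency_app    # ./my_latency_app is hypothetical
```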
No more local tcpdump. No more kernel routing, copying, logging, or filtering of packets. As someone who's hacked on OoL and other kernel bypass stacks a little, it is a huge win if you are latency sensitive, especially if you're willing to thrash a core with constant polling for the absolute lowest latency.
There are a lot of applications where it's a huge win. Security applications (snort, bro, flow monitoring) leverage onload/bypass with multiple receive queues to massively scale up.
Create a receive queue for each core on the NUMA node, spin up parallel processes (one per core, each pinned to its core), then aggregate the data. Wire-rate monitoring with a $700 NIC and a commodity PC. Pretty amazing considering what companies used to pay for such performance.
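The fan-out can be sketched roughly like this (the capture binary and its --queue flag are hypothetical; it assumes the NIC's flow steering has already been set up with one receive queue per core):

```shell
# One capture process per core on the NUMA node (cores 0-3 here),
# each pinned with taskset and reading its own receive queue.
for core in 0 1 2 3; do
    taskset -c "$core" ./capture --queue "$core" &
done
wait   # aggregate the per-core output afterwards
```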
FWIW, we use the Myricom NICs with Sniffer10G licenses. One neat thing you can do with them is sniff traffic from the same queue across multiple processes. You no longer have to mirror the traffic for multiple procs to read from the NIC.
It's open-source but patent-encumbered, which is a strange state. It's not a user-space TCP stack but a "card-space" one: it's offloading all the work into the card.
It's not offloading ALL the work onto the card. With 'offload' people often think of features like computing checksums in hardware, but that's not a big deal on modern hardware. OpenOnload definitely leverages the card's capabilities, notably packet buffer handling, which is exposed in their ef_vi API (there's a comment about this on the article's page); the flow steering (mentioned in the article) is a big help too. But OpenOnload itself is a user-space network stack built on top of ef_vi.
And so much of the win comes from being userspace and tunable. OpenOnload also accelerates pipes, loopback UDP/TCP, and epoll, all of which are unrelated to the hardware. Indeed, having unaccelerated fds in the same epoll set as accelerated fds kills performance, because OpenOnload has to ask the kernel for the status of the unaccelerated fds.
That said, I think you need to have a Solarflare card for OpenOnload to work at all, even if you only use it for local communication. I haven't tried that, though.