It's an absolutely wonderful article - perhaps one of the best I've seen on latency tuning for a Linux stack - but I don't understand why so little was said about the final step, "Since we have a Solarflare network card handy, we can use the OpenOnload kernel bypass technology to skip the kernel network stack all together:". That was the major punch line of the entire article, and it got a minor "oh, by the way" sort of treatment.
Great article regardless. I'm wondering what the latency would have been if they had started with that step.
Kernel bypass for networking and disk I/O (common in good databases) with userspace implementations has one large downside: it requires an enormous amount of sophisticated software engineering to realize the integer-factor performance gains that are possible.
You have to write a lot of code to reimplement the kernel functionality you need, and it takes a software engineer of atypical skill to make the result robust and performant. In other words, it is expensive and requires a skill set that is relatively rare. However, if you have significant server CapEx and OpEx, it can totally be worth it to reduce your required server footprint.
BTW, you can do this with many common Ethernet chipsets, including the ubiquitous Intel ones, but they are all a bit different to work with.
Having used it, I can say it will generally consume all available CPU cycles on the core(s) on which it runs. So presumably it is doing a lot of busy-waiting. Which, if low latency is your top priority, may be just fine.
OpenOnload's spinning behavior is highly tunable and disabled by default. The default profile might appear to take more CPU cycles, but really they're the same CPU cycles the kernel would otherwise be using to operate the network stack (just across the kernel/userspace barrier, and possibly on another core/socket/NUMA node, etc., introducing latency).
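For anyone who hasn't used it: spinning is controlled by environment variables read when the app is launched under the onload wrapper. A minimal sketch (the app name is a placeholder, and the spin value is illustrative, not a recommendation):

```shell
# EF_POLL_USEC sets how long (in microseconds) OpenOnload busy-waits
# on the network before blocking; by default spinning is off.
EF_POLL_USEC=100000 onload ./my_app    # ./my_app is hypothetical

# Or use the bundled low-latency profile, which enables spinning
# along with other tunables:
onload --profile=latency ./my_app
```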
It's a great article, but I'm surprised that the author introduces spinning and taskset pinning without mentioning CPU isolation (either with cgroups/cpusets or isolcpus). Skipping that step can introduce problems of its own.
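For anyone unfamiliar, a minimal sketch of the isolation I mean (core numbers and the app name are illustrative):

```shell
# Reserve cores 2-3 from the general scheduler at boot via the
# kernel command line (set in your bootloader config):
#   isolcpus=2,3
# Optionally steer device IRQs away from those cores as well,
# via /proc/irq/*/smp_affinity.

# Then pin the spinning process onto an isolated core:
taskset -c 2 ./my_latency_app    # ./my_latency_app is hypothetical
```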
No more local tcpdump. No more kernel routing, copying, logging, or filtering of packets. As someone who's hacked on OoL and other kernel bypass stacks a little, it is a huge win if you are latency sensitive, especially if you're willing to thrash a core with constant polling for the absolute lowest latency.
There are a lot of applications where it's a huge win. Security applications (snort, bro, flow monitoring) leverage onload/bypass with multiple receive queues to massively scale up.
Create a receive queue for each core on the NUMA node, spin up parallel processes (one per core, each pinned to its core), then aggregate the data. Wire-rate monitoring with a $700 NIC and a commodity PC. Pretty amazing considering what companies used to pay for such performance.
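The fan-out can be sketched roughly like this (the capture binary and its --queue flag are hypothetical; it assumes the NIC's flow steering has already been set up with one receive queue per core):

```shell
# One capture process per core on the NUMA node (cores 0-3 here),
# each pinned with taskset and reading its own receive queue.
for core in 0 1 2 3; do
    taskset -c "$core" ./capture --queue "$core" &
done
wait   # aggregate the per-core output afterwards
```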
FWIW, we use the Myricom NICs with Sniffer10G licenses. One neat thing you can do with them is sniff traffic from the same queue across multiple processes. You no longer have to mirror the traffic for multiple procs to read from the NIC.
It's open-source but patent-encumbered, which is a strange state. It's not a user-space TCP stack but a "card-space" one: it's offloading all the work into the card.
It's not offloading ALL the work onto the card. With 'offload' people often think of features like computing checksums in hardware, but that's not a big deal on modern hardware. OpenOnload definitely leverages the card's capabilities, notably packet buffer handling, which is exposed in their ef_vi API (there's a comment about this on the article's page); the flow steering (mentioned in the article) is a big help too. But OpenOnload itself is a user-space network stack built on top of ef_vi.
And so much of the win comes from being userspace and tunable. OpenOnload also accelerates pipes, loopback UDP/TCP, and epoll, all of which are unrelated to the hardware. Indeed, having unaccelerated fds in the same epoll set as accelerated fds kills performance, because OpenOnload has to ask the kernel for the status of the unaccelerated fds.
That said, I think you need to have a Solarflare card for OpenOnload to work at all, even if you only use it for local communication. I haven't tried that, though.