Hacker News
How to achieve low latency with 10Gbps Ethernet (cloudflare.com)
142 points by jgrahamc on June 30, 2015 | hide | past | favorite | 39 comments


There has been a spate of articles lately that sum up to one fact: out of the box, all Linux distributions are tuned for nothing in particular. With some configs you can reduce your latency by an order of magnitude, increase your MySQL ops by an order of magnitude, etc. This is one reason why I, as an application-level programmer, am glad to work at a company big enough to have someone else dealing with the platform on my behalf. I don't want to spend ten years slowly discovering traps like a default window size tuned for T1 speed, a default min_rto tuned for interplanetary traffic, not enough PIDs to support a 96-core machine, etc.
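A quick way to see where your own box stands on a few of those traps (exact defaults vary by kernel and distro; the gateway below is a placeholder):

```shell
# TCP buffer autotuning limits: min / default / max, in bytes
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# Maximum PID value -- 32768 on many kernels, too low for huge core counts
sysctl kernel.pid_max

# min RTO is tuned per-route rather than via a global sysctl, e.g.:
# ip route change default via 192.0.2.1 rto_min 20ms   # 192.0.2.1 is a placeholder
ip route show
```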


> all Linux distributions are tuned for nothing in particular.

Which is good, because often tuning means making trade-offs. A lot of those things like polling and interrupt shuffling have side effects and might for example negatively impact other scenarios -- presumably not everyone out there is running single client-server ping-pong with small UDP packets.

Yes, a lot of parameters have retained default values from back in the day when system characteristics were different. Switching defaults drastically might break existing code that relied on them. Perhaps those are better updated during major distro version upgrades (RHEL 6->7, for example).

In general though, I would say Linux is tuned for average throughput not latency. Sometimes they go hand in hand, sometimes they don't. Over the years there have been patches and forks related to operating on low latency environments (stock trading, media processing, hardware control, graphics in desktop environment). Those range from parameters tweaks in /proc to kernel patches, to almost full kernel bypass via things like DPDK for Ethernet for example.


Sure, they are all tradeoffs of one kind or another. It's crazy to think that one person deploying a Linux box on AWS would be aware of them all, though. For example, what is the default file-max on Ubuntu? Why?
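For anyone curious, you can at least look the answer up; the system-wide default is derived from available memory at boot, which is part of why "why?" is hard to answer:

```shell
# System-wide cap on open file handles (memory-derived default)
sysctl fs.file-max

# The per-process limit is separate, and usually far lower
ulimit -n
```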


Well, you have different use cases and a large panel of knobs and switches you can adjust. You need someone who knows what the knobs do and what effect they have.

So perhaps there is a use case for a wizard-like configuration tool for intermediate users. It could ask you questions like: how many clients do you have? How many concurrent connections? Would you rather have higher average throughput (say, for downloading files) or lower packet latency (say, for audio processing)? Would you rather use your memory for a file cache or for your application data? And so on.

Next step up -- a self tuning system. It monitors your usage pattern then decides what settings to use. Monitors again, and adjusts a bit and so on.
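Something in this direction already ships on Fedora/RHEL: the tuned daemon applies named profiles rather than asking questions, but it covers the same "throughput vs latency" choice:

```shell
tuned-adm list                         # show the available profiles
tuned-adm profile latency-performance  # trade power/throughput for latency
tuned-adm active                       # confirm which profile is applied
```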


I've always been somewhat wary of self-tuning systems. The challenge is that the runtime profile of a well-utilized machine is not always predictable, especially if it is multi-tenant and/or running a mixture of different workloads in a container-like abstraction. I think a level higher is easier to manage and understand: treat multiple machines, or multiple containers or VMs, as the processing units. As a whole, that seems more manageable as a self-tuning system.


Tuning profiles could be suggested and a human would apply them. But yeah, that was more of a pie in the sky idea.


Yeah, you forgot the default file descriptor limits that are an order of magnitude too low for running any kind of modern web service. At least it doesn't require a patch and recompile anymore like it used to back in the 90s when I was running big IRC servers.
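These days it's just configuration (assuming pam_limits is in play, as on most distros); the usernames below are examples:

```shell
ulimit -n          # current soft limit -- often a measly 1024

# /etc/security/limits.conf (or a drop-in under limits.d):
#   www-data  soft  nofile  65536
#   www-data  hard  nofile  65536

sysctl fs.file-max # the system-wide ceiling; raise it too if needed
```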


Nitpick: Maybe you are thinking of Windows wrt buffer sizes? Linux TCP out of the box has had window scaling enabled since 2004 or so. "sysctl net.ipv4.tcp_rmem" tells me the max buffer size is 6+ MB on my old 4GB laptop, which should be good to saturate a gigabit pipe with up to 50 ms RTT... Minimum retransmission tuning might get you something on your own network if it's congested, but would probably be bad to crank down for general internet usage, given things like bufferbloat and large delays caused by wireless L2 retransmissions.
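The 6 MB / 50 ms numbers check out against the bandwidth-delay product, which is the buffer you need to keep a pipe full:

```shell
# Bandwidth-delay product for a 1 Gbit/s link with 50 ms RTT:
# (bits per second / 8) * RTT seconds = bytes of buffer needed
bdp_bytes=$(awk 'BEGIN { printf "%d", (1e9 / 8) * 0.050 }')
echo "A 1 Gbit/s, 50 ms path needs about ${bdp_bytes} bytes of buffer"
```

That comes to 6.25 MB, right at the tcp_rmem max quoted above.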


I agree, and Linux's default state is a matter I raise when people talk to me about how awesome DevOps and Full Stack developers are. They may be able to launch a VM and get a web stack up and running with yum, but that's really not reasonable for production usage.


The way we work this is that DevOps builds the systems, and the performance testers tune the configs based on their testing. Performance testing is really not the right word for what they do; it's really performance engineering.

Regardless of whether you use DevOps or not, if scalability is an issue, someone needs to be doing performance testing and sizing to get a rough idea of how many users per node you can support. It's part of the QA cycle, and that doesn't go away with DevOps.


Linux TCP window tuning is in fact off; I agree with thrownaway. It is, by the way, a trick that CDN providers use, since they tune their stacks to increase download speed. But you could just as well do the same tuning yourself.
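One concrete, well-known example of that CDN-style stack tuning is raising the initial congestion window per route (the gateway and device below are placeholders):

```shell
# Inspect the current default route first
ip route show default

# Raise initial congestion and receive windows to 10 segments
# (192.0.2.1 / eth0 are placeholders for your gateway and interface)
ip route change default via 192.0.2.1 dev eth0 initcwnd 10 initrwnd 10
```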


How can one learn to do this? Are there good books/classes on the topic? Learning from mentors seems pretty vulnerable to cargo culting/bad science but I don't know where to even start.




It's an absolutely wonderful article - perhaps one of the best I've seen regarding latency tweaking on a Linux stack - but I don't understand why so little was said about the final step: "Since we have a Solarflare network card handy, we can use the OpenOnload kernel bypass technology to skip the kernel network stack all together". That was the major punch line of the entire article, and it got a minor "oh, by the way" sort of treatment.

Great article regardless. I'm wondering what the latency would have been if they had started with that step.


What are the tradeoffs? Is there any downside to this userspace network stack, or is it pure win?

Edit: other than needing the thousand-dollar NIC, I mean.


Kernel bypass for networking and disk I/O (common in good databases) via userspace implementations has one large downside: it requires an enormous amount of sophisticated software engineering to realize the integer-factor performance gains that are possible.

It means writing a lot of code to reimplement the kernel functionality you need, by a software engineer of atypical skill, to ensure a robust and performant outcome. In other words, it is expensive and requires a skill set that is relatively rare to use well. However, if you have significant server CapEx and OpEx, it can totally be worth it to reduce your required server footprint.

BTW, you can do this with many common Ethernet chipsets, including the ubiquitous Intel ones, but they are all a bit different to work with.


Having used it, it will generally consume all available CPU cycles on the core(s) on which it runs. So presumably it is doing a lot of busy-waiting. Which, if low latency is your top priority, may be just fine.


OpenOnload's spinning behavior is highly tunable and disabled by default. The default profile might appear to take some more CPU cycles, but really it's the same CPU cycles that the kernel would be using to operate the network stack (but across the kernel/userspace barrier and possibly on another core/socket/numanode, etc, introducing latency).
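For reference, the spin behavior is controlled through environment variables on the onload wrapper (names per the OpenOnload documentation; check your version, and the app name below is a placeholder):

```shell
# Spin for up to 100 us waiting for packets before blocking;
# 0 disables spinning. EF_POLL_USEC is the main knob.
EF_POLL_USEC=100 onload ./my_latency_sensitive_app

# Or explicitly interrupt-driven mode, trading latency for CPU
EF_INT_DRIVEN=1 onload ./my_latency_sensitive_app
```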

It's a great article, but I'm surprised that the author introduces spinning and task-setting without noting CPU isolation (either with cgroups or isolcpus). Not doing so could introduce problems.
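A minimal sketch of that isolation, assuming cores 2-3 are dedicated to the spinning app (app name and core numbers are examples; cset requires the cpuset tools):

```shell
# Boot parameter keeps the scheduler off cores 2-3:
#   GRUB_CMDLINE_LINUX="... isolcpus=2,3 nohz_full=2,3"

# Pin the latency-sensitive process onto an isolated core
taskset -c 2 ./my_app

# Or the cgroup/cpuset route, shielding the cores and running inside the shield
cset shield --cpu 2-3 --kthread on
cset shield --exec -- ./my_app
```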


No more local tcpdump. No more kernel routing, copying, logging, or filtering of packets. As someone who's hacked on OpenOnload and other kernel-bypass stacks a little: it is a huge win if you are latency-sensitive, especially if you're willing to thrash a core with constant polling for the absolute lowest latency.


There are a lot of applications where it's a huge win. Security applications (snort, bro, flow monitoring) leverage onload/bypass with multiple receive queues to massively scale up.

Create a receive queue for each core on the NUMA node, spin up parallel processes (1 for each core and pin to the core), then aggregate the data. Wire rate monitoring with a $700 NIC and a commodity PC. Pretty amazing considering what companies used to pay for such performance.
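A rough sketch of that setup, assuming an 8-core NUMA node, an interface called eth2, and a hypothetical ./worker binary (run as root; names and counts are examples):

```shell
# One RX queue per core on the node
ethtool -L eth2 combined 8

# Pin each queue's IRQ to its own core, round-robin
cpu=0
for irq in $(awk -F: '/eth2/ {print $1+0}' /proc/interrupts); do
    echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
    cpu=$((cpu + 1))
done

# One worker process per core, pinned to match its queue
for cpu in 0 1 2 3 4 5 6 7; do
    taskset -c "$cpu" ./worker --queue "$cpu" &
done
```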


FWIW, we use the Myricom NICs with Sniffer10G licenses. One neat thing you can do with them is sniff traffic from the same queue across multiple processes. You no longer have to mirror the traffic for multiple procs to read from the NIC.


In my experience there is a pretty high operational burden to it. You spend a lot more time configuring, testing, diagnosing errors, etc.


It's open-source but patent-encumbered, which is a strange state. It's not a user-space TCP stack but a "card-space" one: it's offloading all the work into the card.


It's not offloading ALL the work onto the card. With 'offload' people often think about features like computing checksums in hardware, but that's not a big deal on modern hardware. OpenOnload definitely leverages the card's capabilities, notably the packet buffer handling which is exposed in their ef_vi API (there's a comment about this on the article's page); the flow steering (mentioned in the article) is a big help too. OpenOnload is a userspace network stack built on top of ef_vi.

But much of the win comes from being userspace and tunable. OpenOnload also accelerates pipes, loopback UDP/TCP, and epoll, all of which are unrelated to their hardware. Indeed, having unaccelerated fds in the same epoll set as accelerated fds kills performance, because it needs to ask the kernel for the status of the unaccelerated fds.

That said, I think you need to have a SolarFlare card for OpenOnload to work at all, even if you just use it for local communication. I've not tried it though.


I don't think that's the case for Solarflare; they call it "onload" for a reason.


NUMA pin the DMA buffers too. I didn't see that in the article. With the app I worked on getting the PCI bus/CPU/memory/interrupt all pinned to the "right" node gave us a good 30% performance improvement.

Put another way, identify which socket the PCIe bridge and device are attached to, then make sure the DMA buffers, and interrupt handler are pinned to that node.

Also, don't touch the buffers from the other socket. Strangely enough touching the DMA buffers from the other node just once was enough to cause a performance fall off.
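A sketch of figuring out the right node and binding to it (interface name and node number are examples):

```shell
# Which NUMA node is the NIC's PCIe slot attached to?
cat /sys/class/net/eth2/device/numa_node

# Run the app with both CPUs and memory allocations bound to that node
# (say it printed 1):
numactl --cpunodebind=1 --membind=1 ./my_app
```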


Probably because the cache line changes from Exclusive to Owned in MOESI.


Yah, that is the obvious choice, but it wasn't as clear cut. Especially with the behavior we were seeing where the performance problems would remain long after the lines were flushed from the CPU caches. I spent a fair amount of time looking at Intel docs on snoop filters and the like and never really satisfied myself that I knew exactly what was causing the problem.


Yeah, with previous generations of chipsets/CPUs, I couldn't saturate quad-10GbE receive streams unless I bound the buffers to particular nodes and IRQs to corresponding cores.


10 GbE is often good enough for most use cases. For same-datacenter, same-floor gear, if there is a pressing requirement for low latency, deploying 1-2 generations old InfiniBand (IB) gear is likely a smarter approach: the cost approaches 10 GbE and the latency is sub-microsecond (0.5-0.7 us vs 20-30 us for 10 GbE, a factor of 40-60x). Deploying OFED can be a pain unless you're using a distro like Rocks; don't mix and match different vendors' gear (stick to all Mellanox or all QLogic), and be sure the specific BIOS versions of everything are supported. For Linux Rocks cluster support/deployment, we've used the cool & helpful folks over at StackIQ (was Cluster Corp).
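The perftest suite that ships with OFED makes it easy to check what a fabric actually delivers (hostname below is a placeholder):

```shell
# On the server side, just start the listener:
ib_send_lat

# On the client, point it at the server:
ib_send_lat ib-server-hostname
# Reports min/typical/max one-way send latency over the fabric
```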


Really important point: this is measuring baseline latency of sending one packet at a time. They're not measuring latency under load where queueing delays are possible. The picture there usually looks pretty different and some of the settings they're changing here may be harmful or irrelevant to that case.


Also why upstream smart batching is so important. If you only have one packet, shovel it off. But maybe wait 1-2 microseconds and see if you can combine 10 packets (up to the MTU) to decrease overall latency.


It depends some on what layer you're talking about. Nagle's algorithm makes a lot of sense in a lot of situations. But if you've got a complete packet that's ready to encapsulate in an ethernet frame and send over the wire, you only want to try for batching if you're actually hitting a CPU limit from interrupt frequency or something like that. Unless you're dealing with WiFi where extreme amounts of packet aggregation are necessary to get close to the theoretical throughput.


Well if you use smart batching ala ( http://mechanical-sympathy.blogspot.com/2011/10/smart-batchi... ), you don't really have the same kind of hit that you would with Nagle's or a more simple minded batching algo.


Totally. Throughput is an enemy of latency. If you have some other applications on those RX queues or CPUs, you can't expect any latency numbers.


The extreme measures (bypassing the kernel etc.) allowed them to shave ~20 us off the latency, which corresponds to a few kilometers of fiber. So this could indeed make a difference for latency-bound apps talking to each other inside a data center.
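The "few kilometers" figure follows from propagation speed in fiber, roughly 200,000 km/s (about 2/3 of c):

```shell
# Distance light travels in fiber during 20 microseconds,
# assuming ~200,000 km/s propagation speed
dist_km=$(awk 'BEGIN { printf "%.0f", 200000 * 20e-6 }')
echo "${dist_km} km"
```

So 20 us of stack overhead costs you about as much as 4 km of cable.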


Step one should be 'we move drivers to userspace and bypass the kernel completely'.

edit: finished reading, it's the last step :/ and it roughly halves the best case they were able to achieve with tweaks.


That's still not easy enough to do. UDP is simple enough, but doing TCP in those drivers is still a pain. There's DPDK with lwIP, or with some BSD stack tacked on top, but so far I haven't found something in that area that I can just take for a ride.

For the extreme cases you really want to control the packet allocation for receive and make the buffers your app uses be also used for the network transmit so you get a real zero-copy.



