Great article. There are just a few things that I would add to it.
1. The metrics comment could not be more true. You cannot think hard enough about the actual problems you will encounter and what metrics you will have. Have a common way to pull information about each service, and to analyze it in some common place.
2. Standardize, standardize, standardize. When someone is answering a page at 2 AM and has to deal with a component of your system that they don't really know, the more it resembles other components, the better. Think hard about how you can get every service to be written in such a way that the same critical information is available. Where is it documented? Who do I page? Where is the monitoring? Yes, random third party components won't follow your standards, and will be hard to integrate. But standardization is a good thing.
3. You need to be able to send canary requests that trace through your whole infrastructure. You should be able to flag a front-end request and have every single request that it generates through your entire system be logged somewhere, so that you can see a breakdown of what happened. Hide this so that nobody can use it to take your site down, but build the capacity for yourself. (Standardization will make such a system much easier to build!)
4. Randomly canary a small fraction of your traffic. There is tremendous value in having a random sample of traced traffic. When you're trying to understand how things work, there is nothing like taking an actual request going through a complex system, and seeing what it did. Furthermore if you've got intermittent problems for a small fraction of users, being able to look at a random slow request really, really, really helps you track down issues that otherwise would be virtually impossible to replicate.
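Points 3 and 4 can be combined into one mechanism: a trace id that is either forced by a hidden debug flag or attached to a random sample of traffic, then propagated on every downstream call. A minimal sketch (the header names and sample rate are illustrative assumptions, not from the article):

```python
import random
import uuid

TRACE_SAMPLE_RATE = 0.01  # trace roughly 1% of front-end requests (assumed)

def start_request(headers):
    """Decide whether this front-end request should be traced.

    A hidden debug header forces tracing; otherwise a small random
    fraction of traffic is sampled.
    """
    forced = headers.get("X-Debug-Trace") == "1"
    sampled = random.random() < TRACE_SAMPLE_RATE
    return str(uuid.uuid4()) if (forced or sampled) else None

def call_downstream(trace_id, service, payload):
    """Propagate the trace id on every internal request it generates.

    Returns the headers that would be sent; downstream services log
    their work under the same id so the whole fan-out can be replayed.
    """
    headers = {}
    if trace_id is not None:
        headers["X-Trace-Id"] = trace_id
    # ... actually send payload to `service` with these headers ...
    return headers
```

Untraced requests pay almost nothing, so the sampling rate can stay on permanently in production.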
One of the most shameful moments of my career happened on the Saturday night before WWDC in 2010. I'd never worked on a system with more than one component or more than 20,000 lines of code. Drunk, I managed to stumble from Jeff's hot-tub to his roof and told him that Twitter's scaling problems were "stupid" and that "there were no good reasons why Twitter should go down".
I've apologized to him before, but after having worked on Minefold for the last 2 years, I feel like I need to apologize again. The work that Jeff and the others at Twitter have done has been amazing. I was also a massive cock.
Until you experience the unrelenting hunger for your data by the gnawing mouths of the multitudes, it is hard to understand what a big job it is to hold them back. I have almost had panic attacks from (comparatively) much smaller user-bases than that.
Also, that was perhaps the biggest apologetic act I have ever witnessed...good job.
There's a growing area where I see a lot of people starting to get involved with distributed systems who didn't have to deal with them before: rich in-browser apps with their own permanent storage.
Once you have a Javascript application with state speaking over one or more APIs to your backend services, you're in the domain of distributed system design. Especially if you use the application cache and support offline operation. I frequently see people who are used to more traditional web applications underestimate the system design challenges this causes.
(Those challenges are totally worth solving, because the model is very powerful.)
Since Couch has a master-master replication protocol, the database in the browser can just sync with the one on the server. Then the network gets disconnected, but that's alright: when it gets re-connected, continuous updates pick up where they left off, conflicts are resolved, and it sort of works as a distributed system where the browser is just another node.
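The core of that sync idea can be shown with a toy sketch. This is not Couch's actual protocol (Couch tracks revision trees and stores losing conflict revisions); it's a minimal illustration where each document carries a revision number, the higher revision wins, and equal revisions with different data are resolved deterministically so both sides converge:

```python
def sync(local, remote):
    """Toy bidirectional sync: each value is a (rev, data) tuple.

    Higher revision wins; missing docs are copied over; equal revisions
    with different data are a conflict, resolved here by picking the
    max tuple so both replicas converge to the same winner.
    """
    for key in set(local) | set(remote):
        l, r = local.get(key), remote.get(key)
        if l is None:
            local[key] = r
        elif r is None:
            remote[key] = l
        elif l[0] > r[0]:
            remote[key] = l
        elif r[0] > l[0]:
            local[key] = r
        elif l != r:
            # same rev, different data: pick a deterministic winner
            local[key] = remote[key] = max(l, r)
    return local, remote
```

After a disconnection, running `sync` again on reconnect brings both replicas to the same state, which is exactly the "browser is just another node" property.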
People mean different things when they say "local storage". Some of the storage APIs prompt the user, some don't, some only prompt if you ask for more than a certain amount of storage space. The various browsers are not consistent, especially when it comes to quota management.
For someone building a consumer web app that needs to painlessly convert users in volume, I can see where it would still be a pain. For me, it's not so bad -- my app is B2B, and having an install step as simple as clicking "Allow this app to use up to 500MB on your computer?" is actually a huge improvement over the "enterprise" junk we compete with.
Having done distributed systems for many years, the only thing I would like to add is that all systems are not equal, so some of the things mentioned in this article might not apply. In my experience, the real key to making a distributed system work was to identify the critical paths, segment and encapsulate the functionality of different components into the smallest possible implementations, and then define robust messaging between the components to allow for coordination. In my case we were making control systems for Naval systems, so we had a much different set of requirements than websites, but it was a massively distributed system (over 10K nodes) with strong requirements for redundancy and latency (no Ethernet here, just field bus networks). I even wrote and published a paper on it some time back. The point is that this is not the only path; there are other ways to implement distributed systems, and there is still much to be discovered in the field.
Shameless plug: it's validating that we mentioned over half of these points in an interview with startup founder Eric Lindvall on his Papertrailapp.com log management app: https://peepcode.com/products/up-lindvall
That is an awesome article. And if you're paying attention, it's a great way to interview potential engineers for your org who will be asked to build distributed systems. At Google, one of the engineers was asked if the system he was building was "mostly reliable", and his answer was great: "No, I assume that every computer this runs on is trying to bring down the service, and the service tries to dodge in such a way that it stays up anyway."
I'm working on a mobile app that does a key exchange with a server before allowing a server-based registration or login. It's nowhere near as complex as your average distributed system.
That said, I've run into a scary amount of the things mentioned in this article in my tiny little use case. Just trying to ensure a decent user experience (timing out a comms check after two seconds rather than waiting up to 60 seconds when the phone switches from networked to disconnected) in an async message exchange needs some crazy orchestration. Keeping the code clean means refactoring stuff I thought I had nailed two months ago.
I've been programming for a long time, but this stuff humbles me. And happily, I love it.
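The fail-fast comms check described above, timing out after two seconds rather than waiting out a long TCP timeout, can be sketched with `asyncio` (the probe callable and the exact timeout values are assumptions for illustration):

```python
import asyncio

async def comms_check(send_probe, timeout=2.0):
    """Fail fast on a connectivity probe instead of waiting out the full
    network timeout (often ~60s when a phone silently loses its link).

    `send_probe` is an assumed async callable that completes when the
    server answers. Returns True if reachable within `timeout` seconds.
    """
    try:
        await asyncio.wait_for(send_probe(), timeout=timeout)
        return True
    except (asyncio.TimeoutError, OSError):
        return False
```

The UI can then branch immediately on the boolean instead of hanging, which is most of what "decent user experience during a network transition" comes down to.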
Very glad that Jeff wrote this up; he deserves mad credit for documenting something that has been frustrating me for years. People don't realize how hard doing this right is, and they discount the hard work that goes into making large systems scale in a stable manner.
It isn't just load testing; more that the whole system should be considered suspect. If you don't act defensively at all steps, you will be hosed by something you thought would never happen. Just had a good talk about this last weekend. Memory, TCP, and all the other rock-solid things can and will have issues in large systems.
This is perhaps a point one cannot get to theoretically, just by thinking about it; the realization comes after observing the effects in practice. It is interesting to watch distributed systems, especially ones that rely on asynchronous messages, when things go wrong and queues start filling up. Or, even weirder, machines start to synchronize for some reason, like the sizes of the queues oscillating in resonance of some sort.
One way out is to reduce asynchronicity but that comes with serious performance penalties.
I view this the opposite way around. Message queues and asynchronicity are an alternative to backpressure as described.
If I synchronously hit some service then I need to build facilities into the service to say that it's too heavily loaded, and perhaps handle those responses on the client.
If I pop some job onto a message queue or asynchronous endpoint, things will just slow down for a while, which is normally as good a way as any of soaking up load.
As someone already replied, message queues under constant load can grow unbounded.
> If I synchronously hit some service then I need to build facilities into the service to say that it's too heavily loaded, and perhaps handle those responses on the client.
Most of the time, you have to have timeouts when there is a synchronous API across the network.
In a large system, ideally you'd want to reflect the level of loading back to the input source so the source can slow down sending the data. Think of TCP: it works this way on a small scale. One way to fix the problem is to leverage that, and actually open a TCP stream and send the data that way; the sender will slow down accordingly.
Now if you know that your load is not constant, so there are periods of high activity when inputs are generated then you can try and absorb (and amortize) those high volume peaks using asynchronous queues.
Arrival rate and dequeue/processing time are the keys here. Avoiding unbounded queues is easy: just limit the queue size. Of course, this isn't what the user of the service wants, and it leads to a very bad experience.
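A bounded queue that rejects rather than grows is the simplest form of explicit backpressure being argued about in this thread. A minimal sketch (the bound of 100 is an arbitrary illustration):

```python
import queue

jobs = queue.Queue(maxsize=100)  # bound chosen purely for illustration

def submit(job):
    """Try to enqueue; push back on the caller instead of growing forever.

    Returns True if the job was accepted, False if the system is
    saturated and the client should retry later or shed the request.
    """
    try:
        jobs.put_nowait(job)
        return True
    except queue.Full:
        return False
```

The `False` path is where the bad user experience lives: the caller now has to decide whether to retry, degrade, or surface an error, which is exactly the facility the synchronous-service comment said you'd have to build anyway.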
No. That's only one example, depending on how the website is setup, but they're not limited to that. A distributed system is one that requires coordination among multiple machines to execute a task or set of tasks. A few examples:
- A website using a load balancer to offset service to multiple machines
- A hadoop cluster running a series of mapreduce tasks
Great article.
I've just recently started exploring distributed systems (specifically with Riak Core/Erlang) and I hit upon a solution to a problem I've been working on (I think) while reading.
It was a great article but this sentence had me a bit confused:
"Well, a typical machine at the end of 2012 has 24 GB of memory, you’ll need an overhead of 4-5 GB for the OS, another couple, at least, to handle requests, and a tweet id is 8 bytes."
My Desktop has like 40 chrome processes, is running GeoServer on Tomcat (currently idle, but whatever), and the entire Unity desktop environment. I'm currently at 2.7GB. What kind of server OS needs 4-5GB?
Yes. Depending on load and services, servers require a great deal more in buffers and cache than your desktop. Figure out the memory allocation for a single TCP connection on a default Linux 3.0 kernel and multiply that times (at least) 10k. Considering you probably want greatly expanded limits and bulk transfer, you might want to multiply that times 10 or 100. That's just TCP buffers for active sessions.
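The back-of-envelope arithmetic suggested above looks roughly like this. The 212992-byte figure is an assumption, a common Linux default for `net.core.rmem_default`/`wmem_default`; actual values vary by kernel and tuning:

```python
def tcp_buffer_bytes(connections, rmem=212_992, wmem=212_992):
    """Rough kernel socket-buffer memory for N active TCP connections.

    212992 bytes per direction is an assumed common Linux default;
    servers tuned for bulk transfer often raise these limits 10-100x,
    scaling this estimate accordingly.
    """
    return connections * (rmem + wmem)

# 10k sessions at assumed defaults: about 4 GiB of buffer space alone
print(tcp_buffer_bytes(10_000) / 2**30)
```

That's before counting page cache, connection-tracking state, or application memory, which is why server memory budgets dwarf desktop ones.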
I initially replied to a comment, but I'll post this as a direct reply: Robert Morris teaches a course on Distributed Systems that covers a lot of great stuff; a lot of which is available online:
Great article; concise explanation of most of the points I would have expected to be there. Especially some of the more subtle things like backpressure everywhere, measure with percentiles, and the importance of being partially available.
Great article. I would also like to mention a point: implementing a "chaos monkey". As the Netflix team puts it, "We have found that the best defense against major unexpected failures is to fail often." [1]