Hacker News

The part of this post that really blew my mind:

  We host our status site on Heroku to ensure its availability
  during an outage. However, during our downtime on Tuesday
  our status site experienced some availability issues.

  As traffic to the status site began to ramp up, we increased
  the number of dynos running from 8 to 64 and finally 90.
  This had a negative effect since we were running an old
  development database addon (shared database). The number of
  dynos maxed out the available connections to the database
  causing additional processes to crash.
Ninety dynos for a status page? What was going on there?


At the time of the outage, the status site was seeing upwards of 30,000 requests per minute.

As we scaled up dynos, we would see temporary performance improvements until the status site stopped responding again. In the short term, this led us to massively increase dynos as quickly as we could, since CPU burn appeared (at the time) to be a significant cause of the slowness. This was partly caused by all the dynos repeatedly crashing. That's how we ended up going from the previous 8 to 90.

Once the database problem for the status site was identified and resolved, we began scaling down dynos to a smaller number.


What prevented you from just caching the status page and refilling the cache every X seconds? I'm sure a status that was a few seconds old, given the system-wide meltdown, wouldn't have been an unreasonable compromise.


Or memcache, with one worker dyno dedicated to updating it, cron-like.
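A minimal sketch of that pattern in Python, assuming an in-process dict stands in for memcached (with real memcached you'd call the client's set/get instead). The point is that only one worker ever touches the database; web dynos just read the cached blob.

```python
import json
import threading
import time


class StatusCache:
    """Single-writer cache: one background worker refreshes the status
    blob on an interval; request handlers only ever read. A plain
    attribute stands in for memcached here -- with real memcached you
    would call client.set("status", blob) in refresh_forever() and
    client.get("status") in get()."""

    def __init__(self, fetch, interval=5.0):
        self._fetch = fetch                 # expensive call (e.g. hits the database)
        self._interval = interval
        self._blob = json.dumps(fetch())    # prime the cache once at startup
        self._lock = threading.Lock()

    def refresh_forever(self):
        # Run this in exactly one worker (the "cron-like" dyno).
        while True:
            time.sleep(self._interval)
            blob = json.dumps(self._fetch())
            with self._lock:
                self._blob = blob

    def get(self):
        # Web dynos call this: no database connection needed per request.
        with self._lock:
            return self._blob
```

With this shape, the number of database connections is fixed at one regardless of how many dynos serve the page, so scaling from 8 to 90 dynos could never exhaust the connection pool.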


30,000 req/minute is 500 qps. That's... just not a lot for a large service.


Anyone tested S3's static page hosting under heavy load? I would think you could just update the static file as a result of some events fired by your internal monitoring process.


We use S3 behind CloudFront with a 1-second max-age to serve The Verge liveblog. It's been nothing but rock solid. We essentially create a static site and push up JSON blobs. See here:

http://product.voxmedia.com/post/25113965826/introducing-syl...
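The core of that pattern is easy to sketch. Below is a hedged illustration, not Vox's actual code: the bucket and key names are made up, and the function takes any object with a boto3-style put_object() method so the upload itself isn't hard-coded. The short max-age lets CloudFront serve from edge caches but revalidate against the origin roughly once per second.

```python
import json


def publish_update(s3, entries, bucket="liveblog-example", key="live.json"):
    """Push the latest liveblog entries to S3 as a static JSON blob.

    `s3` is any client exposing a boto3-style put_object() method;
    the bucket and key names here are hypothetical. The Cache-Control
    header is what keeps CloudFront edge caches at most ~1s stale.
    """
    body = json.dumps({"entries": entries})
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
        CacheControl="public, max-age=1",
    )
    return body
```

Clients then poll the CloudFront URL for live.json; the "site" is just this one blob being overwritten on every update.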


This is really interesting -- thanks for sharing. It seems to me that you could probably have nginx running on a regular box and then CloudFront as a caching CDN to avoid the S3 update delay.


Probably could figure that out, yeah. But we didn't want to take any chances given how important it was to get our live blog situation under control.

[edit]

Which is to say, we wanted a rock solid network and to essentially be a drop in a bucket of traffic, even at the insane burst that The Verge live blog gets.


Could you say more about using both Cache-Control and a query string of epoch time? In particular, the query string has me puzzled. On its face it seems to decrease your cache hit ratio with little or no benefit. I'm assuming the epoch time is the client's local time. Clock skew across the client population increases the number of cache keys active at any one time, and the incrementing query string also forces a new cache key once per second. Those force a cache miss and a complete request to S3 even when the content has not changed. It's even worse with the skew, since you now force a cache miss per second for each unique epoch time across your clients' clocks. Without the query string, the cache could do a conditional GET for live.json, which would save latency and bytes since the origin could respond with a 304 instead of a complete 200.


Great point. I don't speak for the guys who made the decision to append the timestamp to the query, but I assume our concern is intermediate network caches that don't honor low TTLs. Though I don't know how well-founded that is, we won't ever have to deal with the issue if we take control of it with the URL string.

It'd be interesting to see how wide the key space is due to clock skew. I suppose we could specify some number and consider it a global counter that is incremented every second; when someone comes in for the first time they can be synced with the global incrementing counter. That counter is used to ensure a fresh CloudFront hit.
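One way to sketch that server-synced counter (an illustration, not the actual implementation): the server tells the page its epoch second once at load, and the client advances it with its own monotonic clock. Wall-clock skew between visitors then never widens the key space, since every client in the same second derives the same key. The URL shape below is hypothetical.

```python
def cache_buster(server_epoch_at_load, loaded_monotonic, now_monotonic):
    """Derive a shared per-second cache key.

    The server embeds its epoch second in the page at load time; the
    client adds the seconds elapsed on its own monotonic clock. Local
    wall-clock skew never enters the calculation, so all clients in
    the same second agree on one counter value.
    """
    return int(server_epoch_at_load) + int(now_monotonic - loaded_monotonic)


def live_url(counter):
    # Hypothetical URL shape: one cache key per second, shared by all
    # clients, so CloudFront sees at most one miss per second.
    return f"/live.json?t={counter}"
```

This keeps at most one active cache key per second globally, instead of one per distinct client clock.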

I think at the end of the day these issues haven't been a huge concern for a one-month emergency project, but they are good points.


S3 is great for static content. I was taking the AWS ops course and the instructor mentioned that some very large organizations redirect their sites to S3 when under DDoS so they can remain online. In fact, he said that AWS recommended this solution to them?! Can you fathom being under DDoS and being told, hey, just redirect that our way ;)


You pay for the bandwidth on AWS. Of course they would be glad to redirect a DDoS their way. It's pure gold for AWS.


"Heavy load"?

30 kRPM is 500 hits/sec. Nginx will serve more than 2,000/sec from an m1.small. For S3 that is about the equivalent of a mosquito fart.




