And when EC2 falls over, like it tends to do a few times a year? Hosts fall over, stuff dies. At something the scale of Snap, you're going to be doing setups that look a lot like cloud anyway: bringing new systems up either by cloning a disk or by PXE booting, setting up clustering, possibly reusing the stuff they're already running, etc. You're going to be writing a lot of the same failover code if you're running on someone else's hardware, so why rent?
> And when EC2 falls over, like it tends to do a few times a year?
Multi-AZ, multi-region complete failures are very, very rare. How often do you get a failure in your data center per year (that you notice)?
> You're going to be writing a lot of the same failover code if you're running on someone else's hardware, so why rent?
The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt.
When a handful of the "wrong" disks decide to revert to air-blocking bricks or your upstream network provider has an outage, you're lucky if it's something you can fix by heading to the data center. I promise that AWS or Google is better at running a DC than you are, and unless you're trying to enter the hosting business, I wouldn't advise spending the time and money it takes to match their uptime and feature set.
I've only managed data storage at the scale of many petabytes (and this was a handful of years ago), and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, and so on), I imagine this is a very non-trivial spend on scaling, staffing, and tech implementation.
At $2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative $250k/head (say $200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc.)), eating a year or two of transition costs moving off their cloud hosting providers, then probably saving a bit of money even after hardware, bandwidth, facility, and insurance costs. Hell, maybe they'd even open-source some software, and recruiting would get easier after conference talks about how they did it. Or maybe they get bought by Google or Facebook in a year. Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speed bump is a "fine" decision.
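Back-of-the-envelope, the staffing math above works out like this (numbers are the comment's illustrative figures, not real Snap accounting):

```python
# Rough staffing math from the comment above -- illustrative only,
# not actual Snap figures.
headcount = 50
fully_loaded_cost = 250_000      # per head, per year (salary + benefits + overhead)
years = 5

staffing = headcount * fully_loaded_cost * years
cloud_commit = 2_000_000_000     # the ~$2bb / 5-year cloud spend cited above

print(f"staffing over {years}y: ${staffing:,}")                 # $62,500,000
print(f"share of cloud commit: {staffing / cloud_commit:.1%}")  # 3.1%
```

So on these assumptions the whole team is about 3% of the cloud commitment; the real question is the hardware, bandwidth, and facility spend on top of it.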
>Multi-AZ, multi-region complete failures are very, very rare. How often do you get a failure in your data center per year (that you notice)?
First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment. Even though there's a lot of pomp and circumstance around the cloud, when it comes down to it, your instances are still on a physical server in a datacenter somewhere, and they can, and sometimes do, fail. In that case, as in every other robust production deployment, your application (hopefully) performs an automatic and graceful failover to its standbys. The location of the standbys is usually a configuration value. Not seeing any unique value proposition here for "the cloud".
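To make that concrete, here's a minimal sketch of the failover-to-standbys logic (hostnames and the `probe` health check are hypothetical placeholders; real deployments would use their stack's own health checks):

```python
# Minimal failover sketch: try the primary, then each standby in the
# order listed in config. `probe` is whatever health check fits your
# stack (TCP connect, HTTP /healthz, etc.). Names are made up.
from typing import Callable, Sequence

def pick_endpoint(endpoints: Sequence[str],
                  probe: Callable[[str], bool]) -> str:
    """Return the first healthy endpoint; raise if all are down."""
    for ep in endpoints:
        if probe(ep):
            return ep
    raise RuntimeError("all endpoints down -- page a human")

# The standby list is just configuration; nothing here cares whether
# these are EC2 instances or boxes in your own rack.
CONFIG = ["db-primary.internal",
          "db-standby-a.internal",
          "db-standby-b.internal"]
```

The point being: this code is identical whether `CONFIG` names EC2 instances or machines in your own cage.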
The point is that even when you're using EC2, you still have to set all of that up. Contrary to popular belief, EC2 is not a panacea that can magically make your software reliable and redundant. It's just a nice interface that makes it easy to rent servers from Amazon.
The only benefit you get from EC2 is that someone paid by Amazon has to go pull the box, but your company could hire such a guy in-house for _much_ less than it's paying Amazon.
The onus is still on the developers to figure out all of the application stuff that's necessary to accommodate failover and make sure that everything plays nice with each other, and getting that working right is by far the most time-consuming part of deploying a high-availability application.
So EC2 doesn't add any extra resilience; it's just outsourcing the job of pulling a server to an Amazon employee/contractor instead of a YourEmployer employee/contractor. If your company is big enough (and at Amazon's prices, you don't have to be very big at all to be "big enough"), paying that premium doesn't make sense.
I know EC2 et al are popular because people like buzzwords, but that doesn't make it good business (or does it? Investors love cloud because it keeps capex low, and because investors are buzzword-driven like everyone else; saying "cloud" will make them like you more and want to give you more money).
For companies that are still in the garage (literally in the garage), shelling out $20/mo for a couple of cheap VPSes from something like DigitalOcean is going to be just fine. But once you get bigger than that, there's no way to avoid paying attention to this stuff, even if paying Amazon tons of money creates a false psychological connection that makes you think they're doing the work for you.
>The answer is in the question -- when rented things fall down and go boom™, your code runs and someone gets a text message with the receipt.
Let me fix that for you: when things fall down and go boom, if your code is written and your deployment is configured to support it, your product continues to work, and someone, somewhere, has to get a broom and sweep up some ashes.
Whether or not cloud is a reasonable proposition is primarily a question of whether it makes more sense for that someone who sweeps up the ashes to be on the corporate payroll of YourEmployer or YourCloudProvider.
>I've only managed data storage at the scale of many petabytes (and this was a handful of years ago), and honestly, I think it required at least 20 hours a week of babysitting by various staff. At Snap's scale and traffic patterns (viral content, lots of writes, and so on), I imagine this is a very non-trivial spend on scaling, staffing, and tech implementation.
EC2 is not a silver bullet. It's just an interface to allow you to rent servers from Amazon. EC2 users still have to babysit stuff, just not the hardware (though they still have to monitor resource usage, clean up disk space, and be prepared for things to blink offline with 0 notice -- again, all the normal things; only difference is that your hardware jockey is accessed through EC2's web support interface instead of Slack/cell).
>At 2bb over 5 years, maybe Snap would benefit from rolling their own -- hiring 50 great hackers at a mildly conservative 250k/head (say 200k average + benefits + taxes + employee support costs (HR, payroll, recruiting, legal, etc))
Vastly overallocating here.
>Hell, maybe they'd even open source some software and recruiting would get easier after conference talks of how they did it.
Unnecessary, there's already tons of great open-source software to handle HA deployments (usually, this is the software underneath the commercial UI that makes everything work; it's surprising how much "revolutionary" commercial software is just glue code and a point-and-click wrapping around an OSS workhorse).
Of course, once you get to unicorn scale, everything has to go custom and/or highly modified because no out-of-the-box solution can handle the load, and that will be the case whether their hardware is hosted by Google or not. Again, "cloud" does very little to relieve the workload for all non-hardware employees.
And the added benefit of being a trendy tech company is that after your company creates some extremely specialized solution, you can open-source it and watch with an uncomfortable mix of amusement and horror as 90%+ of other companies' tech departments contort themselves into pathetic, desperate architecture pretzels so that they can become cool by abandoning a stable, proven, mature stack for your company's experimental, sputtering, duct-taped abomination that requires a PhD to even get to compile.
This pattern has become so commonplace that reciting any specific example feels trite. You can probably name 12 off the top of your head. Hadoop in particular is a victim of many gross offenses of this type.
>Snap's in the business of selling ads and getting more eyes on those ads. Whatever enables growth and doesn't serve as a distraction or speedbump is a "fine" decision.
Sure, but they don't have to set massive gobs of money on fire for no reason along the way. But then, I guess they wouldn't be part of the Silicon Valley family if they didn't.
Snap is using App Engine, which transparently manages scale, availability, resiliency, deployment, and so forth. It's a higher level of service than EC2. Thus many of the valid concerns you describe do not apply to Snap, or are at least minimized.
> First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment.
The parent didn't claim they don't happen, just that (1) they're rare (a point you agree with, given the minimum usage needed to notice them) and (2) multi-AZ, multi-region failures are nearly non-existent.
> The point is that even when you're using EC2, you still have to set all of that up.
It takes literally minutes to set up an ELB and Autoscaling group across five availability zones. How long does the non-cloud version of that take?
> First, if you don't notice some random/unexpected EC2 instance failures, you don't have a big EC2 deployment. ...Not seeing any unique value proposition here for "the cloud".
Because when something fails, you don't have to care about the "why" as long as you can replace it. I see about 4 instances needing maintenance per month per 1000. That's reasonable enough not to demand someone be full-time focused on making sure that only the good lights blink on the hardware.
> The point is that even when you're using EC2, you still have to set all of that up. Contrary to popular belief, EC2 is not a panacea that can magically make your software reliable and redundant.
You're making a strawman by suggesting people think it's a panacea. The advantage is that a lot of the work, maintenance, and feature improvements for 'infrastructure as code' are handled for you. Cloud hosting means writing the software layer and being done -- no managing the infrastructure services, facilities, hardware, or business relationships involved in rack-and-stack.
> It's just a nice interface that makes it easy to rent servers from Amazon.
To be fair, it's a _very_ nice interface.
> I know EC2 et al are popular because people like buzzwords, but that doesn't make it good business (or does it? Investors love cloud because it keeps capex low, and because investors are buzzword-driven like everyone else; saying "cloud" will make them like you more and want to give you more money).
If you think cloud hosting is popular because of opex or buzzwords, I think you're out of touch. EC2 and Google Cloud are popular because they let you focus on getting shit done, even when you have variable workloads that are uptime-dependent.
> For companies that are still in the garage (literally in the garage), shelling out $20/mo for a couple of cheap VPSes from something like DigitalOcean is going to be just fine. But once you get bigger than that, there's no way to avoid paying attention to this stuff, even if paying Amazon tons of money creates a false psychological connection that makes you think they're doing the work for you.
They _are_ doing a lot of work for you. You suggest $20/mo of VPSes is all a small company needs before self-hosting makes sense. I'll be charitable and round that up to $100, but even at that price, there is _no way_ you'll be able to self-host something as fault-tolerant or low-cost as a cloud-hosted solution. Do you really think that for $100 a month you can self-host geo-close servers with enough redundancy that you don't have to think about it? Keep in mind that "two is one and one is none" when planning your hardware purchase.
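To put rough numbers on "two is one and one is none" (illustrative uptimes, and assuming failures are independent, which real deployments only approximate):

```python
# "Two is one and one is none": back-of-the-envelope availability math.
# Assumes independent failures -- a shared power feed or switch breaks this.
single = 0.99                    # one box at 99% uptime
pair = 1 - (1 - single) ** 2     # two redundant boxes -> 99.99%

hours_per_year = 365 * 24
print(f"one box down:   {(1 - single) * hours_per_year:.0f} h/yr")  # 88 h
print(f"pair both down: {(1 - pair) * hours_per_year:.1f} h/yr")    # 0.9 h
```

A second box takes you from days of downtime a year to under an hour, which is exactly the redundancy a $100/mo budget struggles to buy in hardware.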
> Vastly overallocating here.
No, that's conservative for a major US city (e.g., where Snap would be doing the hiring). Have you tried to pull a handful of really good systems hackers out of thin air recently? Even if you can get them, they're not cheap, and you'd need a sizable team to pull off the highly redundant, worldwide install that Snap needs for its growth projections. It starts off expensive to hire good tech talent and gets more spendy the longer you're fishing.
And that's even ignoring the productivity costs (for that employee and others) when someone isn't happy or decides it's time to leave -- staffing also takes money and attention to maintain.
> And the added benefit of being a trendy tech company is that after your company creates some extremely specialized solution, you can open-source it and watch with an uncomfortable mix of amusement and horror as 90%+ of other companies' tech departments contort themselves into pathetic, desperate architecture pretzels so that they can become cool by abandoning a stable, proven, mature stack for your company's experimental, sputtering, duct-taped abomination that requires a PhD to even get to compile.
You seem like you're speaking from personal experience. Having a working infrastructure that isn't a barrier to growth isn't trendy or sexy, it's a base competency for any internet-reliant business model.
> Sure, but they don't have to set massive gobs of money on fire for no reason along the way. But then, I guess they wouldn't be part of the Silicon Valley family if they didn't.
This isn't setting "massive gobs of money on fire for no reason", this is going with a high-performance datacenter that someone else maintains. They clearly have something very big in mind and I doubt they made a multi-$bb commitment without asking themselves "are we lighting this money on fire?"