1. AWS beats all others when it comes to security [1].
2. EC2 is just one item in the package called AWS. Hence, if you build something more than just a "web app with a db" at the back-end, say a full-blown platform, then I know of no other option that gives you a fully integrated API stack for data warehousing, DNS, load balancing, auto-scaling, billing, etc.
3. Speed is sometimes over-rated. You should be speedy where it matters most. That is, how fast you can redeploy your entire cloud from scratch after a disaster should be more interesting to you than whether a webpage takes 20 more ms to reach the browser. In our case, on AWS, it is a matter of < 20 minutes.
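A "< 20 minutes from scratch" number is only achievable when every instance is described in code, so a rebuild is just a loop over launch specs. A minimal sketch of the idea, assuming boto3; the AMI ids, instance types, and counts below are made up for illustration, not our actual setup:

```python
# Scripted redeploy: each entry describes one fleet to recreate.
def launch_spec(ami_id, instance_type, count):
    """Build the keyword arguments for ec2.run_instances."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": count,
        "MaxCount": count,
    }

if __name__ == "__main__":
    import boto3  # requires boto3 and AWS credentials
    ec2 = boto3.client("ec2", region_name="us-east-1")
    for spec in [launch_spec("ami-12345678", "m1.large", 4),
                 launch_spec("ami-87654321", "c1.medium", 8)]:
        ec2.run_instances(**spec)  # rebuild the fleet from prebaked AMIs
```

The point is less the SDK calls than the discipline: if this script (plus DNS and load-balancer config) is all there is, disaster recovery is bounded by boot time, not by humans.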
1. Agreed. IAM is really, really, nice, and I find myself missing it greatly when I'm on other platforms.
2. We use quite a few more things than EC2 for our systems.
- SQS eliminates the need to build / manage a queuing system.
- DynamoDB/SimpleDB eliminate the need to build / manage a distributed data store.
- OpsWorks eliminates the need for a DevOps team (mostly).
- ELB eliminates the need to build / manage a load balancer.
- SES seamlessly takes care of out-bound mail.
- Direct Connect gives us a way to extend our DC tools into the "cloud".
- And to top it all off, I can bring up any/all of these services at a moment's notice, run some experiments, and then shut them down when I'm done.
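That spin-up/tear-down workflow is a few SDK calls end to end. A rough sketch using SQS and boto3 (my assumption; any AWS SDK looks much the same), with a throwaway queue name and payloads invented for the example:

```python
# "Bring it up, experiment, shut it down" with an SQS queue.
def experiment_messages(n):
    """Generate some throwaway payloads for a queue experiment."""
    return ["job-%d" % i for i in range(n)]

if __name__ == "__main__":
    import boto3  # requires boto3 and AWS credentials
    sqs = boto3.client("sqs", region_name="us-east-1")
    url = sqs.create_queue(QueueName="scratch-experiment")["QueueUrl"]
    for body in experiment_messages(10):
        sqs.send_message(QueueUrl=url, MessageBody=body)
    # ... run the experiment against the queue ...
    sqs.delete_queue(QueueUrl=url)  # nothing left to pay for afterwards
```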
I don't think there are very many vendors that can help us do these things with this much flexibility. Yeah, AWS can be expensive, but we feel like it's worth it.
> 3. Speed is sometimes over-rated. You should be speedy where it matters most. That is, how fast you can redeploy your entire cloud from scratch after a disaster should be more interesting to you than whether a webpage takes 20 more ms to reach the browser. In our case, on AWS, it is a matter of < 20 minutes.
Personally, I'd rather have 20ms shaved off my users' time than a 20-minute disaster recovery time. Disasters happen perhaps once a year (on AWS; possibly less often on dedicated servers), but people are loading pages every day.
That perhaps depends on the type of your customers/users, and their priorities.
If you have "users", then you might be right, as no harm will be done if, once every few years, their free service is shut down for 18 hours.
However, if your customers are running core and critical parts of their business on your system, this part becomes a significant factor in the equation.
My nontechnical boss doesn't know about the 20ms difference (she just thinks her computer is slow?), but an outage is as visible as the difference between day and night.
At 2s difference it's fair to say that a user is going to notice. Maybe even at 200ms. But at 20ms? I'm not sure that counts.
I know it's just a fabricated number, but the point is that the server you're running on won't make a difference to the user experience. And, given the kind of differences we're talking about between these machines, a user would probably never notice.
20ms could mean the difference between a single digit rank in the App Store and a three digit rank. I see it every day - if my systems drop by 50ms average, I see a drop in rank.
And 20ms? Try more like a 300ms+ difference. Hell, sometimes a full second or more for some sites. Anyone who says total disaster recovery time is more important than total latency isn't running anything remotely at scale.
Yes and yes, we're using a CDN. Neither of those change that EC2 instances are slow and have unreliable performance, and the network is abstracted to the point where you can't effectively control packet flow.
EC2 will not become faster. However, at least for web traffic, CPU and IO are not the only factors.
Network is, in fact, a major player in latency, and by being globally distributed, configuring Route53 appropriately, and integrating the CloudFront CDN, a given web app gets a boost that I doubt a faster computer can beat.
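"Configuring Route53 appropriately" here mostly means latency-based routing: one record set per region, and Route53 answers each resolver with the lowest-latency endpoint. A sketch of the record shape (the zone id, domain, and IPs are hypothetical; the dict layout follows Route53's change_resource_record_sets API):

```python
# One latency-based A record per AWS region serving the same name.
def latency_record(name, region, ip, set_id):
    """Build one UPSERT change for a latency-routed A record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Region": region,  # Route53 picks the lowest-latency match
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    }

if __name__ == "__main__":
    import boto3  # requires boto3 and AWS credentials
    r53 = boto3.client("route53")
    r53.change_resource_record_sets(
        HostedZoneId="Z1EXAMPLE",  # hypothetical zone id
        ChangeBatch={"Changes": [
            latency_record("www.example.com.", "us-east-1", "203.0.113.10", "use1"),
            latency_record("www.example.com.", "eu-west-1", "203.0.113.20", "euw1"),
        ]},
    )
```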
We experimented with EC2 in the early stages to use as a possible load failover, and no matter what we did, we could never get better than 150ms ping times even between internal zones.
The EC2 network also has mysterious packet filtering on it that prevented IPSec tunnels from working correctly.
I've managed to leverage CDN and globally distributed servers for a fraction of the cost of Amazon services just fine, and I have the added benefit of 100% full control of all aspects of it, including the network.
A good analogy is that EC2 is like an interpreted language versus compiled - you can get a lot done and it's easier to get started, but if you're really serious about performance, you need to program in C.
This is not entirely correct. AWS does offer AWS GovCloud, which provides an environment that enables agencies to comply with HIPAA regulations [1]. You have to be a US government organization to use it, though.
Update: AWS also has a whitepaper on Creating HIPAA-Compliant Medical Data Applications with AWS [2]. It looks like this is supported on the standard, non-GovCloud stack.
The trouble with HIPAA requirements is that they're not clearly defined and are open to a variety of interpretations.
Our experts advise a safe, CYA approach and mandate that a BAA (Business Associate Agreement) be in place with every partner touching sensitive patient data, even if it is encrypted and protected on multiple levels. Thus far Amazon has not been accommodating of such a request.
Others have their own opinions, and in the end we all weigh the risks vs. rewards (including Amazon itself; I'm sure they have plenty of reasons for operating in their present gray area).
I worked for a major hospital once, and they were all about the CYA agreements. The funny thing is that HIPAA is more a state of mind than a 100-point punch list. So you're really just practicing CYA more than anything else.
I don't believe you need to be a US government organization to use the GovCloud region. I think you just have to be a US corporation or person and pay through the nose. It's only available directly via signing an actual contract, not a la carte like normal AWS services.
As of March 2013 (two years past those publish dates), Amazon has still not agreed to the legal "Business Associate Agreement" provisions of HIPAA that would permit you to use their services to store Protected Health Information. They said they are considering it, but this has been the status for quite some time. Rackspace, on the other hand, has agreed (for a surcharge).
According to Amazon, their employees are not allowed to access your data, so you don't need to sign a business associate agreement with them to be HIPAA compliant. I imagine this is similar to how sending patient information through the post office is not considered a disclosure to the post office.
Actually, the HMO I worked for did. Every vendor, such as ISPs, colos, and some API suppliers, had to sign the CYA agreement. Most of them were aghast when asked to sign. Basically, they have to take on all of the liability. I've never seen it have to be exercised, however.
Do you sign business associate agreements with your colo facility, ISP, and landlord? They also are physically capable of accessing your data, even though they are legally or contractually forbidden from doing so.
The orgs that I have worked with draw the line somewhere between colo and ISP: anyone with potential access to unencrypted network traffic, or who is operating equipment containing affected data. Usually the lawyers can agree to contractual terms for the landlord without a BAA.
I'm not arguing that it makes sense, just that it happens.
And there are plenty of health care companies that evaluate AWS and decide they don't need a BAA, due to the way the system is constructed. This is a 'your legal team' issue, not a global issue (ie: it's an issue, but not a blanket problem for everybody).
It's not just a question of speed. If your machines are slow, that means you need more machines to handle your throughput, which means you are paying for that 20 ms slowdown in actual dollars.
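The dollars fall out of Little's law: concurrent in-flight requests = arrival rate x time per request, so a slower machine carries proportionally fewer requests and you buy more of them. A back-of-envelope sketch with made-up numbers (the rates, latencies, and per-server concurrency are illustrative, not anyone's measurements):

```python
import math

def servers_needed(req_per_sec, service_time_s, concurrency_per_server):
    """Capacity estimate via Little's law: L = lambda * W."""
    in_flight = req_per_sec * service_time_s  # concurrent requests in the system
    return math.ceil(in_flight / concurrency_per_server)

fast = servers_needed(5000, 0.080, 20)  # 80 ms per request -> 20 servers
slow = servers_needed(5000, 0.100, 20)  # same load, +20 ms   -> 25 servers
# the 20 ms slowdown shows up directly as extra servers on the bill
```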
Sure, but the whole thing is predicated on the 20ms slowdown coming from a slow machine, not network latency. And that's a pretty good assumption. Due to RAM limitations and abysmal performance, I could maybe push 15 concurrent requests on a c1.medium running a Rails app in Passenger with a non-CoW Ruby. Forking is terribly slow on EC2. An m1.small was out of the question.
I'm working on a web service (build on top of Scala and the JVM) that's handling between 1500 and 3000 reqs per second per c1.medium instance, with an average time per request of under 15ms. This is real traffic, with the web service receiving between 16,000 and 30,000 total requests per second during the day. A c1.xlarge can do 7000 reqs per second or even more, but for the moment I felt like the difference in pricing is too big and it's cheaper and safer just starting more c1.medium instances (with auto-scaling based on latency), but in case we'll need more RAM, then we'll probably switch to c1.xlarge.
If scalability matters, you should have picked a better platform. Ruby/Rails/Passenger is a terrible platform for scalability / performance. And even if AWS is slower than other solutions, the first problem you have is your own heavy-weight app and the platform you've chosen. 15 concurrent requests per second makes me chuckle.
I just wanted to add, since you're not the first to point out the Rails part, that I've also run a 42-node Cassandra cluster on m1.xlarges and did a fair bit of CPU-bound operations (encryption and compression) on hundreds of TB of data on cc2.8xlarges. I just used the Rails one as an example.
In the case of Cassandra, disk I/O was a constant issue. So, we grew the cluster much larger than would be necessary on another provider. We also lost instances pretty regularly. If we were lucky, Amazon would notify us about degraded hardware, but usually the instance would stay up but do things like drop 20% of its packets. Replacing a node in Cassandra is easy enough, but you quickly learn how much their I/O levels impact network performance as well. Nowadays Cassandra has the ability to compress data to reduce network load, but you then run into EC2's fairly low CPU performance.
The CPU-bound application I mentioned wasn't so bad, but we paid heftily for that ($2.40 / hour - some volume discount). At the high end the hardware tends not to be over-subscribed.
Performance, price, and reliability were all issues in all cases. Those are not EC2's strong suits and haven't been for a while.
I don't entirely disagree. All I can say is REE and Rails 2.3 were far lighter weight and faster than Ruby 1.9 and Rails 3.2. Given it's a 3.5 year old app, the landscape was pretty different back then. I looked at Lift and didn't like it. Django was still in a weird place. And ultimately Rails looked like the best option for a variety of reasons.
Things evolve and whole hog rewrites are difficult. Nowadays we run in JRuby and things are quite a bit better. But we can't run on anything smaller than an m1.large. The low I/O and meager RAM in a c1.medium preclude its use. (BTW, that's where a lot of the original 15 came from -- with a process using 100 MB RAM and only 1.7 GB available, it's hard to squeeze much more out of that).
But the larger point is with virtually any other provider you can pick a configuration that matches the needs of your app (rather than the other way around), don't have to fight with CPU steal, don't have to fight with over-subscribed hardware, and don't have to deal with machine configurations from 2006. Yeah, Rails is never going to outperform your Scala web service. But if the app would run just fine on the other N - 1 providers, then it's disingenuous to gloss over the execution environment as well.
Run it on top of JDK 7 and use the CMS garbage collector, as JRuby (and Scala) tend to generate a lot of short-term garbage and experiment with the new generation proportion (something like -XX:+UseConcMarkSweepGC -XX:NewRatio=1 -XX:MaxGCPauseMillis=850). You can also profile memory usage (make sure you're not stressing the GC, as that can steal away CPU resources) and for that I believe you can use Java profilers (like YourKit which is pretty good).
Also, try to do more stuff async: in another thread, process, or server. Use caching where it's easy, but don't overdo it, as dealing with complex cache invalidation policies is a PITA.
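The async-plus-easy-caching advice in miniature: hand slow side work to a thread pool and memoize a pure, rarely-changing lookup. The work functions below are stand-ins I made up, not anyone's real code:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

pool = ThreadPoolExecutor(max_workers=4)

@lru_cache(maxsize=1024)
def lookup(key):
    # pretend this is an expensive, rarely-changing computation
    return key * 2

def handle_request(key):
    # kick slow side work (logging, emails, stats) off the request path,
    # answer from the cache immediately
    future = pool.submit(lambda: key + 1)
    return lookup(key), future

value, fut = handle_request(21)  # value == 42; repeat calls hit the cache
fut.result()                     # side work finished off-thread
```

Because lru_cache only suits pure functions with hashable arguments, the "don't overdo it" caveat applies: anything that needs invalidation deserves more thought than a decorator.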
That's one way to look at it. Another is when this app started 3.5 years ago, Rails & the app had a drastically different performance profile and Amazon didn't have super-over-subscribed hardware. Not that it matters much, but there's nothing convenient about having to engineer around EC2. And doubling your capacity or constantly upgrading instance sizes is not cheap, nor a scalable solution in any practical sense.
Pick your language though. With terrible forking performance, any process-based execution environment is going to have similar issues. And I found running a servlet container on anything smaller than an m1.large to be an utter waste. 1.7 GB RAM isn't enough for many JVM-based apps and threading could easily overwhelm the system. Anything less than high I/O capacity just can't keep up.
In regards to speed, I think one place EC2/EBS fails (I'm speaking based on other people's experiences) is consistency. A predictable 200ms is better than a 100ms average that's all over the board, and from what I've read, a lot of AWS services are exactly that erratic. This makes infrastructure very hard to provision and predict.
1: https://aws.amazon.com/compliance/ and https://aws.amazon.com/security/