Hacker News
Amazon EC2 + S3 Doesn't Cut it for Real Applications (25hoursaday.com)
6 points by nickb on July 8, 2007 | hide | past | favorite | 8 comments


The title is misleading. The post explains that EC2+S3 does not provide a good solution for relational databases -- but it's not supposed to (this is a common misconception).

The author makes a good point but never discusses one of the bigger issues surrounding EC2 database instances -- network latency. The architecture of EC2 is such that multiple EC2 instances are not guaranteed to be on the same local network, meaning that you might end up with a database server in Boston and your Apache server in Seattle. This gets really inefficient if your back-end code relies on multiple DB queries.
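The latency cost compounds per query, which a back-of-envelope calculation makes concrete. The numbers below are assumptions for illustration, not measurements of EC2:

```python
# Back-of-envelope: cost of cross-country round trips per page render.
# Both numbers are assumptions, not measurements.
rtt_ms = 80        # assumed Boston <-> Seattle round-trip time, in ms
queries = 20       # DB queries issued while rendering one page

latency_ms = queries * rtt_ms
print(latency_ms, "ms spent waiting on the network alone")
```

With those assumptions, a single page render burns 1.6 seconds on network round trips before the database does any work at all.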

However, EC2+S3 are great at what they're designed for -- providing SUPPLEMENTAL computing and storage resources. This is ideal for web applications which need to process file data in the background. Amazon appears to be working on a service which would be the database analog of S3. Only after that's released would we be able to judge Amazon as a complete app-hosting solution.


So then what is the answer to scaling relational databases? Getting rid of them and writing your own storage engine?


What are you doing? For most Internet startups, PostgreSQL will be fine. All you have to do is use indexes correctly and then keep their size under the memory allocated to postgres. That can really get you a long long way.


"keep their size under the memory allocated to postgres."

Assuming that you can do that, explain to me how this doesn't totally wipe away the scalability/performance argument -- if everything has to fit in memory anyway, what do you gain from a slow SQL database?

Well, yes, I guess it depends what you're doing. A speed increase of ~100x and increased simplicity made the crucial difference in my decision to drop the database for most of its traditional uses in my app.


I don't really understand your comment, but I'll just try to restate it more clearly, and then maybe that will answer it anyway. If anyone has a specific situation, I'd be happy to speculate on it.

By keeping the indexes under the memory allocated to postgres, I mean the total memory that can be set aside for the DB on the machine. If you look at the pg data files, you will find the indexes are just large flat files, like BerkeleyDB or whatever homegrown file system thing you want to create.

So you do not need to increase the shared buffers to some crazy amount. In fact, you will have a negative performance result if you do that. The kernel will automatically cache these files upon repeated use. So if you have a machine with 4GB of memory and you can allocate 3GB to the db, then these indexes will just remain cached in memory forever. And you can easily get relatively cheap machines now with 16GB and higher, so this gets you really far.
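To check whether your indexes actually fit in that budget, PostgreSQL (8.1 and later) ships size-inspection functions. The table and index names below are hypothetical; substitute your own schema:

```sql
-- Size of one index (name is hypothetical):
SELECT pg_size_pretty(pg_relation_size('users_email_idx'));

-- Total size of all indexes on a table = total relation size minus the heap:
SELECT pg_size_pretty(pg_total_relation_size('users')
                      - pg_relation_size('users'));
```

If the combined index sizes stay well under the memory you can dedicate to the box, the kernel page cache will keep them resident as described above.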

If you only run sql queries that are indexed appropriately and the indexes are in memory, then I assure you postgres will not be the bottleneck in your app. This is not hard to do; it just takes a little postgresql.conf tweaking and data model forethought. In particular, you turn off sequential scans, decrease the index tuple cost, and run EXPLAIN on all your queries to ensure that the indexes are always being used. As for the data model, use the smallest data types for indexes, and numeric types whenever possible. That will keep the size down to a minimum. PostgreSQL takes care of the rest.
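A sketch of those tweaks, using the session-level equivalents of the postgresql.conf settings (the GUC names are real; the values and the table name are illustrative, not recommendations):

```sql
-- Session-level equivalents of the postgresql.conf tweaks described above.
SET enable_seqscan = off;          -- discourage sequential scans
SET cpu_index_tuple_cost = 0.001;  -- lower the planner's index tuple cost

-- Then verify the plan on each of your queries (table name is hypothetical):
EXPLAIN SELECT * FROM users WHERE user_id = 42;
-- You want to see "Index Scan using ..." rather than "Seq Scan on ..."
```

Setting these in postgresql.conf applies them server-wide; setting them per-session is handy while you are still verifying plans.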

Again, it depends on what you are doing, but this configuration will usually tie or be significantly better than some home grown file system thing when you get some scale because, like I said, postgres primarily uses the file system memory cache as its memory cache. But if you have some file system process that is accessing a ton of different files you will get an I/O bottleneck just from the lookups alone. And if you use a big flat file, you might as well stick with postgres because of its reliability, security, concurrency, network and backup features.

That being said, I still wouldn't store big static blobs in the db. I would just put them on another machine and let apache serve them up statically and put references to them in the db (indexed by some number id). That way you get the best of both worlds.


I'll answer my own question:

http://www.slideshare.net/Blaine/scaling-twitter

cache, partition, denormalize.
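The "cache" step is the classic cache-aside pattern. A minimal sketch, with a dict standing in for memcached and `fetch_from_db` standing in for a real SQL query (all names here are hypothetical):

```python
# Cache-aside sketch: check the cache first, fall back to the database,
# and fill the cache on a miss. A dict stands in for memcached.
cache = {}

def fetch_from_db(user_id):
    # Placeholder for a real database lookup.
    return {"id": user_id, "name": "user%d" % user_id}

def get_user(user_id):
    """Return a user row, consulting the cache before the database."""
    if user_id in cache:
        return cache[user_id]
    row = fetch_from_db(user_id)
    cache[user_id] = row
    return row
```

Partitioning and denormalization then reduce the cost of the misses that do reach the database.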


The strongest accusation I can find in this article is that "it still isn't fool proof".

To what degree do 'real applications' have to be foolproof?

@joshwa: Thanks for pointing out the comments--I'm glad I read them, particularly this portion:

And even if the server software were to blow a gasket and call it quits w/o any warning and w/o rebooting itself, and from a more realistic viewpoint, even if the hardware were to fail, that still leaves the remaining slave servers who are at least within a handful of bytes of being accurate at the point the master server (if, in fact, it was the master server that gave up the ghost) called it quits. In this regard I fail to see a "this isn't capable of handling real-world applications" and instead "it's possible that something bad could happen, so prepare for the worst yet hope for the best and know that, within reason, it's far more likely that the best scenario (i.e. your data layer remains persistent as it should) will be what actually takes place 99.9999999999% of the time."

Yeah, so as someone using EC2 + S3 for our web enterprise, I am not really worried.


good stuff in the comments there...



