RAIDing together multiple EBS volumes feels like a massive hack to me. I can't h...

andrewvc · on March 18, 2011

One other huge downside of raiding EBS volumes is you can't use EBS's snapshotting features as you cannot guarantee a perfect sync (you could use LVM yourself however).

Honestly, since EBS vols are supposedly not tied to a single disk, the raiding should be done on Amazon's end. That it isn't is telling.

saurik · on March 18, 2011

You have to snapshot at the system level anyway if you want a consistent snapshot: otherwise the filesystem (or your database) could have been reordering and delaying writes that end up not being part of the "consistent snapshot". This is simply not a RAID-specific issue, nor is it a problem with EBS (as it is generally easy to use LVM, xfs, and/or PostgreSQL to handle that part of the job).

agmiklas · on March 19, 2011

This is something I've never quite understood. Best practice guides say you need to do a "flush all tables" in MySQL and then do a filesystem freeze (possible in XFS) before you can use a snapshot system like the ones built into EBS or LVM. If you don't, you apparently stand a good chance of getting an inconsistent snapshot, even if the snapshotting mechanism itself is (like EBS and LVM) "point in time" consistent.

Why is all this necessary? If the parts of the system (i.e. DB + FS + block device) are all working as they should, then once a commit returns, the data should be on disk. If it's not, you have no guarantee that data you thought was committed will still be there after a kernel panic or power outage.

In that case, no amount of xfs_freeze or table flushing during a snapshot is going to save you from the fact that your DB is one kernel panic away from losing what the rest of your system believed were committed transactions.
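
To make the "once a commit returns, the data should be on disk" point concrete, here is a minimal sketch of a commit-style append that does not return until fsync() has succeeded. The function name `durable_append` and the file name `commit.log` are hypothetical, chosen only for illustration. It assumes the filesystem and block device honor fsync, which is exactly the assumption being questioned above.

```python
import os
import tempfile


def durable_append(path, record):
    """Append one record and return only after fsync() has succeeded."""
    # O_APPEND keeps every write at the end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(record)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push this file's data to the storage device.
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
    durable_append(log_path, b"txn 1 committed\n")
    with open(log_path, "rb") as f:
        print(f.read())
```

A writer that skips the os.fsync call can return while the data is still only in the page cache, which is the situation where a block-level snapshot (or a kernel panic) may miss it.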

saurik · on March 19, 2011

In the specific case of a database server that actually has correct fsync semantics that the user has not disabled for some crazy performance reason, you are correct. However, there are many use cases that people want consistent snapshots across, like "apt-get install", that do not use a write barrier for every atomic-feeling operation.

(In fact, with a good database solution, like PostgreSQL, the RAID issue of the parent post is also solved: put your write-ahead or checkpoint logs on a single device, as its linear writes will easily swamp network I/O on an EBS, and use RAID only for backend storage, where you need random I/O.)

parasubvert · on March 19, 2011

This is one reason why Oracle is still the gold standard. When entering hot backup mode, which is what you do during a snapshot, it logs the FULL BLOCKS that are changed. Failures and inconsistencies can be replayed from the archive logs.

Of course this means you can quickly blow out your log archival, so it's meant to be a transitory mode.

saurik · on March 19, 2011

PostgreSQL has this exact same feature.

andrewvc · on March 18, 2011

True. However, for some cases where you don't mind losing some data due to a recovery process, EBS snapshots are 'good enough'. Additionally, with a database like CouchDB with a 'crash only' design, it should work for some cases as well.

enjo · on March 18, 2011

We use EBS snapshots as a last-resort backup. They're really convenient that way. We have a more robust backup system, but in the unlikely event that something goes wrong at least we have those snapshots, even if they're not perfect.

WALoeIII · on March 18, 2011

xfs_freeze

In fact there is a handy package called ec2-consistent-snapshot (https://launchpad.net/ec2-consistent-snapshot) that will manage this for you!
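
For anyone curious what the freeze, snapshot, thaw sequence looks like, here is a rough sketch. It is not how ec2-consistent-snapshot is implemented. `xfs_freeze -f` and `xfs_freeze -u` are the real freeze and unfreeze invocations; `frozen_filesystem`, `consistent_snapshot`, and the `take_snapshot` callable are hypothetical names, and the actual snapshot request is left to the caller. Database-level quiescing (for example, holding a MySQL read lock on an open connection) is assumed to be handled separately.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def frozen_filesystem(mountpoint, run=subprocess.check_call):
    """Keep an XFS mount frozen for the duration of the with-block."""
    # Freeze: flush dirty data and block new writes to the filesystem.
    run(["xfs_freeze", "-f", mountpoint])
    try:
        yield
    finally:
        # Always thaw, even if the snapshot request raised an exception.
        run(["xfs_freeze", "-u", mountpoint])


def consistent_snapshot(mountpoint, take_snapshot, run=subprocess.check_call):
    """Call take_snapshot() while the filesystem is frozen; return its result."""
    with frozen_filesystem(mountpoint, run):
        return take_snapshot()
```

The `run` parameter exists only so the sequence can be exercised without root or a real XFS mount. The try/finally is the important design choice: a filesystem left frozen after a failed snapshot call will hang every writer.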

bluegene · on March 18, 2011

Maybe I'm missing something here: why is there even a discussion about RAID at the EBS level? When Amazon says "Amazon EBS volumes are designed to be highly available and reliable" and we still have to talk about RAID, then the issue is on Amazon's end.

anaisbetts · on March 18, 2011

I think most people are doing RAID-0 to get more perf out of EBS volumes

gregburek · on March 18, 2011

It also seems that in 2008 adding mirroring also hurt performance. I'm going to dive into this tonight to see if things have changed at all with these benchmarks.

"His results show a single drive maxing out at just under 65MB/s, RAID 0 hitting the ceiling at 110MB/s, RAID 5 maxxing out about 60MB/s, and RAID 10 “F2” at under 55MB/s."

Summary source: http://www.nevdull.com/2008/08/24/why-raid-10-doesnt-help-on...

Data source (google cache): http://webcache.googleusercontent.com/search?q=cache:Vscz-VX...

bmurphy · on March 18, 2011

Yes. Except, anybody who is doing RAID-0 over an EBS volume for perf reasons is ASKING for trouble.

You need to do RAID-10. EBS volumes CAN and DO fail.
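
A back-of-the-envelope sketch of why striping alone is risky. It assumes each volume fails independently with the same probability p over some period and ignores rebuild windows; both are simplifications, and real EBS failures may well be correlated. The function names and any value of p you plug in are hypothetical, not figures from Amazon.

```python
def raid0_loss_probability(p, volumes):
    """RAID-0: the array is lost if ANY one of its volumes fails."""
    return 1 - (1 - p) ** volumes


def raid10_loss_probability(p, volumes):
    """RAID-10: the array is lost only if BOTH volumes of some mirrored pair fail."""
    if volumes < 2 or volumes % 2 != 0:
        raise ValueError("RAID-10 needs an even number of volumes (at least 2)")
    pairs = volumes // 2
    # A mirrored pair is lost with probability p * p under the independence assumption.
    return 1 - (1 - p * p) ** pairs


if __name__ == "__main__":
    p = 0.01  # hypothetical per-volume failure probability, for illustration only
    for n in (2, 4, 8):
        print(n, raid0_loss_probability(p, n), raid10_loss_probability(p, n))
```

Under these assumptions the RAID-0 loss probability grows roughly linearly with the number of volumes, while RAID-10 stays on the order of p squared per mirrored pair.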

acdha · on March 18, 2011

I wish I had more than one upvote for this: swimming against a trend like that never works out well.