
> Correlated failures are common in drives.

This is why, when I was building DIY arrays for startups (around the same time Backblaze published their first pod design [1]), I went through the extra effort of sourcing disks from as many different vendors as possible.

Although it was somewhat more time-consuming and limited both how good a price I could get and how fast delivery could be, it meant that, for any given disk drive size, I could build an array as large as 12 drives in which no two were identical in model and manufacturing batch [2].

Of course, it's still a vanishingly rare risk, and "nobody" cares about hardware any more. It does help to remember, at least once in a while, that, on some level, cloud computing really is "someone else's servers" and to hope that someone else still maintains this expertise.

[1] though I used SuperMicro SAS expander backplane chassis for performance reasons

[2] and firmware from the factory, although this is somewhat irrelevant, as one can explicitly load specific firmware versions, and, IIRC, the advantages of consistent firmware across drives, behind a hardware RAID card, outweighed the disadvantages



I've been doing a poor man's version of this for home videos using 3 hard drives and rsync. It's easy to replace a drive and they are not likely to go out at the same time. But one thing that bugs me is that unless the drive fails hard (e.g. noticed by SMART or unable to read at all) how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background? Does that impact durability of the drives?
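One low-tech way to catch silent rot between replacements (a rough sketch, not a polished tool; paths and buffer sizes are just placeholders): keep a SHA-256 manifest per drive, rebuilt or re-verified on a schedule, so a later run can tell you which files no longer hash the same.

```python
import hashlib
import os

def sha256_file(path, bufsize=1 << 20):
    """Stream a file through SHA-256 without loading it all into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> digest for every file under root.

    Save the result (e.g. as JSON) next to the data; a future run that
    produces a different digest for an unchanged file indicates rot."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest
```

Comparing two manifests (or re-running against a stored one) is then a plain dict comparison; rsync's `--checksum` flag gets you something similar between pairs of drives, at the cost of reading both ends.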


> how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background?

I assume you're talking about already-written sectors becoming unreadable or a similar failure. Unfortunately, I don't think you can. This is what I believe the "patrol read" feature of RAID cards is meant to address.

Fortunately, however, I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros.
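For what it's worth, a "patrol read" can be approximated in userspace: just force every sector to be read so the drive notices (and can remap) latent errors. A crude sketch, assuming a file or block device you can open read-only; real RAID firmware does this continuously and at the raw-sector level:

```python
import os

def patrol_read(path, chunk=1 << 20):
    """Poor man's patrol read: sequentially read the whole device/file
    so the drive is forced to touch every sector; return the byte
    offsets of chunks that could not be read."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)  # works for block devices too
        pos = 0
        while pos < end:
            os.lseek(fd, pos, os.SEEK_SET)
            try:
                os.read(fd, min(chunk, end - pos))
            except OSError:
                bad_offsets.append(pos)  # unreadable region: log and move on
            pos += chunk
    finally:
        os.close(fd)
    return bad_offsets
```

A SMART long self-test (`smartctl -t long`) accomplishes much the same thing on-drive, without tying up the host's I/O path.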

> Does that impact durability of the drives?

I haven't read the studies (from Google, mostly, IIRC) in a while, and I'm not sure if they've released anything lately for more modern drives [1]. However, I believe you'll find an occasional "patrol read" won't noticeably reduce drive life/durability.

[1] Especially for something like SMR, whose tradeoffs would seem particularly attractive for something like this archival-like use case.


"I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros."

Comparison is needed to address misdirected writes and bit rot in the very least, see "An Analysis of Data Corruption in the Storage Stack" [1]. You can't count on your drive firmware or RAID firmware to get this right. You need bigger end-to-end checksums, and you need to scrub.
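With three replicas, as the grandparent has, a scrub can do better than just detecting a mismatch: a 2-of-3 majority vote tells you which copy rotted. A hedged sketch (replica roots and filenames here are illustrative, and this assumes whole-file hashing is affordable):

```python
import hashlib
from collections import Counter
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()

def scrub(replicas, relpath):
    """Compare one file across N replica roots.

    Returns the replicas whose copy disagrees with the majority digest
    (empty list if all copies agree)."""
    digests = {r: file_digest(Path(r) / relpath) for r in replicas}
    counts = Counter(digests.values())
    majority, votes = counts.most_common(1)[0]
    if votes == len(replicas):
        return []  # all replicas agree
    return [r for r, d in digests.items() if d != majority]
```

The end-to-end point stands, though: the checksum has to be computed and verified above the drive and RAID firmware, or misdirected writes can corrupt all the "redundant" copies consistently.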

[1] - http://www.cs.toronto.edu/~bianca/papers/fast08.pdf


Thanks! I had either missed that paper or had taken away more of the message that these errors are more likely to be from events like misdirected writes, cache flush problems (hence the high correlation with systems resets and not-ready-conditions), and firmware bugs (on-drive and further up the stack), rather than bit-rot.

Still:

> On average, each disk developed 0.26 checksum mismatches.

> The maximum number of mismatches observed for any single drive is 33,000.

Considering the latter can represent roughly 135 MB on a modern, 4K-sectored drive, that's a remarkable amount of data loss, enough to warrant checksumming higher up (such as in the filesystem)... in theory [1].
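The arithmetic, for anyone checking (assuming one checksum mismatch corresponds to one sector):

```python
# Worst-case drive from the FAST'08 paper: 33,000 checksum mismatches.
mismatches = 33_000
legacy = mismatches * 512    # 512-byte sectors
modern = mismatches * 4096   # 4K sectors
print(f"{legacy / 1e6:.1f} MB vs {modern / 1e6:.1f} MB")
# → 16.9 MB vs 135.2 MB
```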

However, the fact that this was NetApp using their custom hardware as the testbed makes me wonder if the data are skewed, and if the numbers would be nearly this bad from a more "commodity" setup, such as at Google. The paper alludes to this when referring to the extra hardware for the "nearline" disks, and I'm always suspicious of huge discrepancies in statistics between "enterprise" and other disks, even more so when there's a drastic difference in comparison methodology.

It would be interesting to see if there are any numbers for more modern drives, especially as the distinction between "enterprise" and "consumer" drives is disappearing, if only because demand for the latter is disappearing.

[1] In practice, an individual isn't aware of the ~17 MB/~135 MB loss risk, which is vanishingly small compared to other risks anyway, and businesses don't tend to care and have survived OK regardless.


Use ZFS, it can perform periodic integrity checks.


I've never done so outside of FreeNAS appliances, partly because I remain persuaded that offloading the RAID portion to a card is more cost-effective and higher-performance, especially on otherwise RAM- and/or IO-constrained servers, and partly because the ZFS support under Linux has, historically, been less than ideal.

Higher-level checksum failures are, however, a situation where I would appreciate integration between the filesystem and RAID, as I'd want a checksum error to mark a drive as bad, just like any other read error.

Do you happen to know if ZFS does that?



Unfortunately, no, it doesn't really say how ZFS behaves when an error is encountered.

This is super-disturbing and a dealbreaker, if it's still true:

> The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I've yet to see them work effectively), so you definitely want to run it when your system isn't busy servicing your customers.

I browsed a little of the oracle.com ZFS documentation but couldn't find much in the way of what triggers it to decide that a device is "faulted" other than being totally unreachable.


> I went through the extra effort of sourcing disks from as many different vendors as possible.

This is very good advice!

If you've already built your array, consider this advice: "replace a bad disk with a different brand whenever possible".

Over time, you naturally migrate away from the bad vendors/models/batches. Having followed this practice, it now seems ridiculous to me to keep replacing the same bad disks with the same vendor+model.


Although I wouldn't go so far as to insist on switching brands (especially since, as another commenter pointed out, there has been so much consolidation that only 3 remain), I agree that replacing with at least a different model, or, failing that, a different batch, is a best practice for an already-built homogeneous array.

Some of this can also be achieved ahead of time if one has multiple arrays with hot spares, by shuffling hot spares around, assuming there's some model diversity between the arrays but not within them.

I doubt I'll ever again have the luxury of being able to perform this kind of engineering, however. Even a minor increase in cost or cognitive/procedure complexity or a decrease in convenience just serves to encourage a "let's move everything to the cloud" reaction, so I keep my mouth shut.



