
> Correlated failures are common in drives.

This is why, when I was building DIY arrays for startups (around the same time Backblaze published their first pod design [1]), I went through the extra effort of sourcing disks from as many different vendors as possible.

Although it was somewhat more time-consuming and limited both how good a price I could get and how fast delivery could be, it meant that, for any given disk drive size, I could build an array as large as 12 drives in which no two were identical in model and manufacturing batch [2].

Of course, it's still a vanishingly rare risk, and "nobody" cares about hardware any more. It does help to remember, at least once in a while, that, on some level, cloud computing really is "someone else's servers" and to hope that someone else still maintains this expertise.

[1] though I used SuperMicro SAS expander backplane chassis for performance reasons

[2] and firmware from the factory, although this is somewhat irrelevant, as one can explicitly load specific firmware versions, and, IIRC, the advantages of consistent firmware across drives, behind a hardware RAID card, outweighed the disadvantages



I've been doing a poor man's version of this for home videos using 3 hard drives and rsync. It's easy to replace a drive and they are not likely to go out at the same time. But one thing that bugs me is that unless the drive fails hard (e.g. noticed by SMART or unable to read at all) how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background? Does that impact durability of the drives?
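One low-tech way to catch silent rot between replacements (a rough sketch, not a polished tool; paths and buffer sizes are just placeholders): keep a SHA-256 manifest per drive, rebuilt or re-verified on a schedule, so a later run can tell you which files no longer hash the same.

```python
import hashlib
import os

def sha256_file(path, bufsize=1 << 20):
    """Stream a file through SHA-256 without loading it all into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> digest for every file under root.

    Save the result (e.g. as JSON) next to the data; a future run that
    produces a different digest for an unchanged file indicates rot."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest
```

Comparing two manifests (or re-running against a stored one) is then a plain dict comparison; rsync's `--checksum` flag gets you something similar between pairs of drives, at the cost of reading both ends.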


> how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background?

I assume you're talking about already-written sectors becoming unreadable or a similar failure. Unfortunately, I don't think you can. This is what I believe the "patrol read" feature of RAID cards is meant to address.

Fortunately, however, I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros.
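For what it's worth, a "patrol read" can be approximated in userspace: just force every sector to be read so the drive notices (and can remap) latent errors. A crude sketch, assuming a file or block device you can open read-only; real RAID firmware does this continuously and at the raw-sector level:

```python
import os

def patrol_read(path, chunk=1 << 20):
    """Poor man's patrol read: sequentially read the whole device/file
    so the drive is forced to touch every sector; return the byte
    offsets of chunks that could not be read."""
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        end = os.lseek(fd, 0, os.SEEK_END)  # works for block devices too
        pos = 0
        while pos < end:
            os.lseek(fd, pos, os.SEEK_SET)
            try:
                os.read(fd, min(chunk, end - pos))
            except OSError:
                bad_offsets.append(pos)  # unreadable region: log and move on
            pos += chunk
    finally:
        os.close(fd)
    return bad_offsets
```

A SMART long self-test (`smartctl -t long`) accomplishes much the same thing on-drive, without tying up the host's I/O path.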

> Does that impact durability of the drives?

I haven't read the studies (from Google, mostly, IIRC) in a while, and I'm not sure if they've released anything lately for more modern drives [1]. However, I believe you'll find an occasional "patrol read" won't noticeably reduce drive life/durability.

[1] Especially for something like SMR, whose tradeoffs would seem particularly attractive for something like this archival-like use case.


"I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros."

Comparison is needed to address misdirected writes and bit rot in the very least, see "An Analysis of Data Corruption in the Storage Stack" [1]. You can't count on your drive firmware or RAID firmware to get this right. You need bigger end-to-end checksums, and you need to scrub.
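With three replicas, as the grandparent has, a scrub can do better than just detecting a mismatch: a 2-of-3 majority vote tells you which copy rotted. A hedged sketch (replica roots and filenames here are illustrative, and this assumes whole-file hashing is affordable):

```python
import hashlib
from collections import Counter
from pathlib import Path

def file_digest(path):
    """SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(1 << 20):
            h.update(chunk)
    return h.hexdigest()

def scrub(replicas, relpath):
    """Compare one file across N replica roots.

    Returns the replicas whose copy disagrees with the majority digest
    (empty list if all copies agree)."""
    digests = {r: file_digest(Path(r) / relpath) for r in replicas}
    counts = Counter(digests.values())
    majority, votes = counts.most_common(1)[0]
    if votes == len(replicas):
        return []  # all replicas agree
    return [r for r, d in digests.items() if d != majority]
```

The end-to-end point stands, though: the checksum has to be computed and verified above the drive and RAID firmware, or misdirected writes can corrupt all the "redundant" copies consistently.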

[1] - http://www.cs.toronto.edu/~bianca/papers/fast08.pdf


Thanks! I had either missed that paper or had taken away more of the message that these errors are more likely to be from events like misdirected writes, cache flush problems (hence the high correlation with systems resets and not-ready-conditions), and firmware bugs (on-drive and further up the stack), rather than bit-rot.

Still:

> On average, each disk developed 0.26 checksum mismatches.

> The maximum number of mismatches observed for any single drive is 33,000.

Considering the latter can represent roughly 135 MB on a modern, 4K-sectored drive, that's a remarkable amount of data loss, enough to warrant checksumming higher up (such as in the filesystem)... in theory [1].
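The arithmetic, for anyone checking (assuming one checksum mismatch corresponds to one sector):

```python
# Worst-case drive from the FAST'08 paper: 33,000 checksum mismatches.
mismatches = 33_000
legacy = mismatches * 512    # 512-byte sectors
modern = mismatches * 4096   # 4K sectors
print(f"{legacy / 1e6:.1f} MB vs {modern / 1e6:.1f} MB")
# → 16.9 MB vs 135.2 MB
```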

However, the fact that this was NetApp using their custom hardware as the testbed makes me wonder if the data are skewed, and if the numbers would be nearly this bad from a more "commodity" setup, such as at Google. The paper alludes to this when referring to the extra hardware for the "nearline" disks, and I'm always suspicious of huge discrepancies in statistics between "enterprise" and other disks, even more so when there's a drastic difference in comparison methodology.

It would be interesting to see if there are any numbers for more modern drives, especially as the distinction between "enterprise" and "consumer" drives is disappearing, if only because demand for the latter is disappearing.

[1] In practice, an individual isn't aware of the ~17 MB/~135 MB loss risk, which is vanishingly small compared to other risks anyway, and businesses don't tend to care and have survived OK regardless.


Use ZFS, it can perform periodic integrity checks.


I've never done so outside of FreeNAS appliances, partly because I remain persuaded that offloading the RAID portion to a card is more cost-effective and higher-performance, especially on otherwise RAM- and/or IO-constrained servers, and partly because the ZFS support under Linux has, historically, been less than ideal.

Higher-level checksum failures are, however, a situation where I would appreciate integration between the filesystem and RAID, as I'd want a checksum error to mark a drive as bad, just like any other read error.

Do you happen to know if ZFS does that?



Unfortunately, no, it doesn't really say how ZFS behaves when an error is encountered.

This is super-disturbing and a dealbreaker, if it's still true:

> The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I've yet to see them work effectively), so you definitely want to run it when your system isn't busy servicing your customers.

I browsed a little of the oracle.com ZFS documentation but couldn't find much in the way of what triggers it to decide that a device is "faulted" other than being totally unreachable.


> I went through the extra effort of sourcing disks from as many different vendors as possible.

This is very good advice!

If you've already built your array, consider this advice: "replace a bad disk with a different brand whenever possible".

Over time, you naturally migrate away from the bad vendors/models/batches. Having followed this practice, it now seems ridiculous to me to keep replacing the same bad disks with the same vendor+model.


Although I wouldn't go so far as to insist on switching brands (especially since, as another commenter pointed out, there has been so much consolidation that only 3 remain), I agree that replacing with at least a different model, or, failing that, a different batch, is a best practice for an already-built homogeneous array.

Some of this can also be achieved ahead of time if one has multiple arrays with hot spares, by shuffling hot spares around, assuming there's some model diversity between the arrays but not within them.

I doubt I'll ever again have the luxury of being able to perform this kind of engineering, however. Even a minor increase in cost or cognitive/procedure complexity or a decrease in convenience just serves to encourage a "let's move everything to the cloud" reaction, so I keep my mouth shut.



