
I'm sad that drives don't have a 'shutdown' command which writes a few extra bytes of ECC data per page into otherwise empty flash cells.

It turns out that a few extra bytes can turn a 1 year endurance into a 100 year endurance.



There are programs with which you can add any desired amount of redundancy to your backup archives, so that they would survive corruption that does not affect a greater amount of data than the added redundancy.

For instance, on Linux there is par2cmdline. For all my backups, I create pax archives, which are then compressed, then encrypted, then expanded with par2create, then aggregated again in a single pax file (the legacy tar file formats are not good for faithfully storing all metadata of modern file systems and each kind of tar program may have proprietary non-portable extensions to handle this, therefore I use only the pax file format).
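A minimal sketch of that pipeline as a shell function, under stated assumptions: bsdtar (from libarchive) and par2 (from par2cmdline) are on the PATH, the function name `make_protected_archive` and the `.protected.pax` suffix are made up for illustration, compression is done by bsdtar's own `--gzip` rather than by a separate program, and the encryption step is left out because no specific tool is named here.

```shell
# Sketch: archive a directory as pax, add par2 recovery data, then bundle
# everything into one outer pax file. Encryption is intentionally omitted.
make_protected_archive() {
    src_dir=$1
    redundancy=${2:-10}   # percent of recovery data; 10 is an arbitrary default

    if [ ! -d "$src_dir" ]; then
        echo "not a directory: $src_dir" >&2
        return 2
    fi

    name=$(basename "$src_dir")
    work_dir=$(mktemp -d) || return 1

    # 1. Inner archive: pax format, gzip-compressed by bsdtar itself.
    bsdtar --create --gzip --format=pax \
        --file="$work_dir/$name.pax.gz" \
        -C "$(dirname "$src_dir")" "$name" || return 1

    # 2. Recovery data: par2 writes $name.pax.gz.par2 plus volume files.
    par2 create -q -r"$redundancy" \
        "$work_dir/$name.pax.gz.par2" "$work_dir/$name.pax.gz" || return 1

    # 3. Outer archive: uncompressed pax holding the archive and its par2 files.
    bsdtar --create --format=pax --file="$name.protected.pax" \
        -C "$work_dir" . || return 1

    rm -rf "$work_dir"
}
```

Calling `make_protected_archive ./photos 10` would leave `photos.protected.pax` in the current directory. After extracting it later, `par2 verify photos.pax.gz.par2` checks the inner archive and `par2 repair photos.pax.gz.par2` attempts a repair.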

Besides that, important data should be replicated and stored on 2 or even 3 SSDs/HDDs/tapes, which should preferably be stored themselves in different locations.


Unfortunately, some SSD controllers flatly refuse to read data they consider corrupted. Even if you have extra parity that could potentially restore the corrupted data, your entire drive might refuse to read.


Huh?

The issue being discussed is random blocks, yes?

If your entire drive is bricked, that is an entirely different issue.


Here’s the thing. That SSD controller is the interface between you and those blocks.

If it decides, by some arbitrary measurement, as defined by some logic within its black box firmware, that it should stop returning all blocks, then it will do so, and you have almost no recourse.

This is a very common failure mode of SSDs. As a consequence of some failed blocks (likely once the count of failed blocks exceeds some threshold, or perhaps because the controller’s own storage failed), drives will commonly brick themselves.

Perhaps you haven’t seen it happen, or your SSD doesn’t do this, or perhaps certain models or firmwares don’t, but some certainly do, both from my own experience, and countless accounts I’ve read elsewhere, so this is more common than you might realise.


This is correct: you still have to go through the firmware to gain access to the block/page on “disk”, and if the firmware decides the block is invalid, then the read fails.

You can sidestep this by bypassing the controller on a test bench though. Pinning wires to the chips. At that point it’s no longer an SSD.


The mechanism is usually that the SSD controller requires that some work be done before your read - for example rewriting some access tables to record 'hot' data.

That work can't be done because there are no free blocks. However, no space can be freed up because every spare writable block is bad or is in some other unusable state.

The drive is therefore dead - it will enumerate, but neither read nor write anything.


I don't think this is correct; it could read the flash block containing the [part of the] table in question, update it in memory, erase that block, then rewrite it into the same block.


I really wish this responsibility was something hoisted up into the FS and not a responsibility of the drive itself.

It's ridiculous (IMO) that SSD firmware is doing so much transparent work just to keep the illusion that the drive is actually spinning metal with similar sector write performance.


Linux supports raw flash, called an MTD device (memory technology device). It's often used in embedded systems. And it has MTD-native filesystems such as ubifs. But it's only really used in embedded systems because... PC SSDs don't expose that kind of interface. (Nor would you necessarily want them to. A faulty driver would quietly brick your hardware in a matter of minutes to hours)


A buggy firmware will brick an SSD and block every option for recovering at least part of the data.


Seems like the approach Apple is taking by soldering storage directly on the mainboard or using proprietary modules like in the Mac mini.


When only a number of 4 kB blocks cannot be read, if the amount of affected data is less than the amount of added redundancy the archive file can still be repaired.

For instance, if you have a 40 GB backup archive with 10% redundancy, 4 GB of data, i.e. one million 4 kB data blocks can be unreadable and you can still repair the archive and recover the complete content.
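The arithmetic behind that example can be checked with a few lines of shell. This is only a sketch of the numbers quoted above, assuming decimal units (1 GB = 10^9 bytes, 1 kB = 10^3 bytes); the variable names are made up for illustration.

```shell
# Sizes from the example: 40 GB archive, 10% redundancy, 4 kB blocks.
archive_bytes=$((40 * 1000 * 1000 * 1000))
redundancy_percent=10
block_bytes=$((4 * 1000))

# Amount of recovery data, and how many 4 kB blocks it can stand in for.
recovery_bytes=$((archive_bytes * redundancy_percent / 100))
recoverable_blocks=$((recovery_bytes / block_bytes))

echo "recovery data: $recovery_bytes bytes"
echo "recoverable 4 kB blocks: $recoverable_blocks"
```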

It is true that the entire SSD or HDD can become bricked. The solution for this, as I have already written in my previous comment, is to duplicate any SSD/HDD used for archival purposes, which I always do.


Yes, and? HDD controllers dying and head crashes are a thing too.

At least in the ‘bricked’ case it’s a trivial RMA - corrupt blocks tend to be a harder fight. And since ‘bricked’ is such a trivial RMA, manufacturers have more of an incentive to fix it or go broke, or avoid it in the first place.

This is why backups are important now; and always have been.


We're not talking about the SSD controller dying. The SSD controller in the hypothetical situation that's being described is working as intended.


Not as far as I can tell, where intended is ‘as any user would reasonably expect’. Bricking the drive (can’t even read) because of too many errors is not what most users would ever want.

Some would (enterprise maybe), but even then they’d want deterministic data deletes too, and it doesn’t sound like that is happening.


You can argue that controllers shouldn't behave that way. But they do, it's not a bug, and it's not a dead controller. It's a perfectly functional controller's response to dead blocks.


Cite? By definition it appears to not meet the definition of ‘functional’.


The definition of functional in the context of the discussion is that it works in the way the manufacturer explicitly designed it to work, in a standard industry practice fashion, not as an unforeseen bug or malfunction.

Not some abstract notion.


So not enumerating as a drive, and not allowing you to read even valid blocks is ‘working’?


Yes, same as a facility self-destructing, if it was programmed to do so, is working as per its spec.


And what spec requires that? I have yet to see one.


The manufacturers'.


Cite? I have yet to see that actually documented anywhere, and you keep avoiding actually referring to one either.


RE "....This is why backups are important now; and always have been..."

Still a big problem if backup is to the "..same technology..."


That’s why 3-2-1 is not just a good idea.


Thank you for this.

I had no knowledge of pax, or that par was an open standard, and I care about what they help with. Going to switch over to using both in my backups.


For handling pax archives, I recommend the "libarchive" package, which is available in many Linux distributions, even if it originally comes from FreeBSD.

Among other utilities, it installs the "bsdtar" program, which you can use in your scripts like this:

  bsdtar --create --verbose --format=pax --file="${DIRECTORY}".pax "${DIRECTORY}" || exit

And for extraction:

  bsdtar --extract --preserve-permissions --verbose --file="${DIRECTORY}".pax

The bsdtar program has options for compressing and/or encrypting the archives, in case you do not want to use separate external programs directly.

"par2create" creates multiple files from the (normally compressed and encrypted) archive file, for storing the added redundancy. I make a directory where I move those files, then I use bsdtar a second time (obviously without any compression or encryption) to aggregate those files in a single archive with redundancy.

The libarchive package can also be taken directly from:

https://github.com/libarchive/libarchive

"libarchive" handles correctly all kinds of file metadata, e.g. extended file attributes and high-resolution file timestamps, which not all archiving utilities do. Many Linux utilities, with the default command-line options or when they have not been compiled from their source with adequate compilation options, which happens in some Linux distributions, may silently lose some of the file metadata, when copying, moving or archiving.


There's no reason that you have to create multiple files for par2 if you are storing the recovery data with the protected data. It was only split into files of varying size because of its origin in protecting binaries posted to Usenet, so that users did not have to download the entire recovery data when they only needed a portion.


This is fine, but I'd prefer an option to transparently add parity bits to the drive, even if it means losing access to capacity.

Personally, I keep backups of critical data on a platter disk NAS, so I'm not concerned about losing critical data off of an SSD. However, I did recently have to reinstall Windows on a computer because of a randomly corrupted system file. Which is something this feature would have prevented.


Blind question with no attempt to look it up: why don't filesystems do this? It won't work for most boot code but that is relatively easy to fix by plugging it in somewhere else.


Wrong layer.

SSDs know which blocks have been written to a lot, have been giving a lot of read errors before etc., and often even have heterogeneous storages (such as a bit of SLC for burst writing next to a bunch of MLC for density).

They can spend ECC bits much more efficiently with that information than a file system ever could, which usually sees the storage as a flat, linear array of blocks.


This is true, but nevertheless you cannot place your trust only in the manufacturer of the SSD/HDD, as I have seen enough cases when the SSD/HDD reports no errors, but nonetheless it returns corrupted data.

For any important data you should have your own file hashes, for corruption detection, and you should add some form of redundancy for file repair, either with a specialized tool or simply by duplicating the file on separate storage media.

A database with file hashes can also serve other purposes than corruption detection, e.g. it can be used to find duplicate data without physically accessing the archival storage media.
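As a rough sketch of such a hash database, a plain text manifest is enough. This assumes GNU coreutils (`sha256sum`, `sort`, `cut`, `uniq`) and `find`; the function names are made up for illustration, and the manifest should be kept outside the directory it describes.

```shell
# Write a manifest of SHA-256 hashes for every regular file under a directory.
# $1 = directory to hash, $2 = manifest file to write.
make_manifest() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 ) > "$2"
}

# Re-hash the files and compare them with the manifest.
# Returns non-zero if any file is missing or its hash has changed.
check_manifest() {
    ( cd "$1" && sha256sum --check --quiet ) < "$2"
}

# List hashes that occur more than once, i.e. duplicate file contents.
# Only the manifest is read, so the storage media need not be attached.
duplicate_hashes() {
    cut -d ' ' -f 1 "$1" | sort | uniq -d
}
```

Running `make_manifest /mnt/archive ~/archive.sha256` once and `check_manifest /mnt/archive ~/archive.sha256` periodically would flag silent corruption that the drive itself never reports. Repairing it still needs the separate redundancy described above.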


Verifying at higher layers can be ok (it's still not ideal!), but trying to actively fix things below that are broken usually quickly becomes a nightmare.


IMO it's exactly the right layer, just like for ECC memory.

There's a lot of potential for errors when the storage controller processes and turns the data into analog magic to transmit it.

In practice, this is a solved problem, but only until someone makes a mistake, then there will be a lot of trouble debugging it between the manufacturer certainly denying their mistake and people getting caught up on the usual suspects.

Doing all the ECC stuff right on the CPU gives you all the benefits against bitrot and resilience against all errors in transmission for free.

And if all things go just right we might even be getting better instruction support for ECC stuff. That'd be a nice bonus


> There's a lot of potential for errors when the storage controller processes and turns the data into analog magic to transmit it.

That's a physical layer, and as such should obviously have end-to-end ECC appropriate to the task. But the error distribution shape is probably very different from that of bytes in NAND data at rest, which is different from that of DRAM and PCI again.

For the same reason, IP does not do error correction, but rather relies on lower layers to present error-free datagram semantics to it: Ethernet, Wi-Fi, and (managed-spectrum) 5G all have dramatically different properties that higher layers have no business worrying about. And sticking with that example, once it becomes TCP's job to handle packet loss due to transmission errors (instead of just congestion), things go south pretty quickly.


> And sticking with that example, once it becomes TCP's job to handle packet loss due to transmission errors (instead of just congestion), things go south pretty quickly.

Outside of wireless links (where FEC of some degree is necessary regardless) this is mostly because TCP’s checksum is so weak. QUIC for example handles this much better, since the packet’s authenticated encryption doubles as a robust error detecting code. And unlike TLS over TCP, the connection is resilient to these failures: a TCP packet that is corrupted but passes the TCP checksum will kill the TLS connection on top of it instead of retransmitting.


Ah, I meant go south in terms of performance, not correctness. Most TCP congestion control algorithms interpret loss exclusively as a congestion signal, since that's what most lower layers have historically presented to it.

This is why newer TCP variants that use different congestion signals can deal with networks that violate that assumption better, such as e.g. Starlink: https://blog.apnic.net/2024/05/17/a-transport-protocols-view...

Other than that, I didn't realize that TLS has no way of just retransmitting broken data without breaking the entire connection (and a potentially expensive request or response with it)! Makes sense at that layer, but I never thought about it in detail. Good to know, thank you.


ECC memory modules don’t do their own very complicated remapping from linear addresses to physical blocks like SSDs do. ECC memory is also oriented toward fixing transient errors, not persistently bad physical blocks.


You can still do this for boot code if the error isn't significant enough to make all of the boot fail. The "fixing it by plugging it in somewhere else" could then also be simple enough to the point of being fully automated.

ZFS has "copies=2", but iirc there are no filesystems with support for single disk erasure codes, which is a huge shame because these can be several orders of magnitude more robust compared to a simple copy for the same space.
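To make the space trade-off concrete, here is a toy sketch of the simplest erasure code, a single XOR parity block. It is not Reed-Solomon and not anything ZFS does; the small integers just stand in for real blocks. Three data blocks plus one parity block cost about 33% extra space and survive the loss of any one block, whereas a full second copy costs 100%.

```shell
# Three data "blocks" (small integers standing in for real blocks).
d0=165
d1=60
d2=219

# One parity block: the XOR of all data blocks.
parity=$(( d0 ^ d1 ^ d2 ))

# Pretend d1 became unreadable; rebuild it from the survivors and the parity.
recovered_d1=$(( d0 ^ d2 ^ parity ))

echo "original d1:  $d1"
echo "recovered d1: $recovered_d1"
```

Real erasure codes generalize this to several parity blocks, so that more than one lost block can be rebuilt.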


zfs can run with a single disk stripe. pfsense gladly runs this way. See https://docs.netgate.com/pfsense/en/latest/install/install-z...


The filesystem doesn't have access to the right existing ECC data to be able to add a few bytes to do the job. It would need to store a whole extra copy.

There are potentially ways a filesystem could use hierarchical ECC to just store a small percentage extra, but it would be far from theoretically optimal and would rely on the fact that just a few logical blocks of the drive become unreadable, and that those logical blocks aren't correlated in write time (which I imagine isn't true for most SSD firmware).


CD storage has an interesting take: the available sector size varies by use, i.e. audio or MPEG1 video (VideoCD) at 2352 data octets per sector (with two media level ECCs), actual data at 2048 octets per sector where the extra EDC/ECC can be exposed by reading "raw". I learned this the hard way with VideoPack's malformed VCD images; I wrote a tool to post-process the images to recreate the correct EDC/ECC per sector. Fun fact, ISO9660 stores file metadata simultaneously in big-endian and little-endian form (AFAIR VP used to fluff that up too).


Octets? Don't you mean "bytes"? Or is that word problematic now?


I wonder if OP used "octets" because physical pattern in the CD used to represent a byte is a sequence of 17 pits and lands.

BTW, byte size has varied historically from 4 to 24 bits! Even now, depending on interpretation, you can say 16-bit bytes do exist.

Char type can be 16 bit on some DSP systems.

I was curious, so I checked. Before this comment, I only knew about 7 bit bytes.


Personally, I prefer the word "bytes", but "octets" is technically more accurate as there are systems that use differently sized bytes. A lot of these are obsolete but there are also current examples, for example in most FPGA that provide SRAM blocks, it's actually arranged as 9, 18 or 36-bit wide with the expectation that you'll use the extra bits for parity or flags of some kind.


Octets is the term used in most international standards instead of the American "byte".

"Octet" has the advantage that it is not ambiguous. In old computer documentation, from the fifties to the late sixties, a "byte" could have meant any size between 6 bits and 16 bits, just like "word", which could have meant anything between 8 bits and 64 bits, including values like 12 bits, 18 bits, 36 bits, 60 bits, or even 43 bits.

Traditionally, computer memory is divided into pages, which are divided into lines, which are divided into words, which are divided into bytes. However, the sizes of all of those "units" varied over very wide ranges in the early computers.

IBM System/360 chose the 8-bit byte, and the dominance of IBM then forced this now ubiquitous meaning of "byte", but there were many computers before System/360, and many that coexisted for some years with the IBM 360 and later mainframes, where byte meant something else.


Not problematic, minor pedantry. With much time spent reading (and occasionally writing) technical documentation it's octets, binary prefixes, and other wanton pedantry where likely to be understood/appreciated or precision is required.

FTR, ECMA-130 (the CD "yellow book" equivalent standard) is littered with the term "8-bit bytes", so it was certainly a thing then. Precision when simultaneously discussing eight-to-fourteen modulation, and the 17 encoding "bits" that hit the media for each octet as noted in a sibling comment.

Now, woktets on the other hand...


The term octets is pretty common in network protocol RFCs, maybe their vocabulary is biased in the direction of that writing.


Reed Solomon codes, or forward error correction is what you’re discussing. All modern drives do it at low levels anyway.

It would not be hard for a COW file system to use them, but it can easily get out of control paranoia wise. Ideally you’d need them for every bit of data, including metadata.

That said, I did have a computer that randomly bit flipped when writing to storage sometimes (eventually traced it to an iffy power supply), and PAR (a Reed-Solomon forward error correction library) worked great for getting a working backup off the machine. Every other thing I tried would end up with at least a couple of bit flip errors per GB, which made it impossible.


You can, but only if your CPU is directly connected to a flash chip with no controller in the way. Linux calls it the mtd subsystem (memory technology device).


That does sound like a good idea (even if I’m sure some very smart people know why it would be a bad idea)


I guess the only way to do this today is with a raid array?



