I tried to get a patch upstream to do round-robin reads in Linux RAID1 a few years ago because of this, but the demand for extensive benchmarks showing it was actually useful was so great that I abandoned the effort.
My guess is that using the same disk performs better than round robin: the disk could, for example, already have the data in cache while the other disk does not. When it comes to optimization, the results are often unexpected. You would have to run a lot of benchmarks across different disks, loads, use cases, etc. to draw a conclusion, and you might find outcomes you did not think about, like both disks bricking themselves at the same time because they have almost the same read/write count.
One could do delayed round-robin: for one day (or some other time period) the first-choice disk is disk A; the next period, disk B; and so on. For intentional non-uniformity, you could vary the time period: one day for disk A, half a day for disk B, etc.
Possibly pushing the likelihood of first failure out, preferably beyond capital lifetime assumptions? Possibly increasing system performance over the lifetime of the SSDs.
But, it's possible that this solution is not particularly useful depending on what mixes of equipment and workloads one expects to operate.
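The delayed round-robin idea above can be sketched as a small function; this is a hypothetical illustration of the scheme, not code from any md patch, and the period values are just the example numbers from the proposal.

```python
# Hypothetical sketch of "delayed round-robin": instead of alternating
# per request, the first-choice disk rotates on a coarse time scale,
# with optionally unequal slot lengths per disk for intentional
# non-uniformity.

def first_choice(now: float, periods: list[float]) -> int:
    """Return the index of the current first-choice disk.

    `periods` gives each disk's slot length in seconds, e.g.
    [86400.0, 43200.0] means disk 0 leads for a day, disk 1 for half
    a day, then the cycle repeats.
    """
    cycle = sum(periods)
    t = now % cycle
    for disk, slot in enumerate(periods):
        if t < slot:
            return disk
        t -= slot
    return len(periods) - 1  # unreachable since t < cycle

# Disk A leads for one day, disk B for half a day:
periods = [86400.0, 43200.0]
```

Because the rotation is deterministic in wall-clock time, every node in a cluster would make the same choice without coordination, which is part of what makes the scheme attractive.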
There was discussion about the spelling of "robbin" versus "robin", but that was easily corrected and a new patch submitted. It was definitely the request from the Linux RAID maintainer Neil Brown to provide benchmarks that stopped me from pursuing it further. I created the patch when someone in the Linux IRC channel was reporting this imbalance on their SSDs, so I had no SSD hardware of my own to test it on. This was a long time ago.
Counter to common intuition, SSD reads aren't completely non-destructive (they deplete stored charge), and rolling over the READ DISTURB counter triggers wear leveling on the affected block. Micron listed 100K read cycles per block for MLC, 1M for SLC, and probably around 1K for TLC.
Ergo, non-balanced reading is better for endurance of the array, as it reduces the probability of contemporaneous multi-disk failure. Like disk 2 failing under the read load encountered during a rebuild.
In theory, this is true. But the effect of reads on SSD endurance is orders of magnitude smaller than the effect of writes. The reads required for a rebuild aren't going to trigger an endurance-related failure unless the drive was already within one full drive write of failing.
Not directly, but you will cause some writing to happen internally to the drive if a read disturb is encountered. When the SSD has trouble reading back data due to a read disturb or similar data degradation, it'll recover it using the error correcting codes and then write a fresh copy.
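The mechanism described above can be sketched as a toy model: count reads per block and, once a counter crosses the read-disturb threshold, rewrite a fresh copy (which also resets the counter). The threshold below is the MLC figure quoted earlier; the class and its structure are invented for illustration.

```python
# Toy model of internal read-disturb handling: after enough reads,
# the controller relocates the block's data, costing one extra
# program/erase cycle but refreshing the stored charge.

READ_DISTURB_LIMIT = 100_000  # e.g. the quoted MLC figure

class Block:
    def __init__(self):
        self.read_count = 0
        self.relocations = 0  # each relocation consumes write endurance

    def read(self):
        self.read_count += 1
        if self.read_count >= READ_DISTURB_LIMIT:
            self.relocate()

    def relocate(self):
        # Rewrite a fresh copy elsewhere; the read counter restarts.
        self.relocations += 1
        self.read_count = 0
```

This is why reads do cost some write endurance, but only at a ratio of roughly one program/erase per hundred thousand reads, which is the "orders of magnitude smaller" effect mentioned above.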
Pro tip: use RAID10 instead of RAID1 with Linux md to have the same level of redundancy for arrays with only two legs. RAID10 will properly stripe/balance reads across them, effectively doubling linear read throughput. The only real downside is that you will have to find and actively choose a sensible chunk size that fits both your hardware and workload.
I wonder if there are cache effects involved as well - continuously using the same SSD most of the time may reuse cached metadata relevant to the SSD innards that the kernel FS cache can't help with.
Let me clarify. All the VFS caches are above the md driver. The cache we care about, the buffer cache, is not below the md driver.
You'd need some kind of cache below the md driver to cache "metadata relevant to the SSD innards". Since there isn't one, there can't be a caching penalty for switching between SSDs in a mirror. That answers your question.
But since you want an insight, here's one for you. There is no "metadata relevant to the SSD innards" available. The md drivers do not have this information. I wish they did. When I was still playing with the md code I'd love to know the block and page sizes of the drives. But that information is two bytes. Not really something that would need a cache.
Let me clarify what I was saying: I was redundantly (for clarity) talking about cache on the SSD, trying to make it extra clear that I wasn't talking about any kernel cache.
I'm trying to figure out how, exactly, you've read my attempt at excessive clarity. It's not clear to me yet what you misread: you seem to be trying to correct some misunderstanding, but it hasn't worked, because you haven't said anything that isn't already blindingly obvious.
> There is no "metadata relevant to the SSD innards" available. The md drivers do not have this information.
This is not an insight.
The cache I'm interested in is tied up in how the SSD presents as a block device, but isn't implemented as a block device, as block devices are normally understood (wear levelling and remapping, hidden parallelism / striping, etc.). Yes, the FS cache can't help with this. Yes, the RAID implementation can't help with this. Of course. It's internal to the SSD. It's innards. I don't see what information you added.
I may have misread your original post, but your hostility isn't helping to clarify things. Instead of replying "I was talking about the SSD cache", you replied with "Well, it could hardly be below, could it. Was there an insight you wanted to mention?" So instead of clarifying things you decided to be rude.
Now that I know what you're talking about, let's try to answer your question. There are SSD controllers that don't use a RAM cache on the drive. It's not necessary for performance and, if present, it doesn't do as much good as it would on an HDD. The main benefit of a drive cache while reading is read-ahead. This is not needed for an SSD, as it doesn't have to wait for the sectors to show up under the head.
The only thing that will affect things is reading file system blocks from the same SSD page/block. The md code already does this. If the next block requested is after the last blocks read it will use the same SSD.
If the block is from somewhere else then it doesn't matter if it comes from this SSD or that SSD. Access time will be the same.
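The affinity heuristic described in the last two paragraphs can be sketched as follows. This is a toy model of the decision (keep sequential reads on the same mirror leg, pick any leg otherwise), not the actual md balancing code; the tie-break for random reads is arbitrary here.

```python
# Toy model of read-leg selection for a mirror: a request that
# continues where a previous read left off stays on the same leg
# (same SSD page/block); a random read goes to whichever leg the
# arbitrary tie-break picks, since access time is the same either way.

class Mirror:
    def __init__(self, legs: int):
        self.legs = legs
        self.last_sector = [None] * legs  # end sector of each leg's last read

    def pick_leg(self, sector: int, length: int) -> int:
        # Sequential continuation: reuse the leg whose last read
        # ended exactly where this request starts.
        for leg, last in enumerate(self.last_sector):
            if last == sector:
                self.last_sector[leg] = sector + length
                return leg
        # Random read: any leg will do (arbitrary tie-break for the sketch).
        leg = sector % self.legs
        self.last_sector[leg] = sector + length
        return leg
```

For example, two back-to-back requests at sectors 100 and 108 (length 8 each) land on the same leg, while unrelated random reads spread out.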
AFAIK, there is no need to do "wear leveling" for reads, SSDs have "unlimited" life cycle reads. So the first SSD seeing ten times the reads will not make it fail sooner.
Even if reads did wear out an SSD, wouldn't it be better to wear them out one at a time rather than wear them all out equally so they all fail in rapid succession once they reach the end of their life cycles?
On one hand this is true; on the other, if you have two devices that can work for 5 years and you replace them anyway because they are now too small and never failed, you end up with one replaced device and another that was hardly used, and increased TCO, since you had to buy another device in the middle of the life cycle.
Not that it is really critical: read disturb exists, but on a far lower scale than erase endurance.
> wouldn't it be better to wear them out one at a time rather then wear them all out equally
Agree and this would seem to make sense.
It is like having two rolls of paper towels in the kitchen. If you use both of them, you are only extending the time until both are empty. If you use only one, you can replace it when it runs out and run on the second in the meantime.
I would say no, you'd want the whole disk to fail then so you can just replace it vs. it getting smaller over time, which would make capacity planning pretty complex.
We're talking about RAID1 here, so multiple SSDs. I'm saying you don't want multiple disks to fail in rapid succession, which is what you'd maybe get if you exactly equalized the wear on them.
There's no need. If a page has been read enough times to produce a read disturb, the remedy is to re-write the page, which moves the write endurance indicator (by the smallest possible increment).
Not actually true; SSDs experience "read disturbance" caused by application of read voltages. Essentially, the cells being read, and those nearby, experience a little wear on each read operation. It's a lot less wear than a write, but it's still there.
When I last looked at this, devices could suffer (say) 3K writes, and once written a cell would have to be re-written after 3K reads in order to maintain proper voltage levels.
You can't really individually address a cell; a single read will drag in a bunch of cells, if only for ECC. Write granularity varies from device to device, but is often on the order of 256 or 512 bytes (though "append only" operations to blocks are sometimes supported). Erase granularity is what really matters, with values on the order of 64K being popular. So really you just need to keep track of erase block counts, which is relatively easy, and just maintain a "guess" as to read counts.
It's much, much harder to keep track of logical block location, especially if you need to handle surprise removal of power.
> It's much, much harder to keep track of logical block location, especially if you need to handle surprise removal of power.
Is there a good reason at this point for Flash controllers to bother doing this part at all, instead of:
1. maintaining the per-block erase count metrics internally;
2. exposing the raw physical blocks over ATA;
3. adding an ATA protocol extension that returns the erase-count for a block whenever it's read/written?
Under such a system, the OS would maintain the logical block map (probably as a "layer" in logical volume management, mapping a wear-prone PV to a wearless, but gradually shrinking LV.) It could even put the logical-block-map metadata on a different disk, or pool the logical block maps for a bunch of disks together for a RAID-set and then RAID-mirror the pool, or whatever else. Basically you could play around with such maps the way you can play around with Linux's thin-pool metadata volumes.
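The proposed split could be sketched like this. To be clear, this is a speculative illustration of the hypothetical scheme above, not any real ATA extension or existing driver: the drive exposes raw physical blocks and per-block erase counts, while the OS keeps the logical-to-physical map, retires worn blocks, and lets usable capacity shrink gradually. The class, the endurance limit, and the wear-leveling policy are all invented.

```python
# Sketch of an OS-side flash translation layer under the proposed
# (hypothetical) interface: the OS owns the logical->physical map,
# writes out of place to the least-worn free block, and retires
# blocks that hit the erase limit, shrinking the "wearless" LV.

ERASE_LIMIT = 3000  # assumed program/erase endurance per block

class OsFtl:
    def __init__(self, physical_blocks: int):
        self.free = list(range(physical_blocks))
        self.erase_count = [0] * physical_blocks
        self.l2p = {}  # logical block -> physical block

    def write(self, lba: int):
        # Out-of-place write: grab the least-worn free block.
        self.free.sort(key=lambda b: self.erase_count[b])
        pb = self.free.pop(0)
        old = self.l2p.get(lba)
        self.l2p[lba] = pb
        if old is not None:
            self.erase_count[old] += 1  # the stale copy gets erased
            if self.erase_count[old] < ERASE_LIMIT:
                self.free.append(old)   # else: block retired, LV shrinks

    def capacity(self) -> int:
        return len(self.free) + len(self.l2p)
```

Since the map lives in ordinary OS data structures, it could indeed be stored on another disk or mirrored across a RAID set, much like thin-pool metadata volumes.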
There are quite a few more complicated issues that require handling than just the erase count for each erase block. The SSD controller does, and keeps track of, many things in order to push endurance to the maximum, such as:
1. How many bits of error did we detect in the read? Maybe it's time to relocate the data before it dies completely?
2. How long does the write take? Maybe it is time to retire this block because it is near dead for unknown reasons?
3. How many times was this block read? Maybe it is time to relocate it before we risk read-disturb issues?
4. Try to distribute data evenly across the flash dies?
5. Ensure we have enough erased blocks to write into to avoid blocking new writes.
6. Can I do some background operation? I really need to garbage collect and rescue some nearly lost data.
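Items 1 to 3 above amount to a per-block health check before each relocation decision; a minimal sketch might look like this. All the thresholds and the function itself are invented for illustration; real controllers use vendor-specific, flash-generation-specific values.

```python
# Hypothetical relocation heuristic combining the signals listed
# above: correctable ECC errors, program latency, and reads since
# the last write. Any one crossing its (made-up) threshold triggers
# a proactive relocation before the data becomes unreadable.

def should_relocate(ecc_bits_corrected: int,
                    write_latency_us: float,
                    reads_since_write: int) -> bool:
    MAX_ECC_BITS = 8           # act well before ECC capacity is exhausted
    SLOW_WRITE_US = 2000.0     # near-dead blocks tend to program slowly
    READ_DISTURB_LIMIT = 100_000
    return (ecc_bits_corrected >= MAX_ECC_BITS
            or write_latency_us >= SLOW_WRITE_US
            or reads_since_write >= READ_DISTURB_LIMIT)
```

The remaining items (die-level striping, free-block reserves, background garbage collection) are scheduling policies layered on top of checks like this one.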
What you suggest (and I entertained it for a while too) is a return to the pre-IDE (RLL/MFM) days, where the OS did all the hard work and got all the flexibility. But the OS vendors do not want that complexity: with the more advanced technologies involved it is hard work, and more importantly it varies between vendors, and for flash it can even need tuning for different flash generations or even batches.
On the other side, the SSD vendors want to use the cheapest flash chips and get maximum revenue. The algorithms they implement are their trade secrets and the thing that lets them charge more for the better devices. If they were reduced to flash dealers, they would find themselves getting less money.
I do think that having a nearly raw flash interface, with a better processor and possibly better algorithms above it, would yield better results (especially if you integrate it deeper into the user application). But the complexity gets that much higher for the entire app, and when (not if) another technology comes along, the entire app will need to be rewritten, which sucks.
I thought about moving the work that a flash controller did to an OS, on a game console. Would have saved some money (high tens to low hundreds of millions of dollars over the product lifetime). Just too damned risky and complicated to make work, given the ship schedule. The controllers do a lot of work, and provide a pretty clean abstraction over a bunch of complexity and dirty business under the hood.
Some of the metadata is only kept in RAM and written out from time to time to special regions. These special regions can also suffer from endurance issues, but they are usually MLC used in SLC mode, so they wear more slowly. Still, it is a known failure mode for the metadata locations to wear out, and we had it happen to us during testing, especially when we did lots of firmware upgrade/downgrade testing.
This is not necessarily about NAND wear leveling, it could be about load balancing. SSDs have limited ability to service requests, and they dissipate heat when they do so. At high load, one of the disks is going to be hotter for longer, and wear out faster as a result.
Then again, not sure if author was really thinking about these sorts of things, who knows.
One benefit of spreading out reads is detecting bad sectors on drive regions that would normally not see many reads. The good data can then be copied from the other mirror.
Back when I still ran Linux on my personal servers I noticed this. I was adding TRIM support to the md drivers before the maintainers finally got around to it. That's when I noticed the balancing code tended to favour one disk.
I added round-robin code and experimented with splitting large sequential requests between idle disks.
Does it matter? Do reads affect lifetime of SSDs? It can't be affecting performance, as if the drives were busy then it would be balanced much more evenly.
http://www.rkeene.org/viewer/tmp/linux-2.6.35.4-2rrrr.diff.h...