100%. Those disks are likely working much harder moving the head all over the pl...

gambiting · on April 22, 2025

....do you think the drive doesn't know where the empty space actually is?

MrDrMcCoy · on April 22, 2025

Drives blindly store and retrieve blocks wherever you tell them, with no awareness of how or if they relate to one another. It's a filesystem's job to keep track of what's where. Filesystems get fragmented over time, and especially as they get full. The more full they get, the more seeking and shuffling they have to do to find a place to write stuff. This will be the case even after the last spinning drive rusts out, as even flash eventually has to contend with fragmentation. Heck, even RAM has to deal with fragmentation. See the discussion from the last few weeks about the ongoing work to figure out a contiguous memory allocator in Linux. It's one of the great unsolved problems in general comparing that you and your descendants would be set for life if you could solve.

akx · on April 22, 2025

Not quite, AFAIK? Drive controllers may internally remap blocks to physical disk blocks (e.g. when a bad sector is detected; see the SMART attribute Reallocated Sector Count).

kbolino · on April 22, 2025

Logical Block Addressing (LBA) by its very nature provides no hard guarantees about where the blocks are located. However, the convention that both sides (file systems and drive controllers) recognize is that runs of consecutive LBAs generally refer to physically contiguous regions of the underlying storage (and this is true for both conventional spinning-platter HDDs as well as most flash-based SSDs). The protocols that bridge the two sides (like ATA, SCSI, and NVMe) use LBA runs as the basic unit of accessing storage.

So while block remapping can occur, and the physical storage has limits on its contiguity (you'll eventually reach the end of a track on a platter or an erasable page in a flash chip), the optimal way to use the storage is to put related things together in a run of consecutive LBAs as much as possible.

MrDrMcCoy · on April 22, 2025

Sure, but bad block tracking and error correction are pretty different from the implied file/volume awareness I was responding to.

kbolino · on April 22, 2025

Yes, to be clear, the drive controller generally (*) has no concept of volumes or files, and presents itself to the rest of the computer as a flat, linear collection of fixed-size logical blocks. Any additional structure comes from software running outside the drive, which the drive isn't aware of. The conventional bias that adjacent logical blocks are probably also adjacent physical blocks merely allows the abstraction to be maintained while also giving the file system some ability to encourage locality of related data.

* = There are some exceptions to this, e.g. some older flash controllers were made that could "speak" FAT16/32 and actually know if blocks were free or not. This particular use was supplanted by TRIM support.

genewitch · on April 22, 2025

I think you'll find that the word "find" doesn't mean "has to search", like one can find their nose in the middle of their face, if one desires.

Change the word to "seek" and it may make more sense.

fc417fc802 · on April 22, 2025

It makes more sense but it's not true for the modern CoW filesystems that I'm familiar with. Those allocate free space in slabs that they write to sequentially.

kbolino · on April 22, 2025

Also, CoW isn't some kind of magic. There are two meanings I can think of here:

A) When you modify a file, everything including the parts you didn't change is copied to a new location. I don't think this is how btrfs works.

B) Allocated storage is never overwritten, but modifying parts of a file won't copy the unchanged parts. A file's content is composed of a sequence (list or tree) of extents (contiguous, variable-length runs of 1 or more blocks) and if you change part of the file, you first create a new disconnected extent somewhere and write to that. Then, when you're done writing, the file's existing extent limits are resized so that the portion you changed is carved out, and finally the sequence of extents is set to {old part before your change}, {your change}, {old part after your change}. This leaves behind an orphaned extent, containing the old content of the part you changed, which is now free. From what evidence I can quickly gather, this is how btrfs works.

Compared to an ordinary file system, where changes that don't increase the size of a file are written directly to the original blocks, it should be fairly obvious that strategy (B) results in more fragmentation, since both appending to and simply modifying a file causes a new allocation, and the latter leaves a new hole behind.

While strategy (A) with contiguous allocation could eliminate internal (file) fragmentation, it would also be much more sensitive to external (free space) fragmentation, requiring lots of spare capacity and/or frequent defrag.

Either way, the use of CoW means you need more spare capacity, not less. It's designed to allow more work to be done in parallel, as fits modern hardware and software better, under the assumption that there's also ample amounts of extra space to work with. Denying it that extra space is going to make it suffer worse than a non-CoW file system would.

fc417fc802 · on April 22, 2025

Which is exactly why you periodically do maintenance to compact the free space. Thus it isn't an issue in practice unless you have a very specific workload in which case you should probably be using a specialized solution. (Although I've read that apparently you can even get a workload like postgres working reasonably well on zfs which surprises me.)

If things get to the point where there's over 1 TB of fragmented free space on a filesystem that is entirely the fault of the operator.

kbolino · on April 22, 2025

What argument are you driving at here? The smaller the free space, the harder it is to run compaction. The larger the free space, the easier it is. There are some confounding forces in certain workloads, but the general principle stands.

"Your free space shouldn't be very fragmented when you have such large amounts free!" is exactly why you should keep large amounts free.

kbolino · on April 22, 2025

If you delete files, or append to existing files, then the promises of the initial allocation strategy go out the window.