Hacker News

I'm always surprised by what a mess doing what seem like simple file operations is. Maybe even more surprised that everything generally works pretty well despite those issues. Even "I want to save this file" requires numerous sync operations on the file and the directory it's in.
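
The "save this file" dance is roughly: write to a temporary file, fsync it, rename over the target, then fsync the containing directory so the rename itself is durable. A minimal sketch (names illustrative, POSIX semantics assumed):

```python
import os

def save_atomically(path: str, data: bytes) -> None:
    """Durably replace the file at `path`: tmp write -> fsync -> rename -> dir fsync."""
    dirname = os.path.dirname(path) or "."
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush the file's contents to stable storage
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomic replace on POSIX filesystems
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)  # make the rename (the directory entry) durable too
    finally:
        os.close(dfd)
```

Skipping the directory fsync is a classic bug: after a crash the new file contents can be on disk while the directory still points at the old name.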

I'm certainly not qualified to criticize anyone for the current situation, and, as the article points out, even some of the more egregious-sounding behavior (marking pages as clean after a write fails) has a pretty reasonable explanation. But, IIRC, as storage capacities continue to rise, error rates aren't falling nearly as fast. So I'm left kinda wondering if there is some day in the future where the likelihood of encountering an error finally gets high enough that things no longer work pretty well.



Although it's not exactly the same issue, there is a growing understanding that SMB/CIFS shares have a nasty habit of reporting data as "on storage" before it really is safe. That's a problem for many backup systems, unless you do a verify afterwards and pick up the pieces. Backups can involve massive files with odd read and write patterns, and databases generally involve quite large files with odd read and write patterns compared to, say, document storage.
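
The "verify afterwards" step can be as simple as re-reading both sides and comparing checksums once the copy reports success. A rough sketch (function names are made up):

```python
import hashlib

def file_sha256(path: str, chunk: int = 1 << 20) -> str:
    """Checksum a file in chunks so huge backup files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def verify_copy(src: str, dst: str) -> bool:
    """Re-read after the copy 'succeeds': a cheap guard against a share
    that acknowledged writes before the data was actually stable."""
    return file_sha256(src) == file_sha256(dst)
```

Note this only catches corruption that is visible on re-read; it can't prove the remote side has flushed its own caches.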

Perhaps we need database and backup oriented filesystems and not funny looking files on top of generic filesystems.


Ironically, most sophisticated database engines do implement complete file systems, treating those "funny looking files" as little more than virtual block devices. In fact, with very little extra code, you can trivially retarget some database kernels to run directly on top of raw block devices, eliminating the redundant file system. It partly depends on the storage management requirements of the user, e.g. whether they expect to share block devices across unrelated applications. In my experience, the raw block device code is simpler and more reliable; there are many odd edge cases in Linux file system behavior that you must account for if you require robust and reliable storage on top of one.
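
A backend that treats a "funny looking file" and a raw block device interchangeably mostly needs one accommodation: `st_size` is zero for block devices, so capacity is probed with `lseek(SEEK_END)`. A hypothetical sketch of such a store:

```python
import os

class BlockStore:
    """Offset-addressed store that works identically on a regular file
    or a raw block device (e.g. /dev/sdb1) via pread/pwrite."""

    def __init__(self, path: str):
        self.fd = os.open(path, os.O_RDWR)
        # os.stat().st_size reports 0 for block devices;
        # seeking to the end gives the usable size for both cases.
        self.size = os.lseek(self.fd, 0, os.SEEK_END)

    def read_block(self, offset: int, length: int) -> bytes:
        return os.pread(self.fd, length, offset)

    def write_block(self, offset: int, data: bytes) -> int:
        return os.pwrite(self.fd, data, offset)

    def close(self) -> None:
        os.close(self.fd)
```

A real engine would add O_DIRECT with aligned buffers and its own caching; this only shows why "the only difference is the file descriptor type."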

There are some additional performance and behavioral advantages to working with the storage devices directly. Anecdotally, if you run databases on virtual machines (never recommended but many people do), using raw block devices instead of a file system often seems to eliminate much of the disk I/O weirdness that occurs under VMs.


> you can trivially retarget some database kernels to run directly on top of raw block devices, eliminating the redundant file system

e.g. in mysql: https://dev.mysql.com/doc/refman/8.0/en/innodb-raw-devices.h...
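
The gist from that page (device path and size here are illustrative): the InnoDB system tablespace is pointed at a raw partition, initialized once with the `newraw` suffix, then remounted with `raw`:

```ini
[mysqld]
innodb_data_home_dir=
# First start: InnoDB initializes the partition.
innodb_data_file_path=/dev/sdb1:3Gnewraw
# After initialization, restart with the line below instead:
# innodb_data_file_path=/dev/sdb1:3Graw
```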


It’s getting damned hard to avoid running a database on a VM these days.


Could you expand on what you mean by "weirdness" on VM disk I/O in the context of database storage?


The storage has anomalously high latency and throughput variance, with patterns you don't see on non-virtualized storage, plus a modest degradation in average performance. This is expected, but it makes it difficult to schedule I/O efficiently. It's more noticeable if you are doing direct I/O, because having a VM intercept your storage access defeats the purpose.

What was surprising is that the direct I/O behavior appears to be conditional on whether you are accessing the storage through a file system. My database kernel is block device agnostic, using files and raw devices interchangeably via direct I/O. Against expectations, when we accessed the same virtualized storage as raw block devices, the behavior was like bare metal, even though we were running the exact same operations over the same direct I/O interface in a VM. Basically, the only difference was the file descriptor type.

I'm guessing that file systems are virtualization aware to some extent and access through them is actively managed; raw device accesses are VM oblivious and simply passed through by the storage virtualization layer.


Agreed, there's already a certain trend towards e.g. etcd and co. for online configuration management.

On top of that, many issues you may be facing with files have already been resolved if you change the stack: you can't do transactions with a filesystem.


> what a mess doing what seem like simple file operations is

Proper handling and reporting of hardware-level errors all the way up through the stack (driver, block layer, filesystem, C library) to the application so it can recover in a reliable way is not a simple operation!

Simple operations are open/close/read/write. Those work. Until they don't; then you need to know how far back the failure reaches into operations you already did and "assumed" had worked. And in this case the promise made to PostgreSQL by fsync() wasn't as firm as the "obvious" interpretation of the documentation would lead one to believe.
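
The hard-won lesson from the fsyncgate episode the parent alludes to: a failed fsync() must be treated as potentially losing every write since the last successful one, because the kernel may mark the dirty pages clean anyway, so a retry can "succeed" without persisting anything. A hedged sketch of the conservative response:

```python
import os

def durable_append(path: str, data: bytes) -> None:
    """Append and fsync; on fsync failure, assume the write is lost
    rather than retrying fsync on the same descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        try:
            os.fsync(fd)
        except OSError:
            # The kernel may have already dropped the dirty pages, so a
            # second fsync proves nothing. Surface the error and let the
            # caller replay from its own durable log (as PostgreSQL now
            # does by PANICking and recovering from WAL).
            raise
    finally:
        os.close(fd)
```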


I don't doubt it's a hard problem. If there were a simple, obviously better way to do it, I imagine we'd have it by now.



