HDF5eis: Storage IO solution for big multidimensional time series sensor data (geoscienceworld.org)
73 points by teleforce on May 30, 2023 | 55 comments


It would be nice if the hdf5 people could add better support for concurrency, a la https://www.hdfgroup.org/2020/11/webinar-enabling-multithrea...

I can't imagine there'll be a queue of people wanting to implement that for fun on their weekend


Mostly sounds like a bad idea.

HDF5 is on the way out anyway.


It's absolutely not "on the way out". Lots of very slow moving, very deep pocketed organizations working on very long time horizons are heavily invested in HDF5. The same can't be said for any of the flavor-of-the-week ML "cloud-native" file formats.


What’s the replacement?


HDF5 does so much there's really not a single replacement. The most prevalent is probably JSON from what I've seen, but SQLite, (C/T)SVs are also fairly popular. There's a ton of application-specific serialization formats too.


None of those (JSON, SQLite, CSV) are remotely appropriate for addressing HDF5's niche of storing large blocks of numbers in a manner that can approach their native space / perf characteristics.


Yea, HDF5 is great at what it is great at. The problems however come when people try to use it as a general data storage format and pseudo database.


Right, it does get abused, but I don't see anyone coming for its niche and I don't see its niche evaporating.


Lol yeah, I always reach for paperback books now since they're a great modern replacement for my Kindle.

On a more serious note, DuckDB is actually a pretty fantastic replacement for SQLite for some data analytics tasks. HDF5 is still great for numerical data in this space though. I would welcome other options like the SQLite/DuckDB scenario for matrix data.


If there is any replacement in the HPC world (and I don't really see indications for that), it's Adios2.


Arrow.


H5AD

:D


There is an MPI version of HDF5 no?


Yeah, but no multi threading concurrency, only multi process. Not even parallel reads. There's a giant mutex around the call to read data from a dataset and my colleague's take on it was that it would be months of work to unpick all the state and make even reads thread safe.



Over time, I've found it is quite possible to mitigate most of these issues. The default settings are not great, mostly because they maximize backwards compatibility. With a few settings you can consolidate the metadata so that the data layout is very simple.

The structure of a HDF5 file can be manipulated to put data where you want it in a file. For many files of the same type, you can set the metablock size so that your datasets are always at the same offsets.

You can use tools like h5ls or h5dump to describe the structure of a HDF5 file. It turns out that if you do this, you can make HDF5 cloud friendly.

https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with...


I was wondering if anyone else had this problem. I found HDF5 to be a nightmare to work with, and it's on my short list of technologies I refuse to work with anymore.


Yes. I had to save string data in a column, which is frankly not something you want to do. HDF5 is super weird about encodings, quietly introduces errors, requires a fixed character length for that column and if you want to quickly access or search through the data, you have to write a program because HDF5View (or whatever it was called) is the worst. And since you still have to manage multiple reads and writes, the only benefit is maybe speed.

String data and HDF5 = No.

It does have an index tho compared to Parquet, which is handy for ML type stuff.


I worked with HDF5 a bunch for physics simulation and it was extremely fragile. Performance was good on a distributed hard drive when reading and loading blocked data over many machines but files would corrupt whenever a compute node would crash. Maybe they fixed it since, that was in 2014 or so.

It's a quantum file format. It works so long as you aren't looking at it too much.


What do you recommend for binary structured data? (assuming that SQL-like isn't appropriate)


We're using the NPY format [1] at the International Brain Laboratory and we're pretty happy with it. It works well if you have few large arrays that you can store in different files within a folder. Another good option is Apache Parquet.

[1] https://numpy.org/devdocs/reference/generated/numpy.lib.form...
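A minimal sketch of that layout, assuming NumPy and invented file names: each large array lives in its own .npy file inside a per-session folder, and memory-mapped loading keeps reads cheap.

```python
# Sketch of a one-array-per-file layout: each named array in a dataset
# folder is stored as its own .npy file. Paths are illustrative.
import numpy as np
from pathlib import Path

dataset = Path("session_001")
dataset.mkdir(exist_ok=True)

# Write each large array to its own file.
np.save(dataset / "spikes.times.npy", np.sort(np.random.rand(1000)))
np.save(dataset / "spikes.clusters.npy", np.random.randint(0, 30, 1000))

# np.load with mmap_mode reads lazily, so large arrays need not fit in RAM.
times = np.load(dataset / "spikes.times.npy", mmap_mode="r")
print(times.shape)  # (1000,)
```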


Protobuf is my go-to and I have had no issues with it. If I had to do larger memory mapped files I guess flatbuffers would be the way to go.

Nympy has a reasonably sensible format, but I haven’t tried to do anything tricky with it.


Protobuf has caused me nothing but trouble once I have even moderately large data sets. If what you are storing is large matrices or rasters, honestly nothing beats HDF5.


I'm old school, from the Managing Gigabytes by Ian H. Witten era, so I have ways of tailoring the data and protobuf schema to suit my application. Large matrices are stored as contiguous chunks, similar to how tensorflow stores tensors in graphs. I work with TBs of data in GB chunks. I used to work with PBs of data but that was a long time ago. Deserialization of protobuf messages is so fast that it hasn't been an issue and I can use it from any language with protobuf support.
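A protobuf schema can't be sketched here without a compiled .proto, but the contiguous-chunk idea itself can be illustrated in plain Python/NumPy (all names below are invented for illustration): the matrix travels as one raw C-order byte blob, with shape and dtype carried as small metadata fields alongside it, much as a protobuf message might pair a `bytes` field with a few scalar fields.

```python
# Illustrative sketch (not the commenter's actual schema): store a matrix
# as one contiguous byte blob plus minimal metadata.
import struct
import numpy as np

def pack_matrix(a: np.ndarray) -> bytes:
    """Serialize as: ndim, shape, dtype-string length, dtype string, raw bytes."""
    a = np.ascontiguousarray(a)
    dt = a.dtype.str.encode()
    header = struct.pack(f"<B{a.ndim}qB", a.ndim, *a.shape, len(dt))
    return header + dt + a.tobytes()

def unpack_matrix(buf: bytes) -> np.ndarray:
    ndim = buf[0]
    shape = struct.unpack_from(f"<{ndim}q", buf, 1)
    off = 1 + 8 * ndim
    dtlen = buf[off]
    dt = buf[off + 1 : off + 1 + dtlen].decode()
    # Zero-copy view over the contiguous payload, then restore the shape.
    return np.frombuffer(buf, dtype=dt, offset=off + 1 + dtlen).reshape(shape)

m = np.arange(12, dtype=np.float64).reshape(3, 4)
assert np.array_equal(unpack_matrix(pack_matrix(m)), m)
```

The payload stays contiguous, so deserialization is essentially a pointer cast plus a reshape, which is why this style of chunked serialization stays fast at GB scale.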


I think that depends on the use case and full data lifecycle. How is data acquired and added to datasets? How is it consumed?

An online, database-like product is most important if you need to coordinate concurrent activity by a number of data producers and consumers while maintaining a coherent view of the growing data. If you can break it into distinct phases with individual actors, passive serialization formats can make more sense. Adoption of object-storage semantics would help eliminate some of the corruption/concurrency hazards mentioned in Cyrille's post. You write entire files and expose them to read-only consumers once they are valid and complete, side-stepping concurrent writer/reader scenarios.

However, object-storage still has coherence problems. If you expect metadata to need rounds of editing or curation, you don't want it embedded in bulk objects, where mutation is expensive, i.e. rewriting an entirely new version of the object. It is easier with a companion file strategy, where you can regenerate smaller metadata files alongside immutable bulk data files. The object store can ensure that concurrent users are only encountering a coherent snapshot of each metadata file, i.e. before or after it was replaced. But it does not provide coherent views of multi-file changes. You avoid low-level codec failures from concurrent writes to shared files, but you may still have semantic inconsistencies when interpreting the evolving multi-object dataset.

A hybrid approach would be to have some other data-management system with database-like properties to track the objects. The database stores a coherent catalog of which data objects are available. New objects are added and registered in the catalog for others to read, but you don't mutate them afterwards. The catalog should contain metadata needed to coordinate the workflows of production and consumption of objects. You could benefit from using the catalog as an authoritative store of metadata that is still being curated/refined. But you can also export snapshots of metadata into companion objects at appropriate milestones in the processing workflow. These could even be versioned to record well-defined "data releases" or provenance info related to individual processing tasks.

Of course, this still leaves the question of what file format(s) to use for the individual objects in the system... one could do all of the above and still choose to use HDF5! But these individual files would be simpler, i.e. to store a single N-dimensional numerical array in one file. Microscopy projects might choose OME-TIFF for individual image stacks, or something even more mundane like a directory of individual 2D TIFF or PNG files, if that maps well to how their acquisition or processing pipeline works. Sometimes, the right file layout is just as important as choosing chunking parameters within a format like HDF5. Naive pipelines often process whole files in RAM, so you end up choosing appropriate file layouts to align to the kind of sparse or sequential access needed by data producer and consumer programs.
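A minimal sketch of the catalog idea, assuming SQLite as the database-like component (table and column names are invented): objects are written once, then registered in the catalog for readers, and nothing is mutated after registration.

```python
# Sketch: a SQLite catalog registers immutable data objects only after
# they are complete; readers consult the catalog, never half-written files.
import hashlib
import sqlite3
from pathlib import Path

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE catalog (
    path TEXT PRIMARY KEY,
    sha256 TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'complete'
)""")

def publish(path: Path) -> None:
    # Write-once: hash the finished file, then register it atomically.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    with db:
        db.execute("INSERT INTO catalog (path, sha256) VALUES (?, ?)",
                   (str(path), digest))

obj = Path("block_0001.bin")
obj.write_bytes(b"\x00" * 1024)  # stand-in for a finished bulk object
publish(obj)

rows = db.execute("SELECT path, status FROM catalog").fetchall()
print(rows)  # [('block_0001.bin', 'complete')]
```

The checksum doubles as provenance: a later "data release" can snapshot the catalog into a companion metadata object without touching the bulk files.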


Oh wow, sounds very similar to trying to use Microsoft Access files for production at scale. Many of the same headaches with everything in a single file


>> everything in a single file

A file is just an abstraction over a block device. HDF5 is a meta abstraction, storing multiple file abstractions within an existing container file abstraction. There is nothing conceptually wrong with this, but it is incredible to me that anyone would invest the time into creating an abstraction that looks exactly like its container just to avoid having to tar it.


what are the alternatives to HDF5 then ? (I thought HDF5 was a de facto standard, I feel wrong now :-))


It is a de facto standard: you can tell because the author of that post suggests rolling your own simpler data format as an alternative.

There are indeed a lot of reasons not to use HDF5: if you have a lot of small records, are primarily storing strings, or don't share your data with many colleagues, there are better alternatives. But if you just want to dump a few hundred GB into one massive file and have decent IO in a format people can figure out, it works pretty well.


You cannot load or write (may only have been one of those operations, strangely) a file containing non-ASCII characters with the C/C++ libraries if you target Win.

If you target Linux, those work out of the box due to the file system doing the lifting.

With windows we had to use the short file name (e.g. FILEN~1.EXT) as a workaround.

Also, you have to watch out for what the library does when it writes to (Edit: Windows) network shares - we've seen it write NaNs where there were values in the data it should save, maybe a latency issue, maybe a configuration issue - but not something you get told about.

I would like to move away from it.


> You cannot load or write (may only have been one of those operations, strangely) a file containing non-ASCII characters with the C/C++ libraries if you target Win.

This is definitely not true, at least for C++, but I agree that it can be hassle.


I was pretty happy with it. But then, I was moving from something much worse that very few people use.




This data format can be used for Big Data Seismology:

https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2021...


At the risk of shoehorning SQLite into this as a solution, would it? Does SQLite handle this type of data (big multidimensional time series sensor data) well?


Relational DBs aren't great for time series data, which is why solutions like TimescaleDB, InfluxDB and other time series databases exist. Of course, it can work perfectly fine if you don't have too much data.

The fundamental problem with time series data is often that the insert pattern is the exact opposite of typical retrieval patterns. For example, you may insert 1000s of properties at once (most efficiently stored as interleaved data), yet typical access patterns involve obtaining a single property over many timestamps. Of course, it can work fine despite the inefficiency. If you have a few hundred sensors and are sampling them every minute, storing data for a month, it would likely be fine. If you plan on storing audio samples as records in a DB (i.e. one record per sample), it will fail.

As another commenter mentions, HDF5 doesn't really help with much of this. It is an unnecessary rabbit hole to some extent. It's mostly just a glitchy b-tree implementation with no WAL, plus some compression algorithms.
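The insert-vs-retrieval mismatch described above can be sketched with NumPy (sizes are arbitrary): writes land as contiguous interleaved rows, while reading one property back is a strided scan across every row unless you maintain a columnar copy.

```python
# Samples arrive interleaved: one row per timestamp, one column per
# property. Queries usually want one property across many timestamps.
import numpy as np

n_timestamps, n_props = 10_000, 1_000
interleaved = np.zeros((n_timestamps, n_props), dtype=np.float32)

# Insert pattern: each arriving sample writes one contiguous row.
interleaved[0, :] = np.random.rand(n_props)

# Retrieval pattern: one property over all timestamps is a strided read
# that touches every row. A columnar (transposed) copy makes it contiguous.
strided = interleaved[:, 42]            # stride = n_props * 4 bytes
columnar = np.ascontiguousarray(interleaved.T)
contiguous = columnar[42, :]            # one sequential block

assert strided.strides[0] == n_props * interleaved.itemsize
assert contiguous.strides[0] == interleaved.itemsize
```

On disk the same geometry applies: row-oriented storage makes the write cheap and the per-property read expensive, which is the trade time series databases exist to manage.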


It's not correct to say that relational DBMS are not suitable for time series.

ClickHouse is a relational DBMS, and it works better for time series than specialized systems (like TimescaleDB or InfluxDB). You can quickly get to trillions of records with time series, and it is not a problem. For example, see https://www.youtube.com/watch?v=JlcI2Vfz_uk

If you try TimescaleDB in this scenario, it will barely be crawling.

PS. I am a developer of ClickHouse, and I use it for these scenarios all the time.


I guess the parent just used the wrong names. They meant that OLTP databases are not good for the use case, but OLAP databases designed for analytical purposes, like ClickHouse or Timescale, are perfect for it. However, Timescale is not a real OLAP database; it's something in between.


When I say "Relational DB's aren't great for time series", I mean Postgres, MySQL, Oracle and SQL server just using traditional tables.


I'm gonna try ClickHouse since you posted.


No. The issue is that you quickly have trillions of rows. You don't need a relational DB setup or fancy non-relational DB solutions. Both are well-known anti-patterns. One row per sample falls over quickly for regularly sampled data.

What does work for large regularly sampled datasets is array storage solutions like hdf, or image formats, etc etc.

It's the same reason and same approach for your vacation photos. You wouldn't put each pixel in the db, but you might very well have a db with a row for each photo.

Use a db for the non-regular portions (e.g. an index of datasets) and array storage for the actual data itself.


I've used it for this. It worked way better than it had any right to. Obviously you don't use one sample per row like `jofer` assumed, but you can easily build something on top of SQLite to store compressed blocks of matrices as rows. That's how HDF5 works too (it splits matrices into blocks and compresses them independently).

It's obviously a bit more work than relying on HDF5 to do it for you, but not that bad since you don't need to implement a full general solution.
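A hedged sketch of that approach, assuming SQLite plus zlib (the schema is invented): the matrix is split into fixed-size row blocks, each compressed independently and stored as one BLOB row, so random access decompresses only the block it needs.

```python
# Store a matrix as independently compressed row blocks in SQLite,
# roughly mirroring HDF5's chunked-and-compressed layout.
import sqlite3
import zlib
import numpy as np

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE blocks (block_id INTEGER PRIMARY KEY, data BLOB)")

matrix = np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000)
rows_per_block = 100

with db:
    for i, start in enumerate(range(0, matrix.shape[0], rows_per_block)):
        chunk = matrix[start:start + rows_per_block]
        db.execute("INSERT INTO blocks VALUES (?, ?)",
                   (i, zlib.compress(chunk.tobytes())))

# Random access: fetch and decompress only the block containing row 555.
blob, = db.execute("SELECT data FROM blocks WHERE block_id = ?",
                   (555 // rows_per_block,)).fetchone()
block = np.frombuffer(zlib.decompress(blob), dtype=np.float64).reshape(-1, 1000)
assert np.array_equal(block[555 % rows_per_block], matrix[555])
```

Because each block is its own row, you also get SQLite's transactional writes for free, which is the part HDF5 famously lacks.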


So the 5 is an S, and it's seismic? Nifty.


wonder how it compares with ROOT or Parquet...


I think only ROOT would be comparable. I, perhaps naively, think of Parquet as just a binary csv/table format.

A modern alternative would be Zarr[0], whose creators were frustrated with the complexity of HDF5 and wanted something more amenable to working with S3 storage. Never worked with it, but the ideas are laudable. Then again, I was never all that frustrated with HDF5. I wish the exploratory tooling was better, but that is about the end of my complaints.

[0] https://zarr.readthedocs.io/en/stable/index.html


ROOT is most comparable. Iceberg+Parquet/ORC/Avro is comparable to (and in some ways better than) the multi-file catalog-like features of HDF5.

There are some cases where data really needs to be structured, especially closer to instruments producing data. The thing is, some or all of that data is going to end up in a database or a dataframe sooner or later.


I am involved in a project producing say 100 channels of 32 bit measurements that have to be roughly aligned with a series of camera images, or a video stream. My work is unconnected to the data storage question, but I've been thinking about it as an interesting question I don't know enough about. I've thought about trying to implement something on a small scale as a learning experience in my spare time, probably with limits on the total storage/duration. HDF5 looked like an interesting option but I've heard about issues. Do you have any suggestions for tools to look at or reading that might help push me toward doing (or not) such a project?


It depends on how much data you have and what the rest of the stack looks like. If you have O(GB) of data that needs to be accessible in one python session you could just try out h5py:

https://pypi.org/project/h5py/

If you're using another language there might be a more appropriate high level library.
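A minimal h5py sketch along those lines (file and dataset names are illustrative): chunked, compressed datasets let a later session read just a slice of the channel data without loading the whole file.

```python
# Sketch: store ~100 channels of samples alongside camera frame times,
# chunked so slices can be read without loading everything.
import h5py
import numpy as np

with h5py.File("session.h5", "w") as f:
    f.create_dataset("channels", data=np.random.rand(100, 60_000),
                     chunks=(100, 1000), compression="gzip")
    f.create_dataset("frame_times", data=np.linspace(0, 60, 1800))

with h5py.File("session.h5", "r") as f:
    window = f["channels"][:, 10_000:11_000]  # reads only the needed chunks
    print(window.shape)  # (100, 1000)
```

Aligning measurements to frames then becomes a searchsorted over `frame_times`, which stays cheap at the 1-10 GB scale you mention.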


Yeah I guess I'd limit myself to 1-10GB just for sanity and probably use python. This isn't for customers or anything, just something that both caught my attention as a project and something I need to learn a fair amount to work on. I don't know enough about managing the data and need to figure out where to start. All the measurements need context from times before and after to be useful and I feel like I'm missing or misunderstanding some basic ideas.

Thanks


Zarr is gaining quite a bit of traction in bioimaging, it makes a lot of sense compared to proprietary file formats or millions of .tiff files which are both common scenarios.


Maybe I will revisit. I defer to being conservative when it comes to adopting data storage formats. I know I will be able to read HDF5 10 years from now. Zarr does not give me that same safe feeling yet.


In my experience HDF5 is like a simplified version of ROOT: doesn't provide a full data analysis suite, graphics library, or embedded compiler. It won't assume that everything in your data maps to a C++ type, or that every C++ type should be serializable as data. It just does data, for better or worse.

It also has a spec, which can be useful.


I'm familiar with HDF5, but I wonder how the performance of this solution built on top of HDF5 compares to ROOT (usually HDF5 does not compare favorably, but that's probably mostly due to not storing data in a chunked columnar way, which should provide similar benefits to ROOT).


The comparisons I've seen have generally been comparing ROOT on a problem it was optimized for to HDF5 (or whatever) out of the box on the same problem. There are a half dozen knobs to turn in any of these libraries, if you don't turn them I'd be skeptical of any general conclusions.



