
We are really looking for serialization libraries that will work with pandas and scikit.

This stuff is really all over the place - PMML, Arrow, Dill, pickle.

Some stuff won't work with one or the other. I will actually pay for consistency over performance.

There are way too many primitive serialization libraries. Surprisingly, none for the higher-order ML etc. stuff.

Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.



So stuff like this or marshmallow is more for cases when you have some database / ORM objects and you want to serialize them out to a json object, or you want to process form/POST data into a well-structured json or database object.

For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say that the defacto standard for that is the HDF5 standard, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood and I've been using this with hundreds of millions of rows with no problem.
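As a minimal sketch of the pandas/HDF5 round trip being described (requires the `tables` package; the filename and column names are just illustrative):

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(5), "y": list("abcde")})

# Write and read back through HDF5; pandas drives PyTables under the hood.
path = os.path.join(tempfile.mkdtemp(), "data.h5")
df.to_hdf(path, key="table", mode="w")
restored = pd.read_hdf(path, "table")

assert restored.equals(df)
```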

Arrow is a bit different: it's a specification for the in-memory layout of data that enables faster computation. It's about what happens when you have data in memory and want to use it with another tool; serializing, deserializing, and munging formats is a waste of time if tools can standardize how they store dataframes in memory and work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to processing tools like pandas), but a way of saving and loading that in-memory format to and from disk efficiently and interoperably. (https://github.com/wesm/feather)

Also of note is parquet, which has similar goals to HDF and feather, but the continuum / dask people have been working on a wrapper for that called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF. This is also one of the hadoop ecosystem defacto standards for storage formats, which again is good for interop.


Do you know of a source that compares these different libraries in terms of capabilities, focus/use cases, size limits, performance, format support, etc.?

Googling turned up very little for me.

TIA

Edit: libraries mentioned in thread:

PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)


No, I don't, but some of these are apples and oranges; that was part of my point. You're conflating many different types of things.

Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns or only certain rows that match a predicate or fall within a range of indexes. These can store hundreds of GB, no problem. They often have some sort of compression, like LZ, Snappy, or Blosc, that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the python libraries that write them. For this, I'd default to pytables / HDF5, barring some specific use case where you'd already know what other one you need.

Dill / pickle are for serializing generic python objects. I wouldn't really use them to store anything big, but they're very convenient for complicated data structures, like hierarchies of objects and classes, e.g. to save the current running state of your program. You don't have to think about storage formats and layouts and serialization routines; if you have a list of python objects, you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.
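To illustrate the convenience: pickling an arbitrary nested structure with the stdlib, no schema or format decisions required (the class and field names here are made up for the example):

```python
import pickle
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# An arbitrary nested structure: dicts, lists, namedtuples.
state = {"points": [Point(1, 2), Point(3, 4)], "meta": {"run": 7}}

# Round-trip through bytes; pickle handles the nesting automatically.
blob = pickle.dumps(state)
restored = pickle.loads(blob)

assert restored == state
```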

PMML seems like an XML based format specifically for trained machine learning models. Don't really know much about this.


McKinney has been hard at work getting parquet and arrow support in pandas.

http://wesmckinney.com/blog/outlook-for-2017/

> Given the kind of people behind Arrow, I would love a wrapper that uses Arrow to do all of this... but it doesn't matter at the end of the day.

pyarrow; pyarrow.parquet (which uses parquet-cpp).


Wow, this is great. I've been working around the JVM to integrate sklearn and some Spark jobs that produce Parquet. This is a huge relief.


Arrow doesn't do scikit, at least not the last time I checked. Has that changed?


pyarrow has methods to convert to pandas, which scikit supports

http://pyarrow.readthedocs.io/en/latest/pandas.html


No - this is not it. Scikit models need to be persisted. The only ways I have found are pickle or dill.

Take a look at this to understand what I mean: http://stackoverflow.com/questions/32757656/what-are-the-pit...
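For the record, the pickle route the parent describes looks roughly like this on a toy model (the data is made up; the linked SO question covers the caveats, like version compatibility between dump and load):

```python
import pickle

from sklearn.linear_model import LogisticRegression

# A tiny toy dataset, just to have something to fit.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# Persist and reload the trained estimator.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

assert list(restored.predict(X)) == list(model.predict(X))
```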


Python's data infrastructure has a huge problem: serialization, and thus saving analysis results.

A good serialization library should serialize:

  - classes/objects (best practice: objects for holding data)
  - pandas/numpy objects (must have: minimizing space)
  - namedtuples (currently: a mess, factory implementation)
  - dicts and lists of dicts (must have: space efficiency)
Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this; it limits use in real data analysis environments and limits competition with Matlab.


> Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

If you want matlab files in Python you can use `scipy.io.loadmat('file.mat')`. PyTables (built on hdf5) is a better solution since the hdf5 format is a lot more flexible than matlab's (ime). But Parquet is looking to be the best solution moving forward as it's gaining a lot of mindshare as the go-to flexible format for data and will be / is used in Arrow.
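A minimal sketch of that round trip through scipy (the filename and variable name are illustrative):

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

data = {"A": np.arange(6, dtype=float).reshape(2, 3)}

# Write a .mat file and read it back; note loadmat also returns
# bookkeeping keys like __header__ alongside your variables.
path = os.path.join(tempfile.mkdtemp(), "demo.mat")
savemat(path, data)
loaded = loadmat(path)

assert np.array_equal(loaded["A"], data["A"])
```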

But really, Matlab is on par with pickles when it comes to serialisation. It's a trap solution.


Actually, since Matlab v7.3, .mat files are HDF5 files.


To expand on fnord: to my knowledge, pickle handles all of these things. It's still a bad solution, but it does everything you want.

    pickle.dump(anyobject, f)
    anyobject = pickle.load(f)


Pickle has size constraints that make it unsuitable in certain ML applications.


Indeed, but I expect that's also true for matlab's vanilla solution.


Does using protocol version 4 help with this?


Thanks for expanding on that, mhneu. So our primary focus with Kim has certainly been around serializing/marshaling JSON, though we've used it for plenty of other use cases.

It's great to get a view of other problems people are experiencing.

Now that we've finished wrapping up 1.0.0, we're going to spend some time on the roadmap of new features. I personally feel variation in use cases from our own is only going to help make Kim better, so we'll defo look into this problem some more in the near future. Right now, though, I couldn't say for sure what Kim would have to offer when working with pandas etc., as we've simply never tried.


defo?


definitely.


I believe the benchmark is set by R and its RData format. It saves everything in the R domain: ML models, dataframes, everything.

Works pretty well. I know of large financial firms that are using this in production to load trained models hundreds of GB in size.


One of the things we felt very strongly about when developing Kim was that simple things should be simple and complex things should be possible. To that end, the Pipeline system behind the Field objects really does allow anything to be achieved, whether that's producing values from composite fields or handling unique or non-standard data types.

It would be great if you could share some ways you specifically need serialization to work for something like pandas, or better yet, some ways existing solutions don't work with pandas. We've had some pretty unique requirements ourselves and have not found any blockers yet.

Thanks for the message.



