It's not even two years old and doesn't have a 1.0 release. It wouldn't be smart...

wesm · on Sept 26, 2017

Yikes! I disagree with you -- these are assertions made on no evidence. This software is suitable for production systems as long as you are OK with occasional API changes / deprecations as the software evolves. We are not far off from a 1.0 release, but release numbers are a synthetic construct anyway.

olympus · on Sept 26, 2017

I'm not putting the project down, but your comment "as long as you are OK with occasional API changes / deprecations" is exactly my point. Facebook doesn't want a system borking because an Arrow dev decided that some API feature was redundant or needed a new name. Presumably they have a test environment to check for that sort of thing. They certainly don't want to refactor a large code base because of some tiny API change.

If someone is using Pandas, Spark, or whatever for an important product, it's probably best for them to keep maintaining whatever underlying data layer they have until the Arrow devs (I guess that means you) are willing to commit to a somewhat stable API. A stable API and a relatively bug-free experience are what typically mark a 1.0 release.

There are plenty of smaller projects that should be perfectly happy to use the 0.7 release and grow/evolve as Arrow does. That's especially true when using Pandas+Arrow, since it's probably not a production environment and I can spare a few hours to fix a confusing bug.

wesm · on Sept 26, 2017

I disagree with your premise that production systems require API stability in all third-party dependencies.

dotancohen · on Sept 26, 2017

Stability does not imply immutability, but rather being established.

In new projects it is common for a method or class to change direction or role as it develops, which may bring with it a refactor of identifier names, parameters, and such. After the class has been used for some time in different contexts these changes happen less and less, hence we call the class stable. A 1.0 release implies this stability.

Third-party dependencies are still free to mutate their APIs, but when maintaining a production system you don't want to be playing whack-a-mole with API changes every point release.
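To make the "refactor of identifier names" case concrete, here is a minimal sketch of how a library might rename a function while keeping the old name working for a few releases. The names `read_columns`, `select_cols`, and `deprecated_alias` are hypothetical, made up for illustration, and are not part of Arrow or any real library.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name):
    """Return a wrapper that forwards to new_func and warns about old_name.

    Hypothetical helper for illustration only.
    """

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        # stacklevel=2 points the warning at the caller, not at this shim.
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    return wrapper


# Hypothetical library function under its new name.
def read_columns(rows, names):
    """Select the named columns from a list of dict rows."""
    return [{name: row[name] for name in names} for row in rows]


# Old name kept as an alias so existing callers keep working for now.
select_cols = deprecated_alias(read_columns, "select_cols")
```

With a shim like this, a production caller that still uses the old name keeps running and only sees a `DeprecationWarning`, which a test environment can turn into a hard error (for example with `warnings.simplefilter("error")`) to catch the rename before upgrading.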

dogruck · on Sept 26, 2017

I also disagree. I would agree with "won't use because the prod release has a lot of bugs." But APIs change.

lmeyerov · on Sept 26, 2017

It's useful to separate the Arrow file format, which standardizes passing columnar data between systems, from the native C library, which saves you from reinventing the wheel when doing so. Likewise, it helps to separate who this is meant to be used by: mainly framework devs, and only indirectly, in niche cases, data analysts/engineers.

For example, Apache Drill and the follow-on startup Dremio are in use in various places, and I believe they use the Arrow runtime. In contrast, when we worked on Graphistry/NodeJS <> MapD bindings, we did zero-copy data interop by agreeing on just the file format.

For people used to building frameworks like the above, Arrow is at a fine point. We hoped to use the runtime, but it wasn't necessary so far. More importantly, as framework builders, it got us past the typical decision of roll-your-own format via ~flatbuffers vs. dealing with orc/parquet. As a user, you'd be leveraging Pandas, Spark, etc., and only calling Arrow when occasionally talking between systems using your framework's internally supported interop layer. Data engineers will eventually be more exposed to this as they focus more on say streaming arrows vs non-streaming parquets, but that seems early for now.