It's disappointing that most data analysis/ETL tutorials on the internet have converged on the same CSV/pandas tactics over the past decade, when newer tools demonstrated here, such as DuckDB and Arrow, offer practical advantages without much code-complexity overhead.
This post also links to another discussion about the Parquet data format (https://pythonspeed.com/articles/pandas-read-csv-fast/), which is likewise supported by Arrow and extremely useful, yet I rarely see anyone talking about it. Granted, Parquet data can't natively be imported into Excel, which is likely the main reason.
I would say most of the tooling around Parquet sucks. A lot of simple use cases (head, sample, merge, etc.) are locked behind rather heavyweight ecosystems (Hadoop, Spark, or other JVM dependencies). Newer tools are starting to appear that make the situation better.
Is Parquet better/faster/stronger than keeping everything in schemaless CSVs? 100%, but it has historically meant that I have to make trade-offs to benefit the tooling rather than how I want to interactively approach my data.
Most people do not have "big data" problems where the performance differences of Parquet vs csv matter.
Shameless plug: my project exposes the common Parquet operations as a Rust CLI tool that uses the Rust API for Apache Arrow and can be used without any Java/Hadoop/Spark dependencies. It's also available as a static binary.
That looks great! Exactly the kind of tooling required to allow me to feel more comfortable locking up human readable data into a blob. Being a single Rust binary is just the cherry on top.
I might submit a few feature requests, but one that immediately comes to mind: csv -> parquet. Perhaps out of scope for the original vision, but having a single utility that could roundtrip data would be fantastically useful.
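In the meantime, a minimal pyarrow sketch covers the csv -> parquet direction (file names here are just placeholders):

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # read the CSV into an Arrow table (column types are inferred)
    table = pv.read_csv("data.csv")

    # write the same table back out as Parquet
    pq.write_table(table, "data.parquet")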
I hope that DuckDB can help here! You can use the CLI or Python client to read a parquet file, but my favorite for quick exploration is popping open DBeaver (a SQL IDE) and using the DuckDB JDBC driver. That way you can see some column metadata, and even open multiple files, etc. I wrote a short guide for that here:
https://duckdb.org/docs/guides/sql_editors/dbeaver
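If you'd rather stay in Python, a quick sketch for peeking at a file and its column metadata (the path is a placeholder):

    import duckdb

    # inspect the Parquet schema without loading the data
    print(duckdb.query("SELECT * FROM parquet_schema('events.parquet')").to_df())

    # peek at the first few rows
    print(duckdb.query("SELECT * FROM 'events.parquet' LIMIT 5").to_df())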
These aren’t really new though. These are simply bringing old-school reliable SQL directly into Python. Hopefully Python “data” people will stop using godforsaken tools like pandas or spark and finally embrace what the rest of us have been using for decades.
Are you kidding?! SQL is an ancient mess with different dialects and idiosyncrasies. What bringing a query engine into Python gives is a functional query DSL. DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future: you can give names and reuse subqueries, you can programmatically generate query trees, you get all of the power of a real programming language for analytics, and it all gets boiled down to fast implementations of relational operators.
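For the curious, a rough sketch of that style of API (file and column names are made up, and method names may differ between DuckDB versions):

    import duckdb

    con = duckdb.connect()

    # each call builds a node on top of the previous one; nothing executes yet
    events = con.from_parquet("events.parquet")          # a named, reusable relation
    recent = events.filter("ts >= DATE '2022-01-01'")    # filter node
    top = recent.order("value DESC").limit(10)           # sort + limit nodes

    print(top.df())   # only now is the plan optimized and executed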
As someone who is fluent in Spark, pandas, and several SQL "dialects", at the end of the day I have to say I think all the query paradigms suck, SQL is just the best we have.
I find its "limitations" generally enforce structure that makes rewriting and distilling bad queries easier. Well-written and well-formatted queries can be extremely readable and easy to navigate/grep in a way that I have never found the others to be.
As for the flexibility of dropping in and out of a full-fledged programming language and a query API, I find it tends to grow unnecessarily complex in the hands of many practitioners.
That said, my preference is a SQL cursor or equivalent interface in my chosen language. For some reason I find the strict separation between SQL (the declarative expression of business logic and relations) and the language (imperative or functional control) very helpful.
SQL also has the benefit of portability. Almost every query/computation engine supports SQL these days. While you will obviously need to rewrite the parts that rely on vendor-specific extensions, the migration path is greatly simplified.
The one thing I think pandas has going for it that I desperately wish were picked up as a new standard in SQL is its aggregations for seamlessly moving between different time-series frequencies. As problematic as pandas is, I have yet to find anything else that makes time-frequency conversions as convenient and predictable.
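For anyone who hasn't used it, the pandas feature I mean is resample(); a toy sketch:

    import pandas as pd

    # a series sampled every 15 minutes
    idx = pd.date_range("2022-01-01", periods=96, freq="15T")
    s = pd.Series(range(96), index=idx)

    daily = s.resample("1D").mean()        # down-sample to daily means
    hourly = daily.resample("1H").ffill()  # up-sample back to hourly, forward-filling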
> DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future
DuckDB is a convenient drop-in for OLAP in the way that SQLite is for OLTP, it's a great library. Python's death on the vine as a data analytics platform cannot come fast enough. There's a reason people are abandoning pandas and spark/databricks in droves and fleeing to useful SQL-like tooling such as Snowflake. Once this tech bubble bursts, and the tens-of-thousands of subsidized Python-packing data scientists/engineers/analysts get laid off, the language will return to what it should be: another scripting language that generates SQL for talking to a real data engine.
First of all, thanks for sharing this OP! So glad to see a way to query a df using SQL without further transformation.
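For anyone who hasn't seen it, DuckDB can scan a pandas DataFrame in the local scope by name, roughly like this:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # DuckDB resolves the table name `df` to the local DataFrame (replacement scan)
    result = duckdb.query("SELECT b, sum(a) AS total FROM df GROUP BY b").to_df()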
Arrow has been truly revolutionary in this regard, providing a solid in-memory data format (with performant APIs in many languages) for interchange between different engines and even formats.
You can go from ORC to Parquet to CSV on a local FS or S3.
With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost. Now if only someone would integrate vaex with DuckDB, it would make powerful Apple Silicon machines a compelling alternative to running a full-fledged Spark/Hadoop cluster.
> With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost
Isn't the whole purpose of Athena to scale to large amounts of data that don't fit into memory? How does duckdb fit in here? I thought it's an in-memory database?
DuckDB is not only an in-memory DB - it can persist data and handle larger-than-RAM workloads. We are still expanding the number of operators that can handle larger-than-memory data, but it is a key design goal!
Yep, dask_sql is probably a better comparison here (and similarly Arrow-friendly). We've been having pretty good experiences with it, though it's clearly young. For in-process work we're sticking with pandas/cudf (GPU dataframes), and likewise still Arrow for IO interop, so hand-off is easy.
I find the posts are usually better quality than the standard HN submission, and even more so, the HN comments on them are better. A complete generalisation, but those replying to data-tools posts are more thoughtful in their comments and exhibit less absolutism than I see on software-development-related posts.
Yes, but then I have to:
1. use Python,
2. deal with the nightmare that is Python packaging, and
3. hope I don't run out of memory while Python does its thing.
Getting non-technical people to do all that (a project manager wanting to see some simple stats for example) becomes difficult. Having a single binary that's easier to distribute, faster to execute and can be plugged into an existing query tool (like DataGrip) is a huge win for actual usability.
Can someone explain what the purpose of DuckDB is? From what I gather from the docs, it's a column-oriented in-memory SQL database? So I would use it over something like in-memory SQLite simply because it's faster due to vectorization? Or are there other aspects that I'm missing?
Like SQLite, it's not necessarily in-memory, it has an on-disk format as well. I haven't given it a massive workout for bigger-than-memory workloads but in theory they are possible.
With that said, the resident memory size of stuff-in-parquet/arrow/duckdb is a lot less in practice than stuff-in-pandas (I know) and stuff-in-sqlite (I believe), so it still enables more in-memory workloads than you might otherwise be able to do.
I really like that you can use DuckDB to SQL-query an on-disk directory structure of Parquet files with no preloading into RAM or other DB formats. Super useful/quick and only one line of code! Instead of selecting from a table you just pass a glob pattern into the SQL - and since it's a one-liner I can use it in ad-hoc notebooks too. It even has a to_df() method on the query result so you can get it into pandas for further manipulation.
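A minimal sketch of what that looks like (paths and column names are placeholders):

    import duckdb

    # query a whole directory of Parquet files in place, no preloading
    df = duckdb.query(
        "SELECT user_id, count(*) AS n FROM 'data/2022/*.parquet' GROUP BY user_id"
    ).to_df()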
As you noted, it's column-oriented, so designed for OLAP workloads rather than OLTP workloads. Think data warehousing, BI, Big Data analytics, feature engineering ...
But isn't it in-memory? In my experience, data warehousing, BI, analytics, etc are usually defined by large data in data lake or warehouse that doesn't fit into memory. How would duckdb help here?
DuckDB can certainly handle larger than memory data! It's a streaming engine, so it doesn't need everything in memory all at once. Some operators (like sorting) can also buffer out to disk if you run low on memory mid-computation. We are working to expand that to more operators! It also can selectively read from parquet for only the columns and row groups that you need. (Disclaimer - I do documentation for DuckDB!)
When pandas can’t manage a data set, I would convert to hdf5 and use the vaex pandas-like API. This is very fast and memory efficient (out of core) for my typical 40-70GB data set size and reportedly remains fast up to 1 TB.
The key ingredient is the on-disk format: hdf5 and more recently also arrow.
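The workflow is roughly this (a sketch; the paths and the existing pandas frame pdf are placeholders, and vaex's API may differ by version):

    import vaex

    # one-time conversion: pandas DataFrame -> hdf5 on disk
    vaex.from_pandas(pdf).export_hdf5("data.hdf5")

    # afterwards vaex memory-maps the file and works out of core
    df = vaex.open("data.hdf5")
    print(len(df))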
Looks like DuckDb would force me to use SQL, which is not what I typically want. That being said, if this offered better data management than a binary blob like hdf5, it might be worthwhile. If the use case requires processing historical data for example, a real DB would be better than juggling 10 x 40GB files or having one big 400GB file that keeps growing.
Plus, if you are working in Python, you can use DuckDB as the engine underneath Ibis, Fugue, Siuba, or anything that works with SQLAlchemy (using the DuckDB-engine driver)! In R, you can use dplyr or dbplyr.
DuckDB's file format is one way to persist data (it uses a single file), but you can also write out to Parquet, or write out to Apache Arrow and then parquet (in a partitioned format I believe).
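For the Parquet route, DuckDB's COPY statement does it directly; a rough example (my_table is hypothetical):

    import duckdb

    con = duckdb.connect("analysis.duckdb")  # persistent single-file database

    # export a table (or any query result) as Parquet
    con.execute("COPY my_table TO 'my_table.parquet' (FORMAT PARQUET)")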
I've been making these comparisons between the new dataframe libraries in python so a couple of comments:
"Reading the full CSV without datetime parsing is in line in terms of speed though."
This sentence was a bit ambiguous, but is important: if you read this file in pandas with engine='pyarrow' but don't convert the date/time column to a pandas datetime dtype you get the same ~100 ms read time as calling PyArrow directly. So basically the entire time is spent converting the strings to dates. If this datetime thing isn't an issue for you then you can just use the engine='pyarrow' argument to read CSVs with Pandas.
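Concretely, the fast path looks something like this (file and column names are placeholders):

    import pandas as pd

    # multi-threaded Arrow CSV reader; the date column stays as strings
    df = pd.read_csv("big.csv", engine="pyarrow")

    # the expensive part: converting strings to a datetime dtype
    df["date"] = pd.to_datetime(df["date"])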
In my own tests with various datasets Polars has always been much faster than duckdb/pyarrow. For this relatively small dataset it's about 2x faster, which is about the smallest margin I've found. Polars is also much easier to write, as its query optimization is so effective - you don't need to know all the tricks that pandas requires (and it's still 3x-10x faster than pandas even when you apply all the tricks).
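As a rough illustration of the lazy API (file and column names are made up, and details vary with Polars version):

    import polars as pl

    # nothing is read until .collect(); the optimizer prunes columns and pushes filters down
    result = (
        pl.scan_csv("big.csv")
        .filter(pl.col("value") > 0)
        .group_by("category")
        .agg(pl.col("value").mean())
        .collect()
    )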
I've been thinking about this lately at work, where we do a bunch of stuff with a pile of python scripts built around csv files and pandas.
This post prompted me to give pyarrow a spin and WOW did it load the CSVs fast compared to what I'm used to. That said, I'm not entirely sure what to do with them from there. When I tried table["col"] it gave back a list of lists (maybe because of the multi-threaded loading?). And while the article talked about "querying" the arrow files with duckdb, how easy is it to work with them otherwise, in terms of updating them? And what about non-SQL style updates? Sometimes you just have to do an itertuples or apply and do random non-declarative modifications to the data.
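For reference, I think what came back is a pyarrow ChunkedArray rather than a plain list (one chunk per read batch), and something like this seems to flatten it (a sketch):

    import pyarrow.csv as pv

    table = pv.read_csv("data.csv")

    col = table["col"]            # a ChunkedArray, hence the list-of-lists look
    flat = col.combine_chunks()   # merge the chunks into a single Array
    series = col.to_pandas()      # or hand it straight to pandas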
This is mildly off topic, but I am very unhappy with Pandas. Every single API feels bolted on without any consideration of composability or ergonomics. After spending 4 years with a much better proprietary library I cannot deal with arbitrary functions I have to learn like “value_counts” or whatever the output of a “groupby” is.
My go-to these days is Polars. You get good performance since it uses Arrow under the hood. Couple that with built-in lazy evaluation and its API design and it's pretty good for me. There are some caveats you need to be aware of, though: it doesn't always work as a drop-in replacement for pandas.
I tried Polars, but feel like it's not ready for production yet.
It's fast! The API is nice. But the documentation is not great. With pandas I can just open the docs and find all the functions and examples of how to use them, etc. Polars felt really bare.
It's something I'd like to use in personal projects. But other data scientists would not accept using polars as it is today.
Author of Polars here. What would you say the docs are lacking? Quite curious how we should fill the gap. I understand the gap is large due to pandas's 12 years of being the standard for Python; there is a ton of material for pandas.
We hope to have such a clear API that once the expression language clicks, you need the docs open less, as it should feel natural.
Oh hey! I should know better than to take a potshot at software in hackernews...
I used it a month or two ago, so I don't remember specifics, but:
The docs weren't unclear, they were just bare compared to pandas.
Once I found an example of how to do something, I managed to do it. But a couple of times I opened a page and there was a WIP message.
There's definitely a bit of unfairness in the comparison, as pandas, as you say, has been the standard for Python for years. But that means I already know all the syntax and how the API works, so even just a method name or required params will get me a long way.
When I was fumbling around with polars I would have liked more examples of how to use each method/function. I mean, once the language clicks I wouldn't need it anymore. But at the start I really need that hand-holding. I'd have to reread other pages to remember simple syntax.
Hi @ritchie46 - I have just written [raku Dan](https://github.com/p6steve/raku-Dan) as a way to scratch the "data analytics" itch in a new way -- my next step is to write Dan::Polars as a polars binding via (eg.) raku NativeCall. Can you point me to a good recipe for success? ~p6steve@furnival.net
Pandas is an absolute horrorshow: poor performance, inconsistent API, terrible implicit behavior leading to footguns.
And everyone uses it because it's what you do when your boss tells you "we're transforming the analytics team, you're all to become data scientists because everyone has data scientists now". You just grab whatever had the biggest mindshare on SO and in random yt tutorials. Can't blame them.
But hooo boy does pd get on my nerves.
Care to share what proprietary stuff you were using?
I can't really share in any detail, I think, but the best part was that "Series" were immutable and had sorted keys (indexes). Essentially they were (math) functions, so "indexes" had unique elements. All the important bits had fast numpy/Cython implementations, but the semantics were good because of unique keys.
Honestly I still feel like I'm missing some sort of larger story about the semantics of Pandas (like the "functions" explanation above), so if anyone knows of anything that made Pandas click, please let me know.
I had time to read through this, thank you for linking it to me. I like method chaining, and appreciate the terseness of the syntax. I don't think it addresses my main complaint (and others' complaints below) about Pandas. I'll write a lengthy comment here in the hopes that a Pandas/Polars maintainer sees it and gets some inspiration.
One big issue I see with Pandas is that it assumes that I want multidimensional keys represented via axes labels on some sort of grid (or hypercube). This is evident from the __getitem__ interface, which is frankly confusing. In reality, selection is the most important detail, and managing columns is just a nice-to-have. Pandas's MVP is the Series object, not DataFrame.
In article #1 in your link, "SettingWithCopy" is emblematic of the issue that happens when you allow mutability, so mutability delenda est. This is why FridgeSeal is confused below. One day I'll write a blog post about ways Pandas could be better, but in the meantime I will probably monkey patch a lot of the nonsense out.
And the view-vs-copy shenanigans - a bunch of the APIs will print warnings about using the wrong one and direct you to some docs, but they don't actually tell you whether you've just viewed or copied, and the docs don't make it particularly clear which API does which either.