It's disappointing that most data analysis/ETL tutorials on the internet have converged on the same CSV/pandas tactics over the past decade, when newer tools demonstrated here, such as DuckDB and Arrow, offer practical advantages without much code-complexity overhead.
This post also links to another discussion about the Parquet data format (https://pythonspeed.com/articles/pandas-read-csv-fast/), which is likewise supported by Arrow and extremely useful, yet I rarely see anyone talking about it. Granted, Parquet data can't natively be imported into Excel, which is likely the main reason.
I would say most of the tooling around Parquet sucks. A lot of simple use cases (head, sample, merge, etc.) are locked behind rather heavyweight ecosystems (Hadoop, Spark, or other JVM dependencies). Newer tools are starting to appear that make the situation better.
Is Parquet better/faster/stronger than keeping everything in schemaless CSVs? 100%, but it has historically meant that I have to make trade-offs to benefit the tooling rather than how I want to interactively approach my data.
Most people do not have "big data" problems where the performance differences of Parquet vs csv matter.
Shameless plug: my project exposes the common Parquet operations as a Rust CLI tool that uses the Rust API for Apache Arrow and can be used without any Java/Hadoop/Spark dependencies. It's also available as a static binary.
That looks great! Exactly the kind of tooling required to allow me to feel more comfortable locking up human readable data into a blob. Being a single Rust binary is just the cherry on top.
I might submit a few feature requests, but one that immediately comes to mind: csv -> parquet. Perhaps out of scope for the original vision, but having a single utility that could roundtrip data would be fantastically useful.
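In the meantime, a minimal pyarrow sketch covers the csv -> parquet direction (file names here are just placeholders):

    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    # read the CSV into an Arrow table (column types are inferred)
    table = pv.read_csv("data.csv")

    # write the same table back out as Parquet
    pq.write_table(table, "data.parquet")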
I hope that DuckDB can help here! You can use the CLI or Python client to read a parquet file, but my favorite for quick exploration is popping open DBeaver (a SQL IDE) and using the DuckDB JDBC driver. That way you can see some column metadata, and even open multiple files, etc. I wrote a short guide for that here:
https://duckdb.org/docs/guides/sql_editors/dbeaver
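If you'd rather stay in Python, a quick sketch for peeking at a file and its column metadata (the path is a placeholder):

    import duckdb

    # inspect the Parquet schema without loading the data
    print(duckdb.query("SELECT * FROM parquet_schema('events.parquet')").to_df())

    # peek at the first few rows
    print(duckdb.query("SELECT * FROM 'events.parquet' LIMIT 5").to_df())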
These aren’t really new though. These are simply bringing old-school reliable SQL directly into Python. Hopefully Python “data” people will stop using godforsaken tools like pandas or spark and finally embrace what the rest of us have been using for decades.
Are you kidding?! SQL is an ancient mess with different dialects and idiosyncrasies. What bringing a query engine into Python gives is a functional query DSL. DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future: you can give names and reuse subqueries, you can programmatically generate query trees, you get all of the power of a real programming language for analytics, and it all gets boiled down to fast implementations of relational operators.
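For the curious, a rough sketch of that style of API (file and column names are made up, and method names may differ between DuckDB versions):

    import duckdb

    con = duckdb.connect()

    # each call builds a node on top of the previous one; nothing executes yet
    events = con.from_parquet("events.parquet")          # a named, reusable relation
    recent = events.filter("ts >= DATE '2022-01-01'")    # filter node
    top = recent.order("value DESC").limit(10)           # sort + limit nodes

    print(top.df())   # only now is the plan optimized and executed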
As someone who is fluent in Spark, pandas, and several SQL "dialects", at the end of the day I have to say I think all the query paradigms suck, SQL is just the best we have.
I find its "limitations" generally enforce structure that makes rewriting and distilling bad queries easier. Well-written and well-formatted queries can be extremely readable and easy to navigate/grep in a way that I have never found the others to be.
As for the flexibility of dropping in and out of a full-fledged programming language and a query API, I find it tends to grow unnecessarily complex in the hands of many practitioners.
That said, my preference is a SQL cursor or equivalent interface in my chosen language. For some reason I find the strict separation between SQL (the declarative expression of business logic and relations) and the language (imperative or functional control) very helpful.
SQL also has the benefit of portability. Almost every query/computation engine supports SQL these days. While you will obviously need to rewrite the parts that rely on vendor-specific extensions, the migration path is greatly simplified.
The one thing I think pandas has going for it that I desperately wish were picked up as a new standard in SQL is its aggregations for seamlessly moving between different time-series frequencies. As problematic as pandas is, I have yet to find anything else that makes time-frequency conversions as convenient and predictable.
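For anyone who hasn't used it, the pandas feature I mean is resample(); a toy sketch:

    import pandas as pd

    # a series sampled every 15 minutes
    idx = pd.date_range("2022-01-01", periods=96, freq="15T")
    s = pd.Series(range(96), index=idx)

    daily = s.resample("1D").mean()        # down-sample to daily means
    hourly = daily.resample("1H").ffill()  # up-sample back to hourly, forward-filling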
> DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future
DuckDB is a convenient drop-in for OLAP in the way that SQLite is for OLTP, it's a great library. Python's death on the vine as a data analytics platform cannot come fast enough. There's a reason people are abandoning pandas and spark/databricks in droves and fleeing to useful SQL-like tooling such as Snowflake. Once this tech bubble bursts, and the tens-of-thousands of subsidized Python-packing data scientists/engineers/analysts get laid off, the language will return to what it should be: another scripting language that generates SQL for talking to a real data engine.
First of all, thanks for sharing this OP! So glad to see a way to query a df using SQL without further transformation.
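For anyone who hasn't seen it, DuckDB can scan a pandas DataFrame in the local scope by name, roughly like this:

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # DuckDB resolves the table name `df` to the local DataFrame (replacement scan)
    result = duckdb.query("SELECT b, sum(a) AS total FROM df GROUP BY b").to_df()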
Arrow has been truly revolutionary in this regard, providing a solid in-memory data format (with performant APIs in many languages) for interchange between different engines and even formats.
You can go from ORC to Parquet to CSV on a local FS or S3.
With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost. Now if only someone would integrate vaex with DuckDB, it would make powerful Apple Silicon machines a compelling alternative to running a full-fledged Spark/Hadoop cluster.
> With DuckDB, it’s like you can build your own AWS Athena at likely a fraction of the cost
Isn't the whole purpose of Athena to scale to large amounts of data that don't fit into memory? How does duckdb fit in here? I thought it's an in-memory database?
DuckDB is not only an in-memory DB - it can persist data and handle larger-than-RAM workloads. We are still expanding the number of operators that can handle larger-than-memory data, but it is a key design goal!
Yep, dask_sql is probably a better comparison here (and similarly Arrow-friendly). We've been having pretty good experiences with it, though it's clearly young. For in-process work we're sticking with pandas/cudf (GPU dataframes), and likewise still Arrow for IO interop, so hand-off is easy.
I find the posts are usually better quality than the standard HN submission, and even more so, the HN comments on them are better. A complete generalisation, but those replying to data-tools posts are more thoughtful in their comments and exhibit less absolutism than I see on software-development-related posts.
Yes, but then I have to:
1. use Python,
2. deal with the nightmare that is Python packaging, and
3. hope I don't run out of memory while Python does its thing.
Getting non-technical people to do all that (a project manager wanting to see some simple stats for example) becomes difficult. Having a single binary that's easier to distribute, faster to execute and can be plugged into an existing query tool (like DataGrip) is a huge win for actual usability.
Can someone explain what the purpose of DuckDB is? From what I gather from the docs, it's a column-oriented in-memory SQL database? So I would use it over something like in-memory SQLite simply because it's faster due to vectorization? Or are there other aspects that I'm missing?
Like SQLite, it's not necessarily in-memory, it has an on-disk format as well. I haven't given it a massive workout for bigger-than-memory workloads but in theory they are possible.
With that said, the resident memory size of stuff-in-parquet/arrow/duckdb is a lot less in practice than stuff-in-pandas (I know) and stuff-in-sqlite (I believe), so it still enables more in-memory workloads than you might otherwise be able to do.
I really like that you can use DuckDB to SQL-query an on-disk directory structure of Parquet files with no preloading into RAM or other DB formats. Super useful/quick and only one line of code! Instead of selecting from a table you just pass a glob pattern into the SQL - and since it's a one-liner I can use it in ad-hoc notebooks too. It even has a to_df() method on the query result so you can get it into pandas for further manipulation.
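A minimal sketch of what that looks like (paths and column names are placeholders):

    import duckdb

    # query a whole directory of Parquet files in place, no preloading
    df = duckdb.query(
        "SELECT user_id, count(*) AS n FROM 'data/2022/*.parquet' GROUP BY user_id"
    ).to_df()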
As you noted, it's column-oriented, so designed for OLAP workloads rather than OLTP workloads. Think data warehousing, BI, Big Data analytics, feature engineering ...
But isn't it in-memory? In my experience, data warehousing, BI, analytics, etc are usually defined by large data in data lake or warehouse that doesn't fit into memory. How would duckdb help here?
DuckDB can certainly handle larger than memory data! It's a streaming engine, so it doesn't need everything in memory all at once. Some operators (like sorting) can also buffer out to disk if you run low on memory mid-computation. We are working to expand that to more operators! It also can selectively read from parquet for only the columns and row groups that you need. (Disclaimer - I do documentation for DuckDB!)
When pandas can’t manage a data set, I would convert to hdf5 and use the vaex pandas-like API. This is very fast and memory efficient (out of core) for my typical 40-70GB data set size and reportedly remains fast up to 1 TB.
The key ingredient is the on-disk format: hdf5 and more recently also arrow.
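The workflow is roughly this (a sketch; the paths and the existing pandas frame pdf are placeholders, and vaex's API may differ by version):

    import vaex

    # one-time conversion: pandas DataFrame -> hdf5 on disk
    vaex.from_pandas(pdf).export_hdf5("data.hdf5")

    # afterwards vaex memory-maps the file and works out of core
    df = vaex.open("data.hdf5")
    print(len(df))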
Looks like DuckDb would force me to use SQL, which is not what I typically want. That being said, if this offered better data management than a binary blob like hdf5, it might be worthwhile. If the use case requires processing historical data for example, a real DB would be better than juggling 10 x 40GB files or having one big 400GB file that keeps growing.
Plus, if you are working in Python, you can use DuckDB as the engine underneath Ibis, Fugue, Siuba, or anything that works with SQLAlchemy (using the DuckDB-engine driver)! In R, you can use dplyr or dbplyr.
DuckDB's file format is one way to persist data (it uses a single file), but you can also write out to Parquet, or write out to Apache Arrow and then parquet (in a partitioned format I believe).
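For the Parquet route, DuckDB's COPY statement does it directly; a rough example (my_table is hypothetical):

    import duckdb

    con = duckdb.connect("analysis.duckdb")  # persistent single-file database

    # export a table (or any query result) as Parquet
    con.execute("COPY my_table TO 'my_table.parquet' (FORMAT PARQUET)")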
I've been making these comparisons between the new dataframe libraries in python so a couple of comments:
"Reading the full CSV without datetime parsing is in line in terms of speed though."
This sentence was a bit ambiguous, but is important: if you read this file in pandas with engine='pyarrow' but don't convert the date/time column to a pandas datetime dtype you get the same ~100 ms read time as calling PyArrow directly. So basically the entire time is spent converting the strings to dates. If this datetime thing isn't an issue for you then you can just use the engine='pyarrow' argument to read CSVs with Pandas.
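Concretely, the fast path looks something like this (file and column names are placeholders):

    import pandas as pd

    # multi-threaded Arrow CSV reader; the date column stays as strings
    df = pd.read_csv("big.csv", engine="pyarrow")

    # the expensive part: converting strings to a datetime dtype
    df["date"] = pd.to_datetime(df["date"])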
In my own tests with various datasets Polars has always been much faster than duckdb/pyarrow. For this relatively small dataset it's about 2x faster, which is about the smallest margin I've found. Polars is also much easier to write, as its query optimization is so effective - you don't need to know all the tricks that pandas requires (and it's still 3x-10x faster than pandas even when you apply all the tricks).
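As a rough illustration of the lazy API (file and column names are made up, and details vary with Polars version):

    import polars as pl

    # nothing is read until .collect(); the optimizer prunes columns and pushes filters down
    result = (
        pl.scan_csv("big.csv")
        .filter(pl.col("value") > 0)
        .group_by("category")
        .agg(pl.col("value").mean())
        .collect()
    )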
I've been thinking about this lately at work, where we do a bunch of stuff with a pile of python scripts built around csv files and pandas.
This post prompted me to give pyarrow a spin and WOW did it load the CSVs fast compared to what I'm used to. That said, I'm not entirely sure what to do with them from there. When I tried table["col"] it gave back a list of lists (maybe because of the multi-threaded loading?). And while the article talked about "querying" the arrow files with duckdb, how easy is it to work with them otherwise, in terms of updating them? And what about non-SQL style updates? Sometimes you just have to do an itertuples or apply and do random non-declarative modifications to the data.
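For reference, I think what came back is a pyarrow ChunkedArray rather than a plain list (one chunk per read batch), and something like this seems to flatten it (a sketch):

    import pyarrow.csv as pv

    table = pv.read_csv("data.csv")

    col = table["col"]            # a ChunkedArray, hence the list-of-lists look
    flat = col.combine_chunks()   # merge the chunks into a single Array
    series = col.to_pandas()      # or hand it straight to pandas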
This is mildly off topic, but I am very unhappy with Pandas. Every single API feels bolted on without any consideration of composability or ergonomics. After spending 4 years with a much better proprietary library I cannot deal with arbitrary functions I have to learn like “value_counts” or whatever the output of a “groupby” is.
My go-to these days is Polars. You get good performance since it uses Arrow under the hood. Couple that with built-in lazy evaluation and its API design and it's pretty good for me. There are some caveats you need to be aware of, though: it doesn't always work as a drop-in replacement for pandas.
I tried Polars, but feel like it's not ready for production yet.
It's fast! The API is nice. But the documentation is not great. With pandas I can just open the docs and find all the functions and examples of how to use them, etc. Polars felt really bare.
It's something I'd like to use in personal projects. But other data scientists would not accept using polars as it is today.
Author of Polars here. What would you say the docs are lacking? Quite curious how we should fill the gap. I understand the gap is large due to pandas's 12 years of being the standard for Python; there is a ton of material for pandas.
We hope to have such a clear API that once the expression language clicks, you need the docs open less, as it should feel natural.
Oh hey! I should know better than to take a potshot at software in hackernews...
I used it a month or two ago, so I don't remember specifics, but:
The docs weren't unclear, they were just bare compared to pandas.
Once I found an example of how to do something, I managed to do it. But a couple of times I opened a page and there was a WIP message.
There's definitely a bit of unfairness in the comparison, as pandas, as you say, has been the standard for Python for years. But that means I already know all the syntax and how the API works, so even just a method name or required params will get me a long way.
When I was fumbling around with polars I would have liked more examples of how to use each method/function. I mean, once the language clicks I wouldn't need it anymore. But at the start I really need that hand-holding. I'd have to reread other pages to remember simple syntax.
Hi @ritchie46 - I have just written [raku Dan](https://github.com/p6steve/raku-Dan) as a way to scratch the "data analytics" itch in a new way -- my next step is to write Dan::Polars as a polars binding via (eg.) raku NativeCall. Can you point me to a good recipe for success? ~p6steve@furnival.net
Pandas is an absolute horrorshow: poor performance, inconsistent API, terrible implicit behavior leading to footguns.
And everyone uses it because it's what you do when your boss tells you "we're transforming the analytics team, you're all to become data scientists because everyone has data scientists now". You just grab whatever had the biggest mindshare on SO and in random yt tutorials. Can't blame them.
But hooo boy does pd get on my nerves.
Care to share what proprietary stuff you were using?
I can't really share in any detail, I think, but the best part was that "Series" were immutable and had sorted keys (indexes). Essentially they were (math) functions, so "indexes" had unique elements. All the important bits had fast numpy/Cython implementations, but the semantics were good because of unique keys.
Honestly I still feel like I'm missing some sort of larger story about the semantics of Pandas (like the "functions" explanation above), so if anyone knows of anything that made Pandas click, please let me know.
I had time to read through this, thank you for linking it to me. I like method chaining, and appreciate the terseness of the syntax. I don't think it addresses my main complaint (and others' complaints below) about Pandas. I'll write a lengthy comment here in the hopes that a Pandas/Polars maintainer sees it and gets some inspiration.
One big issue I see with Pandas is that it assumes that I want multidimensional keys represented via axes labels on some sort of grid (or hypercube). This is evident from the __getitem__ interface, which is frankly confusing. In reality, selection is the most important detail, and managing columns is just a nice-to-have. Pandas's MVP is the Series object, not DataFrame.
In article #1 in your link, "SettingWithCopy" is emblematic of the issue that happens when you allow mutability, so mutability delenda est. This is why FridgeSeal is confused below. One day I'll write a blog post about ways Pandas could be better, but in the meantime I will probably monkey patch a lot of the nonsense out.
And the view-vs-copy shenanigans - a bunch of the APIs will print warnings about using the wrong one and direct you to some docs, but they don't actually tell you whether you've just viewed or copied, and the docs don't make it particularly clear which API does which either.