It's disappointing that most data analysis/ETL tutorials on the internet have converged on the same CSV/pandas tactics over the past decade, when newer tools demonstrated here, such as DuckDB and Arrow, offer practical advantages without much added code complexity.
The post also links to another discussion of the Parquet data format (https://pythonspeed.com/articles/pandas-read-csv-fast/), which Arrow supports as well. It's extremely useful, but I never see anyone talking about it. Granted, Parquet data can't natively be imported into Excel, which is likely the main reason.
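For anyone who hasn't tried it, the switch barely changes your code. A minimal sketch with pandas (file and column names are made up):

```python
import pandas as pd

# Hypothetical file names; swap in your own data.
df = pd.read_csv("events.csv")        # schemaless, re-parsed on every read
df.to_parquet("events.parquet")       # typed, compressed, columnar (via pyarrow)

# Reading back only the columns you need is where Parquet shines.
subset = pd.read_parquet("events.parquet", columns=["user_id", "ts"])
```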
I would say most of the tooling around Parquet sucks. A lot of simple use cases (head, sample, merge, etc.) are locked behind rather heavyweight ecosystems (Hadoop, Spark, or other JVM dependencies). Newer tools are starting to appear that improve the situation.
Is Parquet better/faster/stronger than keeping everything in schemaless CSVs? 100%, but historically it has meant making trade-offs to suit the tooling rather than how I want to interactively approach my data.
Most people do not have "big data" problems where the performance difference between Parquet and CSV matters.
Shameless plug: my project exposes the common Parquet operations through a Rust CLI tool that uses the Rust API for Apache Arrow and works without any Java/Hadoop/Spark dependencies. It's also available as a static binary.
That looks great! Exactly the kind of tooling I need to feel more comfortable locking human-readable data into a binary blob. Being a single Rust binary is just the cherry on top.
I might submit a few feature requests, but one that immediately comes to mind: csv -> parquet. Perhaps out of scope for the original vision, but having a single utility that could roundtrip data would be fantastically useful.
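In the meantime, a rough sketch of that roundtrip with plain pyarrow (no JVM involved; file names are hypothetical):

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# CSV -> Parquet
table = pv.read_csv("data.csv")
pq.write_table(table, "data.parquet")

# Parquet -> CSV (roundtrip back to something Excel can open)
pv.write_csv(pq.read_table("data.parquet"), "data_roundtrip.csv")
```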
I hope that DuckDB can help here! You can use the CLI or Python client to read a parquet file, but my favorite for quick exploration is popping open DBeaver (a SQL IDE) and using the DuckDB JDBC driver. That way you can see some column metadata, and even open multiple files, etc. I wrote a short guide for that here:
https://duckdb.org/docs/guides/sql_editors/dbeaver
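For quick exploration from Python, something like this is all it takes (file names are made up):

```python
import duckdb

con = duckdb.connect()

# Query a Parquet file directly; no import step needed.
print(con.sql("DESCRIBE SELECT * FROM 'events.parquet'"))   # column metadata
print(con.sql("SELECT count(*) FROM 'events.parquet'"))

# Glob patterns let you treat multiple files as one table.
con.sql("SELECT * FROM 'data/*.parquet' LIMIT 5").show()
```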
These aren’t really new though. These are simply bringing old-school reliable SQL directly into Python. Hopefully Python “data” people will stop using godforsaken tools like pandas or spark and finally embrace what the rest of us have been using for decades.
Are you kidding?! SQL is an ancient mess with different dialects and idiosyncrasies. What bringing a query engine into Python gives is a functional query DSL. DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future: you can give names and reuse subqueries, you can programmatically generate query trees, you get all of the power of a real programming language for analytics, and it all gets boiled down to fast implementations of relational operators.
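A rough sketch of what that looks like with DuckDB's Python relational API (file and column names are made up):

```python
import duckdb

con = duckdb.connect()
events = con.read_parquet("events.parquet")   # a named, reusable relation

# Build the plan from composable nodes instead of one SQL string;
# intermediate relations can be named and reused in later queries.
top_users = (
    events
    .filter("status = 'ok'")
    .aggregate("user_id, count(*) AS n", "user_id")
    .order("n DESC")
    .limit(10)
)
top_users.show()
```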
As someone who is fluent in Spark, pandas, and several SQL "dialects", at the end of the day I have to say I think all the query paradigms suck, SQL is just the best we have.
I find its "limitations" generally enforce structure that makes rewriting and distilling bad queries easier. Well written and formatted queries can be extremely readable and easy to navigate/grep in a way that I have never found the others to be.
As for the flexibility of dropping in and out of a full-fledged programming language and a query API, I find it tends to grow unnecessarily complex in the hands of many practitioners.
My preference, though, is a SQL cursor or equivalent interface in my chosen language. For some reason I find the strict separation between SQL (the declarative expression of business logic and relations) and the host language (imperative or functional control) very helpful.
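A minimal sketch of that separation using the stdlib sqlite3 module (toy schema, obviously):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 12.0)])

# The relational logic stays in SQL; the imperative control stays in Python.
cur = con.execute("SELECT id, amount FROM orders WHERE amount > ?", (10,))
for row in cur:
    print(row)
```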
SQL also has the benefit of portability. Almost every query engine supports SQL these days. While you will obviously need to rewrite the parts that rely on vendor-specific extensions, the migration path is greatly simplified.
The one thing pandas has going for it that I desperately wish SQL would pick up as a new standard is its aggregations for seamlessly moving between different time series frequencies. As problematic as pandas is, I have yet to find anything else that makes frequency conversions as convenient and predictable.
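For reference, the pandas pattern being described, with a made-up toy series:

```python
import pandas as pd

ts = pd.DataFrame(
    {"price": range(10)},
    index=pd.date_range("2023-01-01", periods=10, freq="h"),
)

# Downsample hourly -> daily with an explicit aggregation...
daily = ts.resample("D").mean()

# ...or upsample daily -> hourly with an explicit fill rule.
hourly = daily.resample("h").ffill()
```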
> DuckDB’s non-SQL interface, where you build up a query plan from nodes in Python is the future
DuckDB is a convenient drop-in for OLAP in the way that SQLite is for OLTP, it's a great library. Python's death on the vine as a data analytics platform cannot come fast enough. There's a reason people are abandoning pandas and spark/databricks in droves and fleeing to useful SQL-like tooling such as Snowflake. Once this tech bubble bursts, and the tens-of-thousands of subsidized Python-packing data scientists/engineers/analysts get laid off, the language will return to what it should be: another scripting language that generates SQL for talking to a real data engine.