I was a fan of dbt for a while, but the shine wore off when I saw one of my smartest coworkers try to use it. His development speed was about an order of magnitude slower than what I expect from data scientists using dataframe packages like R's dplyr, Python's polars, or Spark DataFrames. In my own experience, dbt is significantly better than writing raw SQL, but still nowhere near a normal software development experience. It needs an IDE badly, but the current IDE is cloud-only.
I am currently building my analytics pipelines on top of R's dbplyr package, which allows me to run dplyr operations on database tables just as if they were local data frames. Since it's R, it's not without warts, but at least I get to use a real programming language and the free and fantastic RStudio IDE.
> allows me to run dplyr operations on database tables
Which requires that the data take a round trip between the R process and the database. It's not uncommon for these jobs to spend more time in the read/write step than doing meaningful work, which is why I prefer dbt when possible: it keeps transformations as close to the data as possible.
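To make the round-trip cost concrete, here's a minimal sketch using Python's stdlib sqlite3 (table and column names are invented for illustration): the first version pulls every row into the client process and aggregates there; the second pushes the aggregation down so only the summarized result leaves the database.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 10.0), (1, 5.0), (2, 7.5)])

# Round trip: every row crosses into the Python process first.
rows = con.execute("SELECT user_id, amount FROM events").fetchall()
totals = {}
for user_id, amount in rows:
    totals[user_id] = totals.get(user_id, 0.0) + amount

# Pushed down: the database does the work, only the aggregate moves.
pushed = dict(con.execute(
    "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"))

assert totals == pushed  # same answer, far less data moved
```

With three rows the difference is invisible; with billions, the read/write step in the first version dominates the job.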
The second point the grandparent made was that they were using dbplyr, which allows you to avoid having the data take a round trip between the R process and the database.
The dbplyr package translates your R code into SQL and executes it in your database, returning a remote dataframe. This is quite useful because no data is moved back and forth.
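The mechanism is roughly: dataframe-style verbs accumulate into a query, and nothing executes until you ask for results. Here's a toy Python sketch of that idea (the `RemoteTable` class and its methods are invented for illustration; they mimic dbplyr's `filter()`, `show_query()`, and `collect()`):

```python
import sqlite3

class RemoteTable:
    """Toy lazy 'remote dataframe': verbs build SQL, collect() runs it."""
    def __init__(self, con, table):
        self.con, self.table, self.where = con, table, []

    def filter(self, cond):        # analogous to dplyr::filter()
        self.where.append(cond)
        return self

    def show_query(self):          # inspect the generated SQL
        sql = f"SELECT * FROM {self.table}"
        if self.where:
            sql += " WHERE " + " AND ".join(self.where)
        return sql

    def collect(self):             # only now do rows cross the wire
        return self.con.execute(self.show_query()).fetchall()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

q = RemoteTable(con, "t").filter("x > 1")
print(q.show_query())   # SELECT * FROM t WHERE x > 1
print(q.collect())      # [(2,), (3,)]
```

The real dbplyr does far more (joins, window functions, backend-specific SQL dialects), but the lazy-translation shape is the same.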
Hmm, maybe I had network issues last time I checked. Anyway, that's the only place you can get it — it's a completely free desktop app, but it's not open source.
The difference is that raw data can come in many shapes (e.g. impossibly nested JSON) and unexpected quality (changing field names...). I cannot easily work with this in SQL, and definitely cannot write tests to cover the ever-increasing complexity of dealing with messy data.
After that step the data is a lot more uniform, so then it's easy to use SQL.
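For instance, the pre-SQL cleanup step might look something like this minimal sketch using Python's stdlib `json` (the field names are invented for illustration): nested structures get flattened into one level so they load cleanly into a table, after which SQL is the right tool.

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts into dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

raw = json.loads('{"user": {"id": 1, "address": {"city": "Oslo"}}, "ok": true}')
print(flatten(raw))
# {'user.id': 1, 'user.address.city': 'Oslo', 'ok': True}
```

And unlike ad-hoc SQL string manipulation, a function like this is trivial to unit-test against each new shape of messy input as it shows up.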
Dbt isn't a substitute for pandas/R to a data scientist. They are complements. Dbt is for the data transformation pipeline that prepares the data so the DS can write simple queries using their favourite tools on curated data.
The core divide in this space is SQL vs. $OTHER_LANGUAGE. If you're fluent in SQL you'll be great at dbt. If you're only fluent in not-SQL you won't.
A lot of these discussions are basically "can we please just use the language/tooling we know", and sometimes the answer's yes! A lot of people breathe a big sigh of relief when they discover you can use like, Spark through Python and what-not. Regardless of whether or not it's a good idea or good use of resources, this space is advancing because the market demands it. But like, I don't think you're ever gonna use dbt with not-SQL; it's all in on SQL. Feels like maybe it was the wrong fit for your team.
I should say I'm not sure if your coworker was an SQL person so dunno if this directly applies. Just saying what my experience has been.