Is the "modern data stack" still a useful idea? (getdbt.com)
129 points by tim_sw on Feb 11, 2024 | hide | past | favorite | 128 comments


I'm a newly minted head of analytics who transitioned from a different domain, so I never had to muck my way through the MDS but attentively watched others from the sidelines over the last few years. Best I can tell, "the modern data stack" is just a marketing phrase invented by a cadre of vampire vendors. The lessons I learned watching others translated into a few simple requirements for our nascent "stack" that most importantly include transparent pricing I can reason about and divvy up, as many integrations as possible so I can minimize rolling my own, and a straightforward framework for ETL code. These three requirements plainly disqualify most of the MDS universe.

With the benefit of starting basically from scratch and not having to mess around with real-time analytics, it's pretty easy to ignore the MDS vendors. So far I've landed on BigQuery, AirByte, GitHub, BI Engine, Looker Studio, and Pandas 2.x or DuckDB for local stuff. I send as many things as possible straight to BQ, lock junior analysts out of gigantic tables, archive periodically to partitioned parquet files in cold storage, use mostly turnkey integrations, and ruthlessly prioritize custom ETL jobs. Putting GitHub in the mix isn't super ergonomic and we may be in the market for new tools once we cross the "big data" frontier, but that'll be a while from now. I'll probably never know or care what the MDS vendors think I'm missing.


You're on the right track with pricing. What I find ubiquitous among MDS people is a lack of old-school ideas like benchmarking and price-performance. Cloud salespeople talk endlessly about your company becoming data-driven until you bring up cost data.


Fellow head of data here. For a while, my team ran on a shared file server we mapped as a local drive, a team server in the back closet for tasks too large for our laptops, a MySQL database, an SVN server, and a MediaWiki server for documentation. We're upgrading a few of those items, but it's amazing how cheaply you can run a machine learning team if you only have a few TB of structured data to analyze.


AirByte was definitely one of the MDS vendors.

They literally got a huge investment because their deck bragged about having a few thousand Github stars and a few thousand folks in a Slack channel.


I can't tell if this is envy or scorn.


I don't think it's either.

Modern Data Stack is a set of characteristics - cloud based, SaaS, consumption based pricing, ELT bias, open source bias, component based etc.

AirByte is all of them. It's just a fact rather than a criticism.


I think this is fair and I don’t doubt that AirByte jumped on the MDS marketing bandwagon, but the fact that their pricing isn’t inscrutable sets them apart in my view.


Neither, just an interesting factoid that highlights how zany the VC landscape in the MDS space was only a few years ago.

https://airbyte.com/blog/the-deck-we-used-to-raise-our-150m-...


Github stars used to be a signal of quality, and velocity of those stars indicated a winning product. But like all rating systems, they are for sale to the nearest Mechanical Turk or GenAI.


Airbyte is definitely one of the MDS vendors. Plus they have a ton of bugs because their only focus is having the most connectors on the market. A lot of them are broken or badly implemented.


I don't doubt the bugs. In fact, I expect them because so many of the connectors are community contributed, so I try to think of it as more like GitHub than a curated set of expertly-built connectors. That said, for the time being it's more important that I not break my other rules. The promise of a perfectly working connector (if such a thing exists) in exchange for unknowable pricing or an SDK I'm gonna have a hard time training people on is a tradeoff I feel I can't reasonably make right now, but I'm very open to the idea that my calculus may change.


Sorry nothing positive to say here.

I’ve been using dbt and MDS for nearly 3.5 years and I believe the entire approach is profoundly broken. There’s really nothing “modern” about it, especially compared to software engineering.

https://on-systems.tech/blog/135-draining-the-data-swamp/

Building on extracted operational data is hard at best and a business-altering security liability at worst.

I believe the MDS trails at least a decade behind modern software engineering practices, lacking industry guidance and generic tooling to support: CI/CD, versioned deployment artifacts, zero downtime deployments, unit testing, observability, monitoring and alerting.

MDS data engineering is a meme in the industry; the expectation of "100%" combined with the lack of modern tooling makes success really hard to achieve.


The mistake everyone makes is treating the entire space as if it's somehow different from the rest of software engineering, when it is exactly the same.


Well… no, I would disagree. I am a data engineer who relentlessly pushes for more classic SWE best practices in our engineering teams, but there are some fundamental differences in data engineering that you cannot ignore, differences that are not present in standard software engineering.

The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, and skew of your data can change independently of any code. You can have the best unit/integration tests and airtight CI/CD, but it doesn’t matter if your data source suddenly starts populating a column you were not expecting, or the kurtosis of your data changes and now your once performant jobs are timing out.

Compounding this, there is no one answer for how to handle data changes. Sure you can monitor your schema and grind everything to a halt the second something is different in your source, but then you’re likely running a high false positive rate, and causing backup or data loss issues for downstream consumers.
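One middle ground between halting on every schema change and ignoring drift entirely is a check that warns rather than hard-fails. A minimal Python sketch, where the expected schema, column names, and messages are all invented for illustration:

```python
# Minimal schema-drift check: emit warnings on unexpected changes instead
# of failing the whole pipeline. The expected schema is a plain dict; in
# practice it might live in version control next to the pipeline code.
EXPECTED = {"user_id": "INTEGER", "event_ts": "TEXT", "amount": "REAL"}

def check_schema(actual: dict) -> list:
    """Return a list of human-readable drift warnings (empty = no drift)."""
    warnings = []
    for col in actual.keys() - EXPECTED.keys():
        warnings.append(f"unexpected new column: {col}")
    for col in EXPECTED.keys() - actual.keys():
        warnings.append(f"missing expected column: {col}")
    for col in EXPECTED.keys() & actual.keys():
        if actual[col] != EXPECTED[col]:
            warnings.append(f"type change on {col}: {EXPECTED[col]} -> {actual[col]}")
    return warnings

# A source that silently started populating a new column:
drift = check_schema({"user_id": "INTEGER", "event_ts": "TEXT",
                      "amount": "REAL", "discount_code": "TEXT"})
```

Whether a warning pages someone or just lands on a dashboard is exactly the false-positive tradeoff described above.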

The truth is, change management and scaling are more or less predictable, solved problems in traditional SWE. But the same principles can’t be applied with the same effectiveness to DE. I have seen a number of SWEs-turned-DEs struggle precisely because they do not believe there to be any difference between the two. It would be nice if it were so simple.


You are massively oversimplifying "classic SWE" here. Ultimately it makes no difference.

How you handle contract violations between dependencies and third parties is not an inherently different problem. What should you do if a data source changes schema? How do you detect it, alert, take action? How are users or customers or consumer jobs impacted? This is all just normal software.

In many ways it feels like you're making my case for me.


Whenever I get confronted with SWE best practices it is all about version control, unit tests, integration tests and CI/CD.

What are some good resources for the other things you describe?


I don’t have great links handy, but read up on observability.

You’ll want to read up on what people do on error, especially in the case that contracts change (search “api versioning”.)


Thank you!

It may be that those topics are part of the SWE process but not generally applicable to, for example, UI work, and are therefore overlooked.

But it may also be something else. A person can create Grafana dashboards and processes for API updates like a pro.

But only the person that really cares about the data and knows it inside out can catch the "unknown unknowns". And for that you need to give access (and hand over responsibilities) to analysts too, and in some ways trust them too, even though they may not have processes as rigorous.


>The most significant of these is that version control is not guaranteed to be the primary gate to changes to your system’s state. The size, shape, volume, skew of your data can change independently of any code.

That has been true in SWE since the days of ENIAC.


Machine learning people have cottoned on to this, so MLOps was invented.


The training part of MLOps is important to ensure replicability and other desirable properties of the ML artifact, but the rest is clearly just good CI/CD and observability.


We had an application engineer come in and run an ML project.

The first thing he did was remove real data from training, as he was shocked to see that development wasn’t done against dummy data.

And once the dev environment is running on real data, the changes in how you develop and operationalize just seems to cascade in my experience.


I am slightly confused by your reply. Can you elaborate a bit, was it good or bad?


You can develop a database schema for your application on dummy data, and test your business logic on dummy data. But you can't develop a machine learning model on dummy data.

It is best practice to keep live data out of your development environment for normal software development, but it is impossible for machine learning projects.


Yep. I advise my clients to have a controlled, sandboxed partition of prod. Call it the whiteboard or what have you; things can be erased, or you can cache model version runs. Your data transforms can happen in dev, but training should happen against real data. Synthetic data can be a compromise, but it can sharply bias your model and, since it's a harder lift, often multiplies data prep time.


Yes, this is what I thought and exactly what I am experiencing right now.

It is somewhat frustrating to see the engineering/platform team spending time on this kind of abstraction that is essentially useless for doing the work we need to do.


This. Nearly everyone is confused.


I was a fan of dbt for a while, but the shine wore off when I saw one of my smartest coworkers try to use it. His development speed was about an order of magnitude slower than what I expect from data scientists using dataframe packages like R's dplyr, Python's polars, or Spark DataFrames. In my own experience, dbt is significantly better than writing raw SQL, but still nowhere near a normal software development experience. It needs an IDE badly, but the current IDE is cloud-only.

I am currently building my analytics pipelines on top of R's dbplyr package, which allows me to run dplyr operations on database tables just as if they were local data frames. Since it's R, it's not without warts, but at least I get to use a real programming language and the free and fantastic RStudio IDE.


> allows me to run dplyr operations on database tables

Which requires that the data take a round trip between the R process and the database. It's not uncommon for these jobs to spend more time in the read/write step than doing meaningful work, which is why I prefer dbt when possible, to keep transformations as close to the data as possible.


The second point the grandparent made was that they were using dbplyr, which allows you to avoid having the data take a round trip between the R process and the database.


The dbplyr package takes your R code and executes it in your database, returning a remote data frame. This is quite useful because no data moves back and forth.


We built the perfect IDE for dbt at Deep Channel, with autocomplete and automatic real-time type checking for the whole project.

Sadly, the company closed this June, and even the website is already down.


You need to find a way to release this product. I would pay to use it.


It's already been released and publicly available for free for almost a year.


So where is it, if "even the website is already down"?

ETA: Seems it isn't. https://www.deepchannel.com/ That what you meant?


Hmm, maybe I had network issues last time I checked. Anyway, that's the only place you can get it — it's a completely free desktop app, but it's not open source.


This is extremely interesting. You might be onto something.


IMO dbt sits in the analyst space. SQL is the language they know, and dbt is the best solution to scale that.

In that sense it should only be used for last mile data transformations. Not for wrangling raw data into neat tables.


I am curious, what do you perceive as the difference between data transformations and "wrangling raw data into tables"?

Edit: I am also curious why you consider SQL to be inferior for the latter?


The difference is that raw data can come in many shapes (e.g. impossibly nested JSON) and unexpected quality (changing field names...). I cannot easily work with this in SQL, and I definitely can't write tests to cover the ever-increasing complexity of dealing with messy data.

After that step the data is a lot more uniform, so then it's easy to use SQL.


Dbt isn't a substitute for pandas/R to a data scientist. They are complements. Dbt is for the data transformation pipeline that prepares the data so the DS can write simple queries using their favourite tools on curated data.


The core divide in this space is SQL vs. $OTHER_LANGUAGE. If you're fluent in SQL you'll be great at dbt. If you're only fluent in not-SQL you won't.

A lot of these discussions are basically "can we please just use the language/tooling we know", and sometimes the answer's yes! A lot of people breathe a big sigh of relief when they discover you can use like, Spark through Python and what-not. Regardless of whether or not it's a good idea or good use of resources, this space is advancing because the market demands it. But like, I don't think you're ever gonna use dbt with not-SQL; it's all in on SQL. Feels like maybe it was the wrong fit for your team.

I should say I'm not sure if your coworker was an SQL person so dunno if this directly applies. Just saying what my experience has been.


Y’all don’t have CI/CD? Maybe it’s hyped, but we do simple stuff: a Snowflake schema DWH, Jinja2-templated SQL or dbt, Airflow with a monolith tool that does transforms and such. Is it perfect? No. Is it understandable? Very much so.

Testing is such an interesting concept in data engineering. One needs consistent test data. We aim to implement that with snapshots eventually, but for now we have sanity checks at each layer.

> lacking industry guidance and generic tooling to support: CI/CD, versioned deployment artifacts, zero downtime deployments, unit testing, observability, monitoring and alerting.

CI/CD is doable and easy: changes are deployed via GitHub Actions (the CI part is a bit missing, I guess, without tests). Versioned artifacts also: add a tag to the Airflow job to know what tagged version of the code is running. Zero-downtime deployments we do all day — I mean, using views we can a/b deploy changes to the underlying tables and then do a simple schema change to the view, and nobody knows the difference. Unit testing is still to be done. For observability and alerting we use internal dashboards and Sentry.
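That view-based a/b deploy can be sketched with Python's sqlite3 module (the table and view names are invented; a real warehouse would use its own DDL, but the mechanics are the same):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# v1 of the underlying table, exposed to consumers only through a view.
cur.execute("CREATE TABLE orders_v1 (id INTEGER, total REAL)")
cur.execute("INSERT INTO orders_v1 VALUES (1, 9.99)")
cur.execute("CREATE VIEW orders AS SELECT * FROM orders_v1")

# Build v2 alongside v1 while consumers keep reading the view.
cur.execute("CREATE TABLE orders_v2 (id INTEGER, total REAL, currency TEXT)")
cur.execute("INSERT INTO orders_v2 VALUES (1, 9.99, 'USD')")

# The 'deploy' is a single cheap schema change: repoint the view.
cur.execute("DROP VIEW orders")
cur.execute("CREATE VIEW orders AS SELECT * FROM orders_v2")

row = cur.execute("SELECT id, total, currency FROM orders").fetchone()
```

Consumers query `orders` throughout; the swap itself is near-instant.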

What am I missing?


FWIW, we are building a CI/CD solution for Snowflake: https://www.bytebase.com/docs/tutorials/database-change-mana...


Testing and monitoring in every aspect are the only things that are greatly lacking; you always need to glue something together to get nice tests or some kind of data contract monitoring.


Relevance and actual value, I'd assume.


> CI/CD, versioned deployment artifacts, zero downtime deployments, unit testing, observability, monitoring and alerting.

This definitely stood out to me when I was working with a lot of Snowflake’s new products, especially their “native apps”.

I started on some bespoke tooling given the lack of everything that you mentioned, and at the time was thinking about how to productize it / turn it into some open source tools, but I’ve not gotten back around to it since leaving that job.

I had a really great experience working with sqlglot from Tobiko Data https://tobikodata.com/ — haven’t had a chance to check out their sqlmesh project yet but I have some faith in their work.


Yeah, seconding sqlglot; it has saved me _tons_ of time. I think they're smart and onto something here.


A surprising number of companies (even large, well known ones) don't even have basic dashboards and metrics, most of what you mention is irrelevant to most companies.


dbt was a huge leap forward in this regard.

It enables source code control, modularity, testing, CI/CD, environments, documentation, git branching, and a local development experience.

Not a fanboy, but its whole reason for being is basically to enable "software engineering practices for data".


Yeah, that’s exactly what the Data team does here. It was researched, it was tested, it was implemented, it was proven, it gets improved from time to time, it works.

We on the Platform team just help them test or set up some things from time to time, but that workflow they have with dbt works quite well, and all the things people have complained about in these threads we have yet to suffer (at the magnitude people are complaining about, anyway).

Not sure if there is something “better” these days, but these teams have not had a single major incident with their dbt workflow.


Yup, it is not perfect, but I will take dbt every day over the old way of some random DDL files in git.


So basically its huge advantage is that it saves its code as text, which makes it gittable?

(Dang, and here the tool I've been designing in my head for several years was going to claim that niche... So unique. ;-)


Nah, the main thing is the dependency resolution. So you make pipelines like:

- employees

- tall_employees

- tall_employees_by_salary

These tables (dbt would call them models because they can also be views depending on how they're configured) depend on each other, so you have to build them in a specific order. Without something like this, you're manually running SQL to build your pipelines, and that's very error prone. With something like this you can run tests (they're also SQL in dbt, which is pretty common in data engineering teams), run your pipeline in CI, etc.

There's a bunch of other features, but they all hang off this basic idea.
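The dependency resolution described above boils down to a topological sort. A minimal sketch using Python's standard graphlib, with the model names echoing the hypothetical example:

```python
from graphlib import TopologicalSorter

# ref()-style dependencies: model -> the models it selects from.
deps = {
    "employees": set(),
    "tall_employees": {"employees"},
    "tall_employees_by_salary": {"tall_employees"},
}

# static_order() yields a build order that respects every dependency,
# which is what dbt computes before running your SQL models.
order = list(TopologicalSorter(deps).static_order())
```

graphlib also raises on cycles, which is the same class of error dbt surfaces when two models reference each other.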


> you're manually running SQL to build your pipelines

you can have some .sh file which runs the SQL in the desired order.


> runs SQL in desired order

Sure, one at a time. dbt builds a DAG of the SQL operations and can run transformations concurrently.


and why would you need this? If the goal is to reduce e2e latency, then optimizing SQL is likely much more beneficial, since it can improve performance by NNN times, not just by N times.


Well there's 2 main reasons.

The first is that you can update a subset of your pipeline, like say you have an internal dashboard for server stats and another external dashboard that's a product for customers, and you notice a problem w/ the server stats dashboard. You can have dbt just build the models that underlie the server stats dashboard and also exempt any model used by the external dashboard.

The second is that you can effectively defer processing. dbt builds views by default (though it of course will also build tables, including ephemeral tables that are cleaned up after a build), which means your transforms don't take place until you query the data. If you can get away with it, this is what you want, and there are lots of ways to stay on views (caching, use BigQuery, etc). This means your pipeline at least has the potential of being "real time".

But, overall we're talking about pipelines that have 100s or 1000s of models. You need some kind of system for that; a shell script isn't gonna cut it.


You obviously can have modularity with shell scripts:

run_all_etl.sh

build_internal_dashboard_etl.sh (called from run_all_etl.sh)

build_external_dashboard_etl.sh (called from run_all_etl.sh)


No you misunderstand. If you have some kind of dependency tree that's like:

    A        B        C        D
        1        2        3
        i        ii
    Ai     i-ii
...and you want to only rebuild Ai and anything that it might rely on, all you do is `dbt build +Ai`. If you want to rebuild B and update anything that relies on it, all you do is `dbt build B+`

> You obviously can have modularity with shell scripts

Just as kind of a meta point, I think you're assuming that people are introducing a lot of incidental complexity into this problem, but I don't think you're open to the possibility that there's a lot of essential complexity here. I'm talking about DAG tools and the features that hang off of them, and you're saying "hey, there's this new thing called shell scripting". That's not useful in this conversation. Maybe going forward, let's assume people know about Bash.
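For the curious, the `+model` / `model+` selectors amount to a reachability closure over the dependency graph. A toy Python sketch, where the `parents` mapping is a hypothetical fragment rather than dbt's actual internals:

```python
def closure(graph: dict, start: str) -> set:
    """All nodes reachable from start in graph (including start itself)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

# parents: model -> models it depends on (i.e. what "+model" walks).
# Walking a child->parent graph gives ancestors; walking a
# parent->child graph with the same function gives descendants.
parents = {"Ai": {"A", "i"}, "i": {"1"}, "1": {"A", "B"}}
upstream = closure(parents, "Ai")  # everything "+Ai" would select
```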


In my example the ETL is divided into modules with clear scope and dependencies, as I believe should happen per good software practices.

If you have a kitchen sink of 1000 tables chaotically connected to each other, then yes, you need a DAG management tool to manage the chaos.


> If you have kitchen sink of 1000 tables chaotically connected to each other, then yes, you need DAG management tool to manage chaos.

This describes 99% of data teams, so I think we agree!


> This describes 99% of data teams, so I think we agree!

there is a chance that dbt incentivizes data teams to go that route; otherwise they would think about modularity and ownership more.


I'm a Vim user and I've made this point when I rail against IDEs; it's even on the front page today [0]: "5. The belief that complex systems require armies of designers and programmers is wrong. A system that is not understood in its entirety, or at least to significant degree of detail by a single individual, should probably not be built."

I think I agree, but I think there are a lot of things you could say about this. What if you're just starting out and you don't know how to efficiently design systems? What if you work in an industry of people under 40 (under 30?) who by definition can't have the kind of experience that would let you design small systems? What if you work for a company that doesn't want to give you the time to figure it out? What if you have a competitor that's just shipping trash and forcing you out of the market with a "we'll clean up the tech debt later" mentality? How do we make tools that are powerful enough to enable us to do hard things, but wise enough to keep us from doing foolish things? Should we rely on our tools to do this? Is this an argument against centralization? Isn't that deeply inefficient?

Honestly I struggle with this all the time. I've got the outlines of some ideas and there are some tech communities grappling with the same issues, but I think we're just at the very beginning of a rethink about how we do software.

[0]: https://liam-on-linux.dreamwidth.org/88032.html


> How do we make tools that are powerful enough to enable us to do hard things, but wise enough to keep us from doing foolish things? Should we rely on our tools to do this?

IMO: No, we shouldn't. And that's why I'm in the "Think about it yourself, build your ETL with a clean structure so you know what your 1,000 [SQL|shell] scripts do" camp instead of the "Let your tool figure it out for you" one.

> Is this an argument against centralization? Isn't that deeply inefficient?

Or for centralization, if we hark back to the "You can get a loooong way on a single server" argument above. Depends on how you define "centralization", I suppose. (Seems to me everything always depends on your definitions, in the end.)

But wait, does this still have anything to do with IDEs? As opposed to Vim (or, say, Notepad)? If so, I'd say IDEs also belong to the category good centralization. At least older ones, before they became desktoppified Web apps with baked-in AS (Artificial Stupidity)...

> Honestly I struggle with this all the time. I've got the outlines of some ideas and there are some tech communities grappling with the same issues, but I think we're just at the very beginning of a rethink about how we do software.

On that debate I'm not so much conservative as... Reactionary.


How many models are in "run all etl", and what if one fails? You can't see how issues can arise here, with what you describe? I mean, go ahead and write your SQL pipeline in bash. Many of us are using dbt because we see a lot of benefit, we aren't all morons.


> How many models are in "run all etl", and what if one fails?

A hundred tables. One failure -> the whole pipeline fails, to ensure e2e and overall consistency.


> and why would you need this?

So things finish considerably faster?

> then optimizing SQL is likely much more beneficial

This only gets you so far. Running transformations one at a time when the server can do more is just holding you back. My last dbt project had over 300 tables, and running transformations serially would have taken an order of magnitude longer than running concurrently.


> Running transformations one-at-a-time when the server can do more is just holding you back

in general, the end goal of such ETL optimization is to have all your CPU or IO saturated. But I imagine for 300 table transformations that could be too much work.


The goal of ETL is to transform data for practical applications; CPU and memory are there to be utilized.

This particular dbt process was running against the second smallest Redshift instance, which barely noticed when the process kicked off every 10 minutes to run incremental updates. Would run for maybe 60-90 seconds, with only the occasional lock contention with long-running analyst queries that was easily mitigated.


> The goal of ETL is

I said goal of "such ETL optimization", not ETL.

Luckily for me it is an hour and 50 lines (likely fewer) of code to write such a parallel executor with dependency tracking if the need arises.
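For what it's worth, a toy version of such an executor is indeed compact. A sketch with Python's concurrent.futures, where the task names and the no-op `run_task` are stand-ins for real SQL jobs, and the graph is assumed acyclic:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_dag(deps: dict, run_task) -> list:
    """Run tasks concurrently, each starting only after its dependencies
    finish. deps maps task -> set of prerequisites. Returns completion order."""
    done, completed = set(), []
    pending = {task: set(d) for task, d in deps.items()}
    futures = {}
    with ThreadPoolExecutor() as pool:
        while pending or futures:
            # Submit every task whose prerequisites have all completed.
            ready = [t for t, d in pending.items() if d <= done]
            for t in ready:
                futures[pool.submit(run_task, t)] = t
                del pending[t]
            # Block until at least one in-flight task finishes.
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for f in finished:
                t = futures.pop(f)
                f.result()  # re-raise failures -> the whole pipeline fails
                done.add(t)
                completed.append(t)
    return completed

order = run_dag(
    {"raw": set(), "staging": {"raw"}, "mart": {"staging"}},
    run_task=lambda t: None,  # stand-in for executing the task's SQL
)
```

Error handling, retries, logging, and selection syntax are where the real line count goes, of course.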


I had the misfortune of having to reimplement dbt at work (we have a weirdo "don't directly connect to app DBs" rule, so I had to add support for running arbitrary Python to fetch data from our microservices over gRPC), and yeah, the dependency tracking, subsetting, testing, parallelism and such is super easy -- it's more like 1000 lines of Python/SQL for the basics, but really no sweat.

Where we ran into issues was:

- Availability: the general model of dbt is DROP the model (OK by default they TRUNCATE but w/e) and rebuild it. But this means that model's unavailable (read: locked) during rebuilding. Further, you can't do that if any downstream view depends on your model. Even if you decide it's OK to drop the downstream view, rebuild all its upstream dependencies, then rebuild it, it's unavailable the whole time. Further, you probably don't have just one downstream view depending on these upstream models, so you've got to work out all the downstream views to DROP and then rebuild, which is like a secondary dependency tree to walk while walking the primary dependency tree. But, that mind-bending weirdness aside, you're essentially making your whole DB unavailable during a build. Sometimes this is OK, for us it really wasn't.

- Incremental tables added a lot of complexity: rebuilding bigger tables that mostly don't change is a big waste of time and resources. But this is an entirely parallel code path which more or less doubled our code burden.

We tried pretty hard to manage these, which ultimately was a big rabbit hole. We eventually settled on schema versioning where we'd build a whole new DB in a different schema, run tests on it, and promote it if it passed. This let us avoid both the availability and incremental ditches, and as a bonus we could instantly restore past schemas if we discovered issues (restoring DB backups is sloooooooow and lossy).

But, I think the engineering lesson here was we either should have rescinded that weirdo app DB rule, or restricted our custom work to ELTing the data out of our apps and used dbt downstream of that.


This is an old sub-thread now, but I would agree that dbt is a very simple idea and easy to use at its core. Store your models in text files, in git, add a dag. It’s easy to use and very lightweight so I don’t see any reason to roll your own.

Then you get things like testing, documentation, modularity if you want it.


I've worked in data for a while now as a Data Engineer, Data Scientist, and Director, and working with experienced software engineers has highlighted to me that most of the data stack is fluff. All the layers/tools that are heaped upon one another just lead to complexity and dependence on paid services. I'm not advocating for reinventing the wheel, but rather that bespoke solutions seem to be worth the cost. In short, I'd prefer to spend the budget on experienced developers who can build maintainable systems than on a plethora of MDS tools. YMMV, though.


There is no "the" data stack.

Some organizations have complicated orchestration requirements, some don't. Some organizations have large amounts of heterogeneous data, some don't. Some organizations need to run large distributed machine learning tasks, some don't.

If you adopt tools that don't actually solve real problems that your team/org faces, then of course you'll end up with a bunch of extraneous fluff.

Your comment has the same flavor as the frequent HN meme of "I just use Vi and Tmux and a BSD server for my business and anything else is a sign of organizational failure and individual weakness".

All of these tools exist for a reason. Where you get into trouble is losing focus on solving your real practical problems.


> Where you get into trouble is losing focus on solving your real practical problems.

Absolutely agree. As I say YMMV, but my experience is that often people have adopted tools because "that's what data teams use" rather than because it's the best tool for the job. Often, we could — and should — build an in-house solution that fits perfectly rather than buying (every month) an imperfect solution.


> Some organizations need to run large distributed machine learning tasks, some don't.

I think almost no organizations need to run "large distributed machine learning tasks". Many just think they do, because it's the buzzword of the moment.


Also true, but I think maybe that's a great example of what I and OP are saying here. Do you really need it? Or do you just think you might need it one day because people are talking about it on Medium, or because you had a nice chat with some sales rep? Or do you just want it on your resume?


Facebook's ad prediction models weren't distributed until maybe 2014 or 2015. Just buying bigger boxes can get you a loooong way.


> All the layers/tools that are heaped upon one another just lead to complexity and dependence on paid services. I'm not advocating for reinventing the wheel, but rather that bespoke solutions seem to be worth the cost.

Which third party products do you use in your stack, then?

Do you only use basic cloud building blocks (VMs, EBS, S3), or F/OSS components like Spark and Trino, or are higher level things like Tableau/Looker also part of the solution?


If you wouldn’t mind: in your experience how often does the problem seem to stem from “we might need that later…” or “we don’t have a concrete set of questions we want to answer?”

I have considerably less experience but so far I have experienced a complicated data science stack being used as a way to handle a ton of data without clear questions and desires on how to act on possible answers. As if they’re searching a haystack for… something?


IME, it has usually stemmed from "that's how everyone else is doing it, so we should too" without considering specific use/business cases. Senior management were simply not willing to deviate from the norm.


And also: I need a job. My job/workplace has these tools. They are usable for my job and if I were to question the tools that would come on top of my workload, so I use them even if I know that cheaper tools were used at my last job.

Do I care if they pay my salary and the tools are usable?


That's perfectly reasonable, but it sounds like you're saying the tools work for your use case. So long as the executive/management level is evaluating the cost-benefit of other tools, and is receptive to voices from ICs, then there's no problem to solve.

It's just not what the GP is describing. In some orgs you end up with these expensive, standard-ish "suites" that are way too generalized, don't solve business needs as efficiently as other tools, are painful for ICs to use to meet requirements, and it seems like no one can really explain why these tools are being used over something else other than rumors (the previous CTO came from this vendor, we got a sweetheart deal 10 billing cycles ago, etc.).

So really this is an issue of ownership; someone needs to be accountable for infrastructure decisions and outcomes, and it's way healthier for an org if ICs are aligned with management on the same. It sounds like you're not in that sort of org, so good for you.


But do you care if they pay your salary and the tools are barely usable?


You are describing exactly how the MDS works: buy what you want and build what you want.


Disclaimer: I am the co-founder of a competitor, Bruin (https://getbruin.com). We are exactly the kind of integrated platform Tristan is talking about.

The article resonates with me a lot, not least because we called this out months ago. I find it funny that Tristan is now walking back the premise of the MDS and claiming it is no longer useful, because they were one of the main drivers of the term and the whole hype around it.

He even acknowledges that they played ball with other companies in the space:

> There was a lot of valuable co-marketing, partnership deals, co-sponsored events, and co-selling. This had real value for everyone involved—customers and vendors alike. Companies voluntarily integrated their products together, cross-promoted each other publicly, and built partnerships that made owning and operating these technologies far easier for customers.

Sorry, but no, this didn't have value for the customers, only for the vendors. The customers were left alone by themselves to deal with all of this complexity, and the vendors made a lot of money off of that. They convinced companies that they needed a bunch of different tools to build a simple pipeline, and rode on the wave of huge valuations based on these same ideas that they are walking back on.

The companies prefer integrated solutions now because they woke up. Instead of paying $150k/y each to Fivetran, dbt, and whatnot, they realize that they are better off just hiring an engineer or two in the worst case. It is 2024, and none of these tools still properly talk to each other. Do you want to get an end-to-end lineage of your data? Good luck with that. How about quality? How about governance? Companies are left alone with this hype cycle.

I'd claim that a significant part of the blame lies on the executives and leaders in the companies, who just jumped on the ship for the sake of building their CVs and skipped the critical thinking step. None of them seriously asked themselves the question of whether or not it makes sense.

To be honest, I feel sorry for all the money spent on building solutions around all of these hype-driven products.


As someone who has worked in this space, I would intuitively go with the decoupled tools every time. The toughest thing is matching a monolith of opinions with your requirements. Individual tools leave design space in their integration and thus allow proper engineering. That they allow it does not mean it happens. If you go with a shopping-bag architecture then you might be better off, results-wise, with an integrated solution; but still, when the real engineers come along, they'll have an uphill battle to uproot an integrated solution vs. redesigning the combination of individually good tools.


> I would intuitively go with the decoupled tools every time.

Seems to me that the problem is, each of these "decoupled tools" you mention isn't some specialised little awl or hacksaw ("the Unix philosophy..."), nor even a do-anything Swiss Army knife -- but a whole stack of cloud services, "integrations", and administration tools in its own right. A whole machine shop of machine shops. Building your own custom setup out of that is hard. You'll end up with something at least as unwieldy as a really big do-anything single-vendor mega-machine shop stack of cloud services, "integrations", and administration tools.


I don't even understand your argument.

This is all just buy vs build. Pay Fivetran, or pay an engineer to build and maintain API connections. Pay dbt or use dbt core and orchestrate it yourself. There are tons of observability tools, lineage tools etc that you can buy. Or build them yourself. Or use AWS or GCP and their broad tooling.

You'll never convince me that having more options is bad.


He's not "walking back on the premise of MDS", but acknowledges that the term "MDS" had been co-opted and lost its descriptive fidelity.

> Sorry, but no, this didn't have value for the customers, only for the vendors

Absolutely untrue. I've discovered multiple vendors over the last two decades because of co-marketing and partnership deals, and I know many others that have. My success rate with these vendors is no different from vendors I found myself or were referred to me by others.

> The customers were left alone by themselves

This is true with every vendor integration with every company on the globe. You either 1) have internal expertise to push through the pain, 2) hire support from the vendor itself (often X hours come baked into the deal), or 3) hire a consultant. This is as true for the MDS vendors as it is for Salesforce, AWS, or an ERP. It's true for your company, too.

> They convinced companies that they needed a bunch of different tools

Except you do? You need a database, you need something to extract data and move it locally, you need something to transform that data into something useful, and you need the ability to deliver that data to end users. This was true twenty years ago, true ten years ago, and still true today. The all-in-one vendors prior to MDS were horrible, and I'd rather cut off my arm than use something like Pentaho again. I'm also not convinced the current all-in-one vendors are any better.

One of the advantages of the pick-your-stack MDS was that you could tailor the tools to your organization's specific needs, rather than having to fight against whatever your mega-vendor offers. I loved the fact that I could use open source for 90% of my stack, and then use PowerBI for the last 10%. Or I could be 100% FOSS and deliver data with Metabase. Or I could use Tableau for 50%, and FOSS for the other 50%.

> they realize that they are better off just hiring an engineer or two in the worst case

This simply isn't true for every size or class of company. Many businesses are better off with vendor-supported platforms than trying to build something bespoke, which can quickly become a liability with staff turnover or lack of strong internal technical leadership.

> It is 2024, and none of these tools still properly talk to each other

What does this even mean? How does a data extract tool "talk" to a transformation tool? Or "talk" to the database? I can see from the screenshot on your website that it offers a GUI with drag-and-drop - is that your definition of "talking"? Or are you referring to a tight integration that doesn't require additional work to glue them together?

> How about quality? How about governance?

These are management and organization issues, not tooling issues. I've seen amazing quality and governance from teams working in MSSQL/SSIS, and I've seen horrible quality and governance from teams using Hadoop/Spark. A tool can make a competent practitioner's job easier, but it's not going to make the incompetent competent.


Disclaimer: I'm the co-founder of Syncari. I agree 100%. The whole notion of the modern data stack is a myopic, neither-here-nor-there philosophy for addressing data, in my opinion.

Just take a look at the sheer number of tools one has to run: ETL (or, more commonly, data dumpers), warehouse, transform tool, "reverse ETL", data quality tools, BI. And if you want some ML, that's a whole other game.

We also believe strongly in an integrated approach to this space, with a data model at the center of it all.


Deleted


The Modern Data Stack / MLOps product space was succinctly described by one actually-technical CEO as "vending into ignorance"; the author corroborates this with a commendably candid take:

>Imagine it’s 2021, peak MDS, and you meet the CDO of a large bank. “Oh cool,” she says, “you’re the CEO of a tech company. What does your product do?” What do you say?

>“We build a tool that leverages the power of the cloud to apply standard SQL and software engineering best practices to the historically mundane (but critical!) job of data transformation.”

>“We’re the standard for data transformation in the modern data stack.”

>I will tell you that, empirically, option #2 is more effective.

This tallies with what I've seen from a lot of enterprise CxOs and their teams as technology hype moved from big data and blockchain on to data science/machine learning.

There is so much to write about this, but I'll just recommend "Life Cycle of a Silver Bullet" http://freyr.websages.com/Life_Cycle_of_a_Silver_Bullet.pdf, which deserves more attention than it's had on HN.


Is MLOps more or less of a thing than prompt engineering?


MLOps is deploying, monitoring and (re)training ML models. Sits in the DevOps and data engineering space.

Prompt engineering is making generative AI do what you want by crafting the right context. I would put it somewhere in the software and data engineering space, given they will most likely integrate applications with it. MLOps comes into play if you have your own trained or tuned model.


More of a thing, but it's mostly DevOps.


I've been in data and analytics for over a decade and co-founded a consolidated-ETL company (https://www.polytomic.com).

Nice to see Tristan realising this (people should be commended for changing their minds). There were three problems with this term:

1. It mostly resonated with VCs and industry observers, whose jobs are to peddle buzzwords, rather than with users who simply want their problems solved.

2. It was (and is) ill-defined. Ask a group of people in the industry to define it and you're guaranteed to get different answers.

3. It committed the cardinal sin of using the adjective 'modern' in a name. At some point, everything that is modern today stops being modern. Couple this with (2) and your term is now even more meaningless.

At a mundane level there's no drama here: just another example in the long list of noise contributors to the free-money party that was going on during the Covid years (as the essay acknowledges).


It seems there's a philosophy with these kinds of cloud data services: measure everything, then figure out what to measure after.

It seems that the better approach is to make a hypothesis first, collect only the data you need, and then analyze.

Otherwise, the indirect costs of retooling around "measure all the things indiscriminately" and the direct costs of paying for such services don't seem worth it.

Can someone offer a reasonable rebuttal?


The guy you're arguing with in the reply thread has a point but I think you're talking past each other because you're coming at it from two different sides. I understand his point about syncing everything if you're a small enough company or the data is small enough because the cost is immaterial.

I agree with most of what you're saying and have seen the downfalls of "just sync all our data" first hand. At a late stage start-up I worked at, we would often have teams (finance, CX, etc.) implement a shiny new tool and immediately request to have that data accessible for analytics/reporting. If we did blindly agree, the data would often never be used and it would run up our Fivetran costs to sync the data and our Github actions bill to build the dbt models and include it in our modeling.

We eventually landed on an approach that considered the following:

- Does the tool already have reporting baked in? If so, just use that until you need more.

- How often are you really going to be using it and what makes this data so special that you need to join it to our other data?

- How much work will be involved to implement a sync for this data? Does Fivetran or Segment have a connector readily available?

- If the tool doesn't have reporting built-in and the team did have a genuine business case for the data being synced, we would figure out which specific objects from the source needed to be synced and sync them at a fairly infrequent cadence (once a day).
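For what it's worth, that vetting flow is simple enough to write down. A toy sketch in Python (the field names and return strings are mine, purely illustrative of the checklist above):

```python
def sync_decision(request):
    """Walk the vetting questions in order; return a recommendation string."""
    if request["builtin_reporting_suffices"]:
        return "use the tool's built-in reporting for now"
    if not request["business_case"]:
        return "decline: no concrete case for joining this to other data"
    if not request["connector_available"]:
        return "scope the custom sync work before committing"
    # Genuine case and a ready connector: sync only the needed objects, daily.
    return "sync {} once a day".format(", ".join(request["objects"]))

print(sync_decision({
    "builtin_reporting_suffices": False,
    "business_case": True,
    "connector_available": True,
    "objects": ["invoices", "tickets"],
}))
```

Encoding it this way forced the requesting teams to actually answer each question before any Fivetran connector got switched on.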

So while I offer no reasonable rebuttal, I wanted to explain a happy medium that worked for us.


Thanks for the insight!

As for communicating the issue to administration I feel that putting things in terms of costs in dollars, even if the approach is as arbitrary as any other form of internal accounting, can help make the case.

Management is as susceptible to the marketing of panaceas as engineers are! They must be informed of the true cost.


It doesn't always make sense to "collect only the data you need." For example, someone in our org will ostensibly have a good reason for setting up an event in Google Analytics. As the head of analytics I have no use for their event but I'm going to end up ingesting it into BigQuery anyway because the built-in GA4->BQ transfer sends ALL the data, and I'm fine with that because it's damn near free for me to ingest and store. I could instead "collect only what I need" by running the transfer through an ELT tool and filtering out that event, but why would I bother doing that work and paying the ELT vendor money when Google will do it for free and charge me like $1K/year to store 100M rows of everything collected in GA4?


That cost analysis seems incomplete. How much did it cost to implement in the first place? That seems like millions of dollars in capitalization before service expenses are even considered!


Sorry but I can't tell if you're being sarcastic or not. Google Analytics and BigQuery are both nominally free, the service expense is $1K/year, and the implementation cost literally a few minutes of my time.


I’m not being sarcastic. There is engineering labor required to hook up all of these events. Then there are maintenance costs, and many more costs, direct or otherwise.


With respect to the example I gave, no, there actually isn’t. In fact I don’t even understand what you think the work and costs involved are.


I’ve seen organizations spend millions just in engineering labor costs for such endeavors. But sure, for small companies with limited domain events to begin with, the costs are relatively lower, and with lower domain complexity it is probably a smaller share of engineering time overall.

But there are still opportunity costs! It’s not free to install dependencies across numerous code bases and hook into and label the events, even for a small business.


OP suggests using "built-in GA4->BQ transfer [to send] ALL the data".

This seems like an extremely small endeavor.


So yes, for a small business that just drops a GA snippet into the bottom of their app it is inexpensive. I’m not sure that this qualifies as the “modern data stack”.

That’s also not what happens when a larger company integrates Segment into their applications! That takes many engineering months to set up and then maintain for the lifetime of the app.

If the full costs are analyzed, is the money spent well?

There’s some irony here in that much effort is put toward analyzing user behavior, but very little toward accounting for where the money is spent internally.

That, rather than what the author of the article claims, is why there is a slowdown in customers using these MDS products: when feet are put to the fire, expensive services that haven’t paid off their debt are given the axe.


I think i can spot the difference in interpretation here.

GG*GGP "someone in our org will ostensibly have a good reason for setting up an event in Google Analytics"

In this case they already have Google Analytics. The question is whether it was worth exporting that data to BigQuery.

Then whether it was worth setting up GA in the first place is not discussed.


Isn't the setup cost a key part of the total cost of analytics, especially when it comes to these MDS systems?

I encourage anyone in a position to make purchases, or who has a say in what engineering labor focuses on, to get a copy of Horngren's Cost Accounting. It should make my position very clear!

FWIW, there is indeed capitalization occurring when an organization incorporates analytics, even the "measure all the things" approach, but the question is, once the total costs are discovered, does the return outweigh the expenditure?

These things cannot be considered in isolation, especially when using advanced approaches like Activity-Based Costing, something that is basically a requirement for a product or service that is fundamentally software in nature. There is too much complexity to treat managerial accounting in a software setting like a steel mill.

The reason this doesn't happen is YOLO business management fueled by the casino that is VC-backed companies. But we are in a rapidly maturing industry where margins are slimming and costs need to be reined in, especially in publicly traded companies, where investors will punish a company that makes wildly off-target financial projections discovered when cash flows are reported in the following quarters.

I encourage everyone in all management positions to start thinking about and tracking actual costs. Not story points, not customer use patterns, but cold hard cash!


Yes, we spent money and engineering time setting up a "data layer" for Google Analytics, though it was mainly to ensure continuity with existing reporting on core site metrics.

That said, most routine data collection in our org isn't from custom instrumentation requiring engineering lift. It happens when someone sets up their own event in Google Analytics or adds a new field in Salesforce or whatever. This kind of bloat is inevitable, and at my scale (which is pretty big but not like F500 big), with my stack, it's always going to be cheaper to just ingest everything than to spend labor on trying to manage and minimize it, especially when I can ingest a lot of stuff for cheap or free and then discard what I don't need.


Thank you for the clarification! It has given me some good meat to chew on. It seems to give a lot of credence to the idea that you don’t need much more than GA!


Oh, it's something for the ad industry.

Ads are watched only by people who don't have good ad blockers.


Compiled a quick list of terms to help understand this piece:

MDS - modern data stack

BI - business intelligence

Redshift - Amazon Redshift, AWS's data warehouse service; handles large-scale data sets and database migrations, and runs analytic workloads on big data sets with a column-oriented, massively parallel processing (MPP) architecture built on technology from data warehouse company ParAccel (later acquired by Actian)

ETL - extract, transform, load

ELT - extract, load, transform -- an alternative to ETL that stores raw data

Looker - does BI, started with looker data sciences, acquired by google

Tableau - data visualization focused on business intelligence

clickstream data - user website navigation records

snowflake - an MDS data platform solution

mongo - nosql database

datadog - monitoring, analytics for devops

confluent - real time data streams

databricks - web based cluster management and data lakes for machine learning

meme-ification - reduction of an idea to cartoons

peak - highest point preceding a drop-off

fivetran - data ingestion

dbt - dbt labs, data transformation

CDO - chief data officer

co-marketing - integrated marketing such as linked brands

co-sponsored - cooperative sponsorship

co-selling - shared sales assets including data records

ARR - annual recurring revenue

ZIRP - zero interest rate policy

private multiples - private company valuation calculations, metrics

forward revenue - revenue expected in the future

PowerBI - microsoft business analytics


I think you need to go a step further and define a few more terms like what you mean when you say "Analytics" and "business intelligence"

For example >PowerBI - microsoft business analytics

I don't think I would classify Power BI as an analytics tool - in my limited experience with it, it has some nice ways to build reports and visualize data graphically, but only very rudimentary statistical functions, and it is not really suited to performing data analysis.

My organization is travelling along the "modern data" pipeline so I'm grappling with some of the same confusion myself about what everything is in the new world...

In the "old data" world I'm familiar with, I would use stuff like SAS or R to perform ad-hoc analytics, i.e. analyze data using statistics: regressions, time-series forecasting, distribution analysis, etc.

For reporting and visualizing data, that would fall to things like Excel, Cognos, ggplot2, PROC REPORT, and some JavaScript plotting packages like d3.js.

To extract data, I would run SQL queries directly against databases (Oracle, DB2, Microsoft SQL Server, etc.), then use SAS or R to transform the data as needed.

The modern approach seems to be Databricks to ingest data into a data lake, Python via Jupyter notebooks (hosted on Azure ML Studio, running in the cloud on a "compute instance" for another level of indirection) to perform analytics, then Power BI to visualize and build reports.


DBT in simplest terms is just a way to orchestrate transformations in dependency order. Table "A" needs to be updated before table "B," because "B" selects from "A." It works well for that.
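Mechanically, that dependency ordering is just a topological sort over the model graph dbt infers from `ref()` calls. A minimal sketch of the idea (model names are made up, and this is not dbt's actual implementation):

```python
from graphlib import TopologicalSorter

# Hypothetical model graph: each key maps to the models it selects from,
# the way dbt infers edges from {{ ref('...') }} in each model's SQL.
deps = {
    "stg_orders": set(),                      # staging model, no upstreams
    "orders": {"stg_orders"},                 # orders selects from stg_orders
    "revenue": {"orders", "stg_orders"},      # revenue joins both
}

def build_order(deps):
    """Return a build order where every model runs after all its upstreams."""
    return list(TopologicalSorter(deps).static_order())

print(build_order(deps))  # stg_orders before orders, orders before revenue
```

Everything else dbt layers on top - materializations, tests, docs - hangs off that same graph.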

I've seen it used to get wild west SQL logic into version control. To replace scheduled SQL workbooks running entire data pipelines.

I've also seen over-engineered integrations with other orchestration tools in the "stack."

Once a DBT project grows to a certain size, roughly 500-1000 .SQL files, it gets hard to manage. Not impossible, though it takes intentionality about how to group things together and scale them out operationally.

"Slim CI" is a recent buzzword that has some nice ideas about build automations.

Cosmos is supposed to solve a lot of the automation gripes with dbt. I haven't tried it yet but would like to. https://www.astronomer.io/cosmos/


The Modern Data Stack is Dead! Long Live the Analytics Stack!

Interesting to see one of the strongest popularizers of the term acknowledging this shift. The free money is over, so we need to get back to work and dbt is the least bad way to organize lots of SQL for data management.


What would HN commenters define as a truly modern data stack right now? Mostly asking what this chart https://a16z.com/emerging-architectures-for-modern-data-infr... would look like in 2024.


  - Postgres and Kafka
  - Airflow and Flink
  - Kudu+Impala or Clickhouse
  - Iceberg/Parquet or Delta
  - Ozone(HDFS) or Ceph
  - Spark Rapids and/or Ray+Metaflow
  - Open Metadata or Atlas+Ranger


TBH that sounds like a heap of buzzwords not useful for anyone (but maybe an architect promoting his career).


Pretty much, but updated for 2024.


> What would HN commenters define as a truly modern data stack right now?

Just don’t spend too much time debating what it is, or the analysts who need numbers yesterday will develop their own solutions before you’ve even published your A3 architecture diagram.

A bird in the hand, as they say.


IT DEPENDS. But for standard ecom/ops stuff (and this is heavily biased by the fact that this is what I know and have deployed with low overhead): BigQuery, fed by an ETL pipeline of the stuff you need; dbt to ELT into useful, reportable tables; something to visualise - I still like Mode, but there are lots of new BI toys around. KISS.

This is extremely low maintenance at the actual requirement levels of most businesses with < 50 employees.


- Fivetran

- Snowflake/BigQuery/Redshift

- dbt

- Looker/Metabase/Tableau


Intriguing that the CEO of DBT declares MDS ("modern data stack") to be a meaningless term (i.e. no longer useful). Maybe a good demarcation that we are entering the post-modern phase of cloud tech generally? Cloud still there, but not the only option.

I look forward to this happening with GenAI -- the tools we come up with will be pretty cool in the long run. I hope we can find ways around platform enshittification because that really sucks. Similar for blockchain tech.

I see a common thread in these techs too -- the same type of LinkedIn influencers shop these during their hype phases. Ultimately, useful technology and best practices come out of it!


Well put! Nothing's modern anymore; the tech cycles close in on one another so fast that the old one hasn't gone out of phase before it's overtaken by not one but several new ones. In a nice touch of self-referentiality, your message itself is very post-modern.


I hadn't considered that. Thanks for pointing that out to me :)



