
Some ETLs these days are just simple scripts - e.g. Spark or Dataflow jobs - plus their config files.


Being able to define an ETL workload within a simple script is not the same as your ETL system itself being a simple script.

While I love both Spark and Dataflow, both of them are incredibly complex distributed systems with very high operational costs. Someone, somewhere is paying a lot of money to have an operational resource maintain that complexity. Whether you have an internal devops resource doing so or you're using a managed service, you're paying for that complexity somehow. And, for a lot of workloads, you aren't actually getting any more value than you would from standing up a ~$50/month standard Debian/Ubuntu server and a set of simple scripts on it.


You don't have to have a cluster to run Spark scripts; setting the master to `local` (and running it on one machine) is often enough for small amounts of data.


They don't have to have high operational costs - you can run them as a script on your local machine. You're making them out to be more complex than they really are.



