
> clusters should be treated like cattle, not pets.

Off-topic, but is this really how people do k8s these days? Years ago when I was at Google, each physical datacenter had at most several "clusters", which would have fifty thousand cores and run every job from every team. A single k8s cluster is already a task management system (with a lot of complexity), so what do people gain by having many clusters, other than more complexity?



The most common reason I've heard is "blast radius reduction", i.e. the general public is not yet smart enough to run large shared infrastructure. That seems like something that should be obviously true.

People had exactly the same experiences with Mesos and OpenStack, but k8s has decent tooling for turning up many clusters, so there is an easy workaround.


I still feel like that would only work in very niche cases.

I mean, if people aren't smart enough to run one large shared infrastructure, how can I trust them to run a large number of shared clusters, even if each cluster is small? The final scale is still the same.


Updating 100 clusters bears less risk than updating a single giant one.
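To make that concrete, here is a rough sketch of a staged rollout that stops at the first unhealthy cluster. Everything in it is hypothetical: `staged_rollout`, `apply_update`, and `is_healthy` are placeholder names, not part of any real tool, and it assumes equal-sized, independent clusters whose breakage is caught by a health check before the rollout moves on.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    clusters: Sequence[str],
    apply_update: Callable[[str], None],  # hypothetical: upgrades one cluster
    is_healthy: Callable[[str], bool],    # hypothetical: post-upgrade health check
) -> Tuple[List[str], bool]:
    """Update clusters one at a time and halt at the first unhealthy one.

    Returns the clusters that were touched and whether the rollout completed.
    """
    touched: List[str] = []
    for cluster in clusters:
        apply_update(cluster)
        touched.append(cluster)
        if not is_healthy(cluster):
            # Stop here: the remaining clusters never see the bad update.
            return touched, False
    return touched, True


def fraction_exposed(touched: Sequence[str], clusters: Sequence[str]) -> float:
    """Share of clusters (a proxy for capacity, if equal-sized) hit by the update."""
    return len(touched) / len(clusters)
```

Under those assumptions, a bad update that breaks whatever it touches takes out one cluster in a hundred before the rollout halts, whereas with a single giant cluster the first step already is the whole fleet. It says nothing about failures that the health check misses or that are shared across clusters.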


And no SRE would allow you to run your application in a single cluster. Borg Cells were federated but not codependent - Google's biggest outages were due to the few components that did not sufficiently isolate clusters from one another.

Clusters are probably still pets to most orgs, but the lessons about how to manage complexity still apply. Each of my terraform state files is a pet and I treat it as such... but I also use change control to ensure that, even though I don't regularly recreate it from scratch, I understand everything that is in there.


There are potentially quite a few benefits of being able to spin up clusters on demand [1]:

* Fully reproducible cluster builds and deployments.

* The type of cluster can be an implementation detail, making it easy to move between e.g. Minikube, Kops, EKS, etc. After all, K8s is just a runtime.

* Developers can create temporary dev environments or replicas of other clusters

* Promote code through multiple environments from local Minikube clusters to cloud environments

* Version your applications and dependent infrastructure code together

* Simplify upgrades by launching a brand new cluster, migrating traffic and tearing the old one down (blue/green)

* Test in-place upgrades by launching a replica of an existing cluster to test the upgrade before repeating it in production

* Increase agility by making it easier to rearchitect your systems - if you have a pet, modifying the overall architecture can be painful

* Frequently test your disaster recovery processes as a by-product for no extra effort (sans data)

* Reduced blast radius

[1] https://docs.sugarkube.io/#benefits-of-sugarkube
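As a minimal sketch of the blue/green point above, assuming some traffic router you can reweight: the `Router` class, the `smoke_test` callback, and the step percentages below are all made up for illustration and do not reflect Sugarkube's or any other tool's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Sequence


@dataclass
class Router:
    """Hypothetical traffic router: cluster name -> share of traffic in percent."""
    weights: Dict[str, int] = field(default_factory=dict)


def blue_green_upgrade(
    router: Router,
    old: str,
    new: str,
    smoke_test: Callable[[str], bool],  # hypothetical check run against the new cluster
    steps: Sequence[int] = (10, 50, 100),
) -> bool:
    """Shift traffic from the old cluster to the new one in steps.

    If the smoke test fails at any step, send all traffic back to the old
    cluster and report failure. The old cluster is only safe to tear down
    when this returns True.
    """
    for pct in steps:
        router.weights = {old: 100 - pct, new: pct}
        if not smoke_test(new):
            router.weights = {old: 100, new: 0}  # roll back
            return False
    return True
```

The appeal is that the old cluster stays untouched until the new one has taken all the traffic, so rollback is just a reweighting. Stateful data is the hard part, and this sketch ignores it entirely.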


For one, I think you cannot easily have masters span regions without the risk of them losing contact with each other. Similarly, the workers should be located close to the masters. If there's a counterexample to this, I'd love to see it.



