
Exactly right. Network partitions are a hard problem in automatic failover of replicated systems. You don't want the standby to become master unless it can be sure the primary master is really down. It's difficult to unwind the mess if two masters are active and both taking changes.

In high-availability system design, the secondary node literally has to cut the primary's power (a fencing technique called STONITH, "Shoot The Other Node In The Head") to ensure the primary is really down when it stops responding over the network.
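
To make the ordering concrete, here is a minimal sketch of the decision a standby makes. The three callables are hypothetical hooks, not the API of any real cluster manager: `primary_responds` is a health check, `fence_primary` stands in for whatever power-off mechanism is available and returns True only when the power-off is confirmed, and `promote_self` switches the standby to master.

```python
from typing import Callable


def try_promote(
    primary_responds: Callable[[], bool],
    fence_primary: Callable[[], bool],
    promote_self: Callable[[], None],
) -> str:
    """Return the role this node ends up in: "standby" or "master".

    All three arguments are hypothetical hooks supplied by the caller.
    """
    if primary_responds():
        # Primary is healthy, nothing to do.
        return "standby"

    # Silence could mean a dead primary or just a network partition.
    # We cannot tell which from here, so fence before promoting.
    if not fence_primary():
        # Fencing was not confirmed: promoting now risks two masters.
        return "standby"

    promote_self()
    return "master"
```

The point of the sketch is only the order of operations: promotion happens strictly after a confirmed fence, never on a missed heartbeat alone.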

Of course, over long-distance cross-datacenter replication, shutting down power remotely is not reliable. In the last HA clusters I built, failover between datacenters was a manual decision. That means there could be a 15-to-30-minute window before the manual failover happens, but it's an acceptable risk since datacenter failure is rare, like an AWS outage once in a blue moon.
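
Roughly, the policy looks like the sketch below. The `FailoverRequest` type and its field names are made up for illustration; the assumption is that fencing can be trusted within one datacenter but not across datacenters, where a human has to approve.

```python
from dataclasses import dataclass


@dataclass
class FailoverRequest:
    # Hypothetical fields, named only for this illustration.
    same_datacenter: bool    # standby and primary share a datacenter
    fencing_confirmed: bool  # primary's power-off was confirmed
    operator_approved: bool  # a human signed off on the failover


def may_promote(req: FailoverRequest) -> bool:
    """Decide whether the standby is allowed to become master."""
    if req.same_datacenter:
        # Local failover: automatic, but only after a confirmed fence.
        return req.fencing_confirmed
    # Cross-datacenter: remote power-off is not reliable,
    # so only an explicit operator decision counts.
    return req.operator_approved
```

Note that a "confirmed" fence is deliberately ignored in the cross-datacenter branch, since the confirmation itself can't be trusted over that link.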



I remember being highly amused, when I first used Linux-HA in a project, to come across the acronym STONITH and discover that it meant "Shoot The Other Node In The Head" :-) http://www.linux-ha.org/wiki/STONITH

All jokes aside, it is indeed a very important concept when dealing with high availability.



