I hate it when someone suggests adding process to fix a problem - and I've seen this quite a few times already.
Most people clearly understand the benefits of adding process, but very few seem to realize the costs.
If I tried hard I am pretty sure I could create a checklist with 1000 items for each developer to go through, and no one could argue against any of the items - individually, they would all be ok / necessary / correct. However, if I forced every developer to go through the list every time, for every change, they would - rightly so - feel crushed.
With very few exceptions - where a new process really is warranted - I see people trying to substitute process for thinking or automation. That's a recipe for bureaucracy, and in my view a good part of why working for a BigCo can be so miserable sometimes.
A new process should be a last resort, adopted only after we've answered yes to all three: a) Is it really beyond us to automate this? b) Is there some flaw in human beings that ensures this mistake will repeat? c) Are the consequences of the mistake really serious?
I agree that process can have a demoralizing effect, but I like to play devil's advocate.
An hour of Google's revenue lost from 14% of their customers costs them about $350,000 (judging roughly by Q1 2009 revenue numbers). Had 100% of customers been impacted during that hour (i.e. a bigger goof-up), they'd have lost ~$2.55 million.
If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.
So is it worth investing in processes to avoid that? Absolutely. Even if they can't find a good way to automate this, it's hard to argue against protecting against that sort of loss of revenue however you can.
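The back-of-the-napkin arithmetic above is easy to reproduce; the ~$5.5B quarterly revenue figure is an assumption here, roughly in line with what Google reported for Q1 2009:

```python
# Rough sketch of the revenue math above; the ~$5.5B Q1 2009 revenue
# figure is an assumption based on Google's reported quarterly numbers.
quarterly_revenue = 5.5e9          # dollars, Q1 2009 (approximate)
hours_in_quarter = 90 * 24         # ~90 days in a quarter

revenue_per_hour = quarterly_revenue / hours_in_quarter
loss_14_percent = 0.14 * revenue_per_hour

print(f"Revenue per hour: ${revenue_per_hour:,.0f}")   # ~$2.55 million
print(f"14% for one hour: ${loss_14_percent:,.0f}")    # ~$356,000
```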
It's incredibly myopic to say that an hour of downtime equals an hour of lost revenue. When my favorite takeout place has busy phone lines, I wait 5 minutes and then call back.
I wanted to search for something during the downtime, and I didn't go to Yahoo--I waited. They definitely lost revenue, but it is a ridiculous, baseless claim that everyone went somewhere else during the downtime. You have absolutely no data to make any such claim.
Furthermore, your calculated cost of an engineer's time is simplistic and inaccurate. It doesn't count the revenue lost by delaying the release of their work, or the reduced time value of that money (getting money earlier means more time to multiply it through investment and reinvestment).
And, even worse, you are comparing the DAILY costs of developer time (which you grossly underestimated) to something that happens, maybe, once or twice per decade.
It could cost them many millions--maybe even hundreds of millions or more--per year to implement such a policy.
That's why these mistakes happen--it is cheaper to fix the rare screw-up than to waste too much time checking everything, except for very few circumstances.
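The trade-off in the comment above is an expected-value comparison: the cost of a rare outage versus a daily prevention cost. A sketch with illustrative numbers (all figures are assumptions pulled from elsewhere in this thread, not actual Google data):

```python
# Expected-value sketch of "it is cheaper to fix the rare screw-up".
# All figures are assumptions for illustration only.
outage_cost = 2.55e6           # one bad hour at 100% impact (see above)
outages_per_year = 0.2         # "once or twice per decade"
expected_outage_cost = outage_cost * outages_per_year   # ~$510k/year

daily_process_cost = 450_000   # 9,000 engineers x 1 hour x ~$50/hour
working_days = 250
annual_process_cost = daily_process_cost * working_days # ~$112.5M/year

# Under these assumptions, the prevention spend exceeds the expected
# failure cost by roughly two orders of magnitude.
print(annual_process_cost / expected_outage_cost)
```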
"If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000."
Uhm, it's closer to $500k. Per day. So based on today's goof, which is presumably rare, with your batch of added paranoia they'd still be out $150k for the day, and they'd be out $500k on every day that there wasn't a colossal screwup.
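The correction checks out with a quick sketch (assuming ~2,000 working hours per engineer per year, salary only, no overhead):

```python
# Daily cost of 9,000 engineers each spending one hour on process.
# Assumes ~2,000 working hours/year; salary-only, no overhead.
engineers = 9_000
salary = 100_000                  # dollars/year, average
hourly_rate = salary / 2_000      # ~$50/hour

daily_cost = engineers * hourly_rate * 1   # one hour each, per day
print(f"${daily_cost:,.0f} per day")       # $450,000 - in the ~$500k
                                           # ballpark, not $308k
```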
You're 100% right. I picked too outlandish a number in my devil's advocating and back-of-the-napkin'd the engineering cost wrong. Sorry about that - can't correct my comment anymore, unfortunately. You shoulda slammed me sooner. I can't see any rational reason to argue that way with those economics.
It turns out that you can reduce the number of surgical instruments left in patients by 33% by implementing checklists. Surgeons won't do it because it feels beneath them.
What bothers me is why this isn't more like 90% - in other words, why don't you check all instruments in and out? Anything else is pretty lame supply-chain management.
I'm sure they have plenty of safeguards in place. Think about all the times in the last 5 years that Google's service has been seriously interrupted. Oh wait, there aren't any. Their turnaround time on this bug was pretty fantastic as well. Discovered, analyzed, patched, and apologized for in under 24 hours. Nice.
You are right. It's unfair (and all too easy) for someone from the outside to point their finger and say, "Bad! They should have done X."
However, the reality is that Google has positioned themselves such that they have quickly become a utility. This goes beyond search: they are a communications network (email, chat) and an ad service, among other things. How many websites depend solely on Google AdSense for their revenue?
Just like electricity, telephone service, and other utilities, we have come to depend on them for living our daily lives. Loss of service at a large scale cannot be tolerated.
This may have been a relatively minor event, but the point remains that it still seems much too easy for such events to take place. I feel that Google should be one of the companies pushing technology in this area forward. Hopefully they will release information on how they plan to prevent this sort of event from occurring in the future.
This is at least the second time this year that they've had a major issue for an hour. The other was the snafu where they marked the entire web as malware.
They respond awesomely, but they're not immune to major issues and I doubt it will get any easier for them.
Sure, but even if something is faulty I would guess that they have a way to test these updates on only a small portion of their network/traffic (at least much smaller than the 14% they said was affected).
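A common way to get that kind of limited exposure is deterministic traffic bucketing: a sketch of the staged-rollout idea the comment guesses at (the function name, hashing scheme, and percentages are hypothetical illustrations, not Google's actual mechanism):

```python
import hashlib

def in_canary(user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into the canary group.

    Hashing the id gives a stable, roughly uniform spread, so a
    risky change can be exposed to e.g. 1% of traffic before 100%.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < rollout_percent * 100  # percent -> basis points

# Roll a change out to 1% of users first, watch error rates, then
# widen to 10%, 50%, 100% only if nothing breaks.
canary_users = sum(in_canary(f"user{i}", 1.0) for i in range(100_000))
print(canary_users)  # roughly 1,000 of the 100,000 simulated users
```

Because the bucketing is a pure function of the user id, the same users stay in the canary group across requests, which keeps any breakage confined to the same small slice of traffic.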