I hate it when someone suggests adding process to fix a problem - and I've seen this quite a few times already.
Most people clearly understand the benefits of adding process, but very few seem to realize the costs.
If I tried hard I am pretty sure I could create a checklist with 1000 items for each developer to go through, and no one could argue against any of the items - individually, they would all be ok / necessary / correct. However, if I forced every developer to go through the list every time, for every change, they would - rightly so - feel crushed.
With very few exceptions - where a new process really is warranted - I see people trying to substitute process for thinking or automation. That's a recipe for bureaucracy, and in my view a good part of why working for a BigCo can be so miserable sometimes.
A new process should be a last resort, adopted only after we've answered yes to all three: a) Is it really beyond us to automate this? b) Is there some flaw in human beings that ensures this mistake will repeat? c) Are the consequences of the mistake really serious?
I agree that process can have a demoralizing effect, but I like to play devil's advocate.
An hour of Google's revenue lost from 14% of their customers costs them about $350,000 (judging roughly by Q1 2009 revenue numbers). Had 100% of customers been impacted during that hour (i.e. a bigger goof-up), they'd have lost ~$2.55 million.
If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000.
So is it worth investing in processes to avoid that? Absolutely. Even if they can't find a good way to automate this, it's hard to argue against protecting against that sort of loss of revenue however you can.
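The back-of-the-napkin arithmetic above is easy to reproduce; the ~$5.5B quarterly revenue figure is an assumption here, roughly in line with what Google reported for Q1 2009:

```python
# Rough sketch of the revenue math above; the ~$5.5B Q1 2009 revenue
# figure is an assumption based on Google's reported quarterly numbers.
quarterly_revenue = 5.5e9          # dollars, Q1 2009 (approximate)
hours_in_quarter = 90 * 24         # ~90 days in a quarter

revenue_per_hour = quarterly_revenue / hours_in_quarter
loss_14_percent = 0.14 * revenue_per_hour

print(f"Revenue per hour: ${revenue_per_hour:,.0f}")   # ~$2.55 million
print(f"14% for one hour: ${loss_14_percent:,.0f}")    # ~$356,000
```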
It's incredibly myopic to say that an hour of downtime equals an hour of lost revenue. When my favorite takeout place has busy phone lines, I wait 5 minutes and then call back.
I wanted to search for something during the downtime, and I didn't go to Yahoo--I waited. They definitely lost revenue, but it is a ridiculous, baseless claim that everyone went somewhere else during the downtime. You have absolutely no data to make any such claim.
Furthermore, your calculated cost of an engineer's time is simplistic and inaccurate. It doesn't count the revenue lost by delaying the release of their work, or the reduced time value of that money (getting money earlier means more time to multiply it through investment and reinvestment).
And, even worse, you are comparing the DAILY costs of developer time (which you grossly underestimated) to something that happens, maybe, once or twice per decade.
It could cost them many millions--maybe even hundreds of millions or more--per year to implement such a policy.
That's why these mistakes happen--it is cheaper to fix the rare screw-up than to waste too much time checking everything, except for very few circumstances.
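The trade-off in the comment above is an expected-value comparison: the cost of a rare outage versus a daily prevention cost. A sketch with illustrative numbers (all figures are assumptions pulled from elsewhere in this thread, not actual Google data):

```python
# Expected-value sketch of "it is cheaper to fix the rare screw-up".
# All figures are assumptions for illustration only.
outage_cost = 2.55e6           # one bad hour at 100% impact (see above)
outages_per_year = 0.2         # "once or twice per decade"
expected_outage_cost = outage_cost * outages_per_year   # ~$510k/year

daily_process_cost = 450_000   # 9,000 engineers x 1 hour x ~$50/hour
working_days = 250
annual_process_cost = daily_process_cost * working_days # ~$112.5M/year

# Under these assumptions, the prevention spend exceeds the expected
# failure cost by roughly two orders of magnitude.
print(annual_process_cost / expected_outage_cost)
```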
"If you had 9,000 engineers making 100k a year on average, they could each spend an hour every day on paranoid safeguard processes and only cost the company $308,000."
Uhm, it's closer to $500k. Per day. So based on today's goof, which is presumably rare, with your batch of added paranoia they'd still be out $150k for the day, and they'd be out $500k on every day that there wasn't a colossal screwup.
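The correction checks out with a quick sketch (assuming ~2,000 working hours per engineer per year, salary only, no overhead):

```python
# Daily cost of 9,000 engineers each spending one hour on process.
# Assumes ~2,000 working hours/year; salary-only, no overhead.
engineers = 9_000
salary = 100_000                  # dollars/year, average
hourly_rate = salary / 2_000      # ~$50/hour

daily_cost = engineers * hourly_rate * 1   # one hour each, per day
print(f"${daily_cost:,.0f} per day")       # $450,000 - in the ~$500k
                                           # ballpark, not $308k
```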
You're 100% right. I picked too outlandish a number in my devil's advocating and back-of-the-napkin'd the engineering cost wrong. Sorry about that - can't correct my comment anymore, unfortunately. You shoulda slammed me sooner. I can't see any rational reason to argue that way with those economics.
It turns out that you can reduce the number of surgical instruments left in patients by 33% by implementing checklists. Surgeons won't do it because it feels beneath them.
What bothers me is why this isn't more like 90% - in other words, why don't you check all instruments in and out? Anything else is pretty lame supply-chain management.
I'm sure they have plenty of safeguards in place. Think about all the times in the last 5 years that Google's service has been seriously interrupted. Oh wait, there aren't any. Their turnaround time on this bug was pretty fantastic as well. Discovered, analyzed, patched, and apologized for in under 24 hours. Nice.
You are right. It's unfair (and all too easy) for someone from the outside to point their finger and say, "Bad! They should have done X."
However, the reality is that Google has positioned themselves such that they have quickly become a utility. This goes beyond search: they are a communications network (email, chat) and an ad service, among other things. How many websites depend solely on Google AdSense for their revenue?
Just like electricity, telephone service, and other utilities, we have come to depend on them for living our daily lives. Loss of service at a large scale cannot be tolerated.
This may have been a relatively minor event, but the point remains that it still seems much too easy for such events to take place. I feel that Google should be one of the companies pushing technology in this area forward. Hopefully they will release information on how they plan to prevent this sort of event from occurring in the future.
This is at least the second time this year that they've had a major issue for an hour. The other was the snafu where they marked the entire web as malware.
They respond awesomely, but they're not immune to major issues and I doubt it will get any easier for them.
Sure, but even if something is faulty I would guess that they have a way to test these updates on only a small portion of their network/traffic (at least much smaller than the 14% they said was affected).
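A common way to get that kind of limited exposure is deterministic traffic bucketing: a sketch of the staged-rollout idea the comment guesses at (the function name, hashing scheme, and percentages are hypothetical illustrations, not Google's actual mechanism):

```python
import hashlib

def in_canary(user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into the canary group.

    Hashing the id gives a stable, roughly uniform spread, so a
    risky change can be exposed to e.g. 1% of traffic before 100%.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < rollout_percent * 100  # percent -> basis points

# Roll a change out to 1% of users first, watch error rates, then
# widen to 10%, 50%, 100% only if nothing breaks.
canary_users = sum(in_canary(f"user{i}", 1.0) for i in range(100_000))
print(canary_users)  # roughly 1,000 of the 100,000 simulated users
```

Because the bucketing is a pure function of the user id, the same users stay in the canary group across requests, which keeps any breakage confined to the same small slice of traffic.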