Even with a large service, you can immediately identify a fluctuation in the number of complaints (in addition to signals from monitoring tools). I'm speaking from experience here. In fact, the larger the service is, the easier it is to statistically identify an uptick in the number of complaints per smaller time interval.
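To make the statistical point concrete, here is a minimal sketch. It assumes complaints arrive roughly as a Poisson process with a known baseline rate per interval, so the standard deviation of the count is the square root of that rate. The function names, the sample rates, and the threshold of 3 are all hypothetical and chosen only for illustration.

```python
import math


def uptick_z_score(baseline_rate: float, observed_count: float) -> float:
    """Z-score of an observed complaint count against a Poisson baseline.

    Assumption: counts per interval are Poisson with mean `baseline_rate`,
    so the standard deviation is sqrt(baseline_rate) (normal approximation).
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (observed_count - baseline_rate) / math.sqrt(baseline_rate)


def is_uptick(baseline_rate: float, observed_count: float,
              threshold: float = 3.0) -> bool:
    """Flag the interval when the count is `threshold` std devs above baseline."""
    return uptick_z_score(baseline_rate, observed_count) >= threshold


# The same 20% relative increase at two different volumes (made-up numbers).
small_service = uptick_z_score(25, 30)        # 25 complaints per interval normally
large_service = uptick_z_score(2500, 3000)    # 2500 complaints per interval normally

print(f"small service z-score: {small_service:.1f}")
print(f"large service z-score: {large_service:.1f}")
```

Under these assumptions, a relative increase of r over a baseline of λ complaints gives a z-score of r·√λ, so the same percentage uptick stands out more clearly at higher volume, or equally clearly in a shorter interval.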
Even with a large service, you can immediately identify a fluctuation in the number of complaints
"Immediately" is a relative term. I would say "1 hour" is pretty much as immediate as it gets at these scales.
I wouldn't be surprised if a significant complaint fluctuation only manifested long after Amazon discovered the problem in their own monitoring.
statistically identify an uptick in the number of complaints per smaller time interval
Yes, but volume is not everything. You also have to qualify (triage) the input, get engineers on the case, confirm the issue, and perhaps get clearance for a public announcement. All the while, many of the key people are busy either trying to figure out what is going on, trying to dispatch information to the right people, or just running around waving their arms furiously.
Do you have any idea how many such complaints Amazon receives on a normal day? Per hour?
Why make every single AWS customer panic for an hour
Diagnosing problems in a big system is not that easy.
A turnaround time of an hour is not too bad for a behemoth the size of Amazon, especially when you consider that this was a worst-case scenario.