
If you rely solely on us-east-1, maybe?


AWS doesn't follow its own advice about multi-region hosting.

When us-east-1 is sufficiently borked, the management API and IAM services in all regions tend to go down with it.

Static infrastructure usually avoids the fallout, but anyone dependent on the API or on dynamically created resources often gets caught in the blast radius regardless of region.


I didn't hear any reports of that happening in the most recent outage. The console was inoperable, but you could work around it by using the regional console hostnames.


If you're referring to Dec 7, it absolutely did. Metrics went down nearly across the board, which also means most auto-scaling setups were non-functional. CloudFront metrics didn't properly recover until the next day.

Logging in with root credentials was not possible in any region, and even logging in with IAM credentials in other regions yielded an intermittently buggy console.

And as is usual with us-east-1 outages, management API calls were a complete crapshoot regardless of region.


There are some services that do have hard us-east-1 dependencies: CloudFront, because of certificates; Route53; the control API for IAM (adding/removing roles, etc.). And there's also the notion of "global endpoints" like https://sts.amazonaws.com... it's not clear why that exists, because it fails when us-east-1 does. It would be better to have only regional endpoints if the "global" ones are region-specific in reality. The endpoint behavior is documented, but it's still confusing to people.

The dependency chains can bite you too. During the us-east-1 outage, a Lambda run on cron-like schedules via EventBridge was itself in an okay state, but the EventBridge events that kick it off were stuck in a queue that was released only when the problem was fixed. So if your Lambda wasn't idempotent and you ran it manually in another region during the outage, you ended up with duplicate executions when the queued events finally drained.
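The usual defense against that kind of delayed redelivery is to dedupe on the event's unique id before doing any work. Here's a minimal sketch, assuming each EventBridge event carries a unique "id" field (as delivered events do); the in-memory set is a stand-in for a durable store such as a DynamoDB table with a conditional put, and the function names are illustrative, not from any AWS SDK:

```python
# Stand-in for a durable idempotency store (e.g. a DynamoDB table
# written with a conditional put on the event id).
_processed_ids = set()

def handle_event(event, side_effect):
    """Run side_effect(event) at most once per event id.

    If the same event is delivered again (e.g. a queued copy released
    after an outage), the duplicate is skipped instead of re-running
    the side effect.
    """
    event_id = event["id"]
    if event_id in _processed_ids:
        return "skipped"          # duplicate delivery: do nothing
    _processed_ids.add(event_id)  # record the id, then do the work
    side_effect(event)
    return "processed"
```

With this shape, the manual run in another region and the delayed EventBridge delivery both resolve to the same event id, so only one of them actually does the work.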


It wasn't reliable. I heard from many more people who weren't able to get in that way than who were, and I was in the former category myself.

We didn't take any downtime, but if anything had gone wrong, there would have been nothing we could do about it until IAM came back up.



