AWS had its worst outage in 10 years. It lasted 16 hours and took down 140 services, including EC2. The author of this post doesn't work at Amazon, but they are an experienced developer who aggregated information from several sources.
DynamoDB suffered a service failure at about 7am. AWS dogfoods much of its own infrastructure, so when DynamoDB went down, so did nearly everything else that depended on it. What was the issue? A race condition in DNS registration.
DynamoDB has an automated DNS load balancing system with three DNS Enactors (appliers of DNS plans), one per availability zone, that operate without any coordination between them. The Enactor in us-east-1a was running very, very slowly, likely 10x-100x slower than normal.
The DNS service prunes old plans automatically via a "keep last N" strategy. The plan that was still being enacted fell outside that window of N, which meant an active DNS plan was deleted. This time-of-check vs. time-of-use (TOCTOU) issue took everything down, and the system had no fallback to repair itself once DNS broke.
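The race above can be sketched in a few lines. This is a hypothetical toy model, not AWS's actual code; names like `KEEP_LAST_N`, `apply_plan`, and `cleanup` are invented for illustration. A slow Enactor checks that its plan is still valid, stalls, and applies it after a faster Enactor has already moved on; the "keep last N" janitor then deletes the plan that is actively serving DNS.

```python
# Toy model of the TOCTOU race (all names illustrative, not AWS code).
KEEP_LAST_N = 2
plans = {1: "records-v1", 2: "records-v2", 3: "records-v3"}
applied_plan = None            # id of the plan currently serving DNS

def apply_plan(plan_id):
    global applied_plan
    applied_plan = plan_id

def cleanup():
    # Prune everything but the N newest plan ids.
    for pid in sorted(plans)[:-KEEP_LAST_N]:
        del plans[pid]

ok_to_apply = 1 in plans       # time of check: plan 1 still exists

apply_plan(3)                  # a fast Enactor applies the newest plan

if ok_to_apply:                # time of use: the check is now stale
    apply_plan(1)              # slow Enactor overwrites with the old plan

cleanup()                      # plan 1 falls outside "keep last N"...
print(applied_plan in plans)   # -> False: the *active* plan was deleted
```

With coordination (or a re-check that the plan being deleted isn't the one currently applied), the janitor would refuse to prune plan 1 while it is live; without it, DNS ends up pointing at records that no longer exist.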
There are two issues here: a TOCTOU bug and a missing check for whether a stale plan was still active. The author mentions the Swiss Cheese model: the more holes in the system's defensive layers, the more likely something slips through all of them. Most of the time, several things need to go wrong at once. This outage lasted about 3 hours, but the damage cascaded further.
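The "several things need to go wrong" intuition has simple arithmetic behind it. A rough sketch, with made-up numbers not taken from the post: if each of k independent defensive layers fails with probability p, all k fail together with probability p**k, which shrinks fast as layers are added.

```python
# Swiss Cheese model arithmetic: probability that every layer fails.
# p is an assumed, illustrative per-layer failure probability.
p = 0.01
for k in (1, 2, 3):
    print(f"{k} layer(s) -> all fail with probability {p ** k:.0e}")
```

The catch, as this outage shows, is that the layers must actually be independent; a shared dependency (like DNS) is one hole that lines up across every slice at once.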
The EC2 service uses DynamoDB for metadata management, so existing instances kept running, but you couldn't create new ones or stop current ones. Even once DynamoDB came back up, EC2 still didn't work. The DropletWorkflow Manager (DWFM) holds a large list of active leases (on the order of 10^6), of which around 10^2 were broken and needed their connections re-established. Heartbeat timeouts then climbed to around 10^5, creating a gigantic queue that led to congestive collapse. This was only fixed by manual intervention.
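The collapse dynamic can be shown with a toy queue simulation (all numbers and names here are illustrative assumptions, not DWFM internals): once the backlog is so deep that every lease times out before it reaches the front of the queue, the work done on it is wasted and it re-enqueues, so useful throughput drops to zero even though the system is running flat out.

```python
# Toy simulation of congestive collapse in a lease re-establishment queue.
# CAPACITY and TIMEOUT are made-up illustrative values.
from collections import deque

CAPACITY = 100      # leases the worker can re-establish per tick
TIMEOUT = 5         # ticks before a queued lease times out and retries

queue = deque((0, i) for i in range(100_000))  # (enqueued_tick, lease_id)
completed = 0

for tick in range(1, 201):
    for _ in range(min(CAPACITY, len(queue))):
        enqueued, lease = queue.popleft()
        if tick - enqueued <= TIMEOUT:
            completed += 1               # served in time: useful work
        else:
            queue.append((tick, lease))  # timed out while queued: retry

print(completed)  # -> 500: only the first TIMEOUT ticks' worth succeed
```

Because the backlog (~10^5) vastly exceeds what can be served within one timeout window (CAPACITY * TIMEOUT = 500 here), goodput collapses to zero after the first few ticks; this is why such states typically need manual intervention (throttling arrivals or draining the queue) rather than just waiting.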
The author's takeaway: software is much buggier than we realize. If AWS, the giant of the cloud industry, has these kinds of bugs lying around, then many more remain to be discovered. Overall, a good post on the outage.