On June 19, 2020, the primary RDS instance that stores JetBrains Space Cloud customer data crashed at 10:38 UTC and recovered at 12:27 UTC. Consequently, no cloud organizations were available during that window.
No data was lost during the incident.
No operations were available for any organization from 10:38 UTC to 12:27 UTC.
The Space team receives alerts from the monitoring system about connectivity issues between Space nodes and the DB. The alerts are acknowledged, and the Space team begins the investigation.
The RDS instance was restarted due to an out-of-memory condition. ECS begins to restart Supervisor service containers as they fail their health checks due to the lost connection to the DB.
Space displays a 500 error page that contains a "502 Bad Gateway" response from the internal Supervisor service. This service was not intended to be a single point of failure but turned out to be one. As a result, the Space Cloud service is completely degraded.
The Supervisor service is not able to start. Internal health check alerts appear. Even though the RDS instance status reads "Available", it refuses all incoming connections.
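The mismatch between the reported status and actual connectivity can be caught by probing the endpoint directly instead of trusting the status field. Below is a minimal sketch (hypothetical function names; host and port are placeholders) of a TCP-level check that only treats an instance as healthy when the status *and* a real connection attempt agree:

```python
import socket

def db_accepts_connections(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True only if a TCP connection to the DB endpoint succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def instance_healthy(status: str, host: str, port: int = 3306) -> bool:
    # An instance can report "Available" while refusing connections,
    # so require both signals before declaring it healthy.
    return status.lower() == "available" and db_accepts_connections(host, port)
```

A health check built this way would have flagged the instance as down even while the RDS console showed it as Available.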
The reason for the restart was an InnoDB crash that caused high ReadIOPS while reading the binary log.
The RDS instance refuses all other DB activity.
The Space team decides to reboot the RDS instance with failover to switch to the second instance in the Multi-AZ deployment.
The RDS instance status is "Available", but it still doesn't accept any incoming connections.
The RDS reboot with failover fails with the following error: "Timed out waiting for a state safe to initiate requested failover."
The Space team decides to reboot the instance without failover.
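The sequence above (attempt a failover reboot, fall back to a plain reboot when the failover times out) can be sketched as simple retry logic. This is an illustrative sketch, not the team's actual tooling: `reboot` stands in for whatever performs the reboot, e.g. a wrapper around boto3's RDS `reboot_db_instance` call with `ForceFailover=True`/`False`, and it is assumed to raise `TimeoutError` when no failover-safe state can be reached:

```python
def reboot_with_fallback(reboot, instance_id: str) -> str:
    """Try a failover reboot first; if the engine is mid-recovery and no
    failover-safe state is reached, fall back to a plain reboot.

    `reboot(instance_id, force_failover)` is caller-supplied and raises
    TimeoutError on "Timed out waiting for a state safe to initiate
    requested failover."
    """
    try:
        reboot(instance_id, True)   # reboot with failover (Multi-AZ)
        return "failover"
    except TimeoutError:
        reboot(instance_id, False)  # plain reboot, no failover
        return "plain"
```

Note that, as the timeline shows, neither path shortens an in-progress InnoDB recovery; the fallback only avoids waiting indefinitely on a failover that cannot start.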
The Space team opens an AWS Support case to bring the RDS instance back to life and to find out what is going on with the recovery. The Support case ID is shared with the AWS team and escalated.
The RDS instance recovers from the failure and starts to accept connections. The Supervisor service reconnects to the DB, and the Space service starts to operate properly.
The RDS instance reboots according to a planned restart requested at 11:20 UTC.
Space service is affected and degraded.
The RDS instance is up, and connections begin to be served.
The Space service recovered from the failure without any data loss.
The RDS instance crashed with an out-of-memory error. It contained the majority of tenants' data as well as the internal organization routing tables. The crash caused maximum impact: tenants running on other, unaffected RDS instances were still unable to access their organizations due to the unexpected routing outage.
The out-of-memory crash was caused by the significant number of organization databases held in memory, which led to a memory leak. This is MySQL behavior that will be addressed by the migration to PostgreSQL, which is currently in progress. Until then, the Space team runs scheduled maintenance reboots of the RDS instance to free up memory. In this particular case, the reboot was scheduled for the night after the incident.
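A reboot-scheduling policy like the one described above can be driven by a simple threshold check on freeable memory (RDS exposes this as the CloudWatch `FreeableMemory` metric). A minimal sketch, with the 10% figure taken from the lessons below and the function name being a hypothetical:

```python
def needs_maintenance_reboot(freeable_bytes: float, total_bytes: float,
                             threshold: float = 0.10) -> bool:
    """Flag an instance for a scheduled maintenance reboot once freeable
    memory drops below `threshold` of total memory (10% by default,
    mirroring the risk level called out in this postmortem)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return freeable_bytes / total_bytes < threshold
```

Checking this on every metrics poll lets the reboot be scheduled for a maintenance window well before memory pressure turns into a crash.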
Surprisingly, the Multi-AZ failover didn't work during the InnoDB recovery. As a result, the service restore time was more than an hour, and it was impossible to avoid or cancel the recovery, even from the AWS side.
(Monitoring chart: high ReadIOPS consumption during the incident.)
- Letting the RDS freeable memory drop below 10% is risky. It's better to schedule a maintenance reboot as soon as possible until a proper fix is applied.
- Manually restarting RDS during a recovery procedure can extend the outage time.
- Don't trust the RDS "Available" state; the instance may still refuse incoming connections.
- Don't let an internal service become a single point of failure.
- Don't run such massive RDS databases; a single instance failure has a vast impact.
- Migrate the Cloud Supervisor table to a dedicated database - Done!
- Migrate organizations' schemas from the shared RDS instance to smaller instances running PostgreSQL.
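One way to soften the single-point-of-failure lesson is to keep serving a stale copy of the organization routing table when the authoritative lookup fails, instead of failing every request. The sketch below is purely illustrative (the class, the `fetch` callable, and the TTL are all assumptions, not the Space team's actual design):

```python
import time

class RoutingCache:
    """Serve the last known routing table when the live lookup fails,
    so a DB outage degrades routing instead of taking it down entirely."""

    def __init__(self, fetch, ttl: float = 60.0):
        self._fetch = fetch      # callable returning the routing table
        self._ttl = ttl          # how long a fetched copy stays fresh
        self._table = None
        self._stamp = 0.0

    def get(self):
        now = time.monotonic()
        if self._table is None or now - self._stamp > self._ttl:
            try:
                self._table = self._fetch()
                self._stamp = now
            except Exception:
                if self._table is None:
                    raise        # no stale copy to fall back on
                # otherwise keep serving the stale table
        return self._table
```

With something like this in front of the routing lookup, tenants on unaffected RDS instances could have kept working through the routing outage described above.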