Incident Report — 19 June 2020

SUMMARY

On June 19, 2020, the primary RDS instance that stores JetBrains Space Cloud customer data crashed at 10:38 UTC and recovered at 12:27 UTC. Consequently, all cloud organizations were unavailable from 10:38 UTC until 12:27 UTC.

No data was lost during the incident.

IMPACT

No operations were available for any organization from 10:38 UTC to 12:27 UTC.

TIMELINE

10:38 UTC
The Space team receives alerts from the monitoring system about connectivity issues between the Space nodes and the DB. The alerts are acknowledged, and the Space team begins the investigation.

10:40 UTC
The RDS instance restarts due to an out-of-memory condition. ECS begins to restart the Supervisor service containers as they fail their health checks because the connection to the DB is lost.
Space displays a 500 error page containing a 502 Bad Gateway response from the internal Supervisor service. This service was not intended to be a single point of failure but turned out to be one. As a result, the Space Cloud service is completely degraded.
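
For illustration only, here is a minimal sketch of a health check endpoint tied to DB reachability, which is the pattern that caused ECS to keep restarting the Supervisor containers. This is not the actual Supervisor code; the host, port, and paths are hypothetical.

    # Illustrative sketch only -- not the actual Supervisor implementation.
    # A /health endpoint that reports unhealthy (503) when the backing DB
    # cannot be reached, so the ECS health check fails and the container
    # gets restarted.
    import socket
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DB_HOST, DB_PORT = "space-db.internal.example", 3306  # hypothetical RDS endpoint

    def db_reachable(timeout: float = 2.0) -> bool:
        """Cheap TCP-level reachability probe against the DB endpoint."""
        try:
            with socket.create_connection((DB_HOST, DB_PORT), timeout=timeout):
                return True
        except OSError:
            return False

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health":
                # ECS treats any non-2xx response as a failed health check.
                self.send_response(200 if db_reachable() else 503)
            else:
                self.send_response(404)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthHandler).serve_forever()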

10:44 UTC
The Supervisor service is not able to start, and internal health check alerts appear. Even though the RDS instance status reads Available, it refuses all incoming connections.

10:50 UTC
The reason for the restart is identified: an InnoDB crash, whose recovery reads the binary log and causes high ReadIOPS.
The RDS instance refuses all other DB activity.

11:05 UTC
The Space team decides to reboot the RDS instance with failover in order to switch to the standby instance of the Multi-AZ deployment.
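
For reference, a reboot with failover of this kind can be triggered through the RDS API. Below is a minimal sketch using boto3; the instance identifier is hypothetical.

    # Sketch: trigger an RDS reboot with failover to the Multi-AZ standby.
    import boto3

    rds = boto3.client("rds")

    response = rds.reboot_db_instance(
        DBInstanceIdentifier="space-cloud-primary",  # hypothetical identifier
        ForceFailover=True,  # promote the Multi-AZ standby instead of rebooting in place
    )
    print(response["DBInstance"]["DBInstanceStatus"])

As the timeline below shows, in this incident the same operation timed out at 11:15 UTC, so the sketch only reflects the normal path.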

11:10 UTC
The RDS instance status is Available, but it still doesn't accept any incoming connections.

11:15 UTC
The RDS reboot with failover fails with the following error: Timed out waiting for a state safe to initiate requested failover.

11:20 UTC
The Space team decides to reboot the instance without failover.

11:49 UTC
The Space team opens an AWS Support case to bring the RDS instance back to life and to find out what is going on with the recovery. The case ID is shared with the AWS team, and the case is escalated.

12:07 UTC
The RDS instance recovers from the failure and starts to accept connections. The Supervisor service reconnects to the DB, and the Space service starts to operate normally.

12:19 UTC
The RDS instance reboots according to the restart that was requested at 11:20 UTC.
The Space service is degraded again.

12:24 UTC
The RDS instance is up, and connections are being served again.

12:27 UTC  
The Space service is fully recovered from the failure, with no data loss.

ROOT CAUSE

The RDS instance crashed with an out-of-memory error. It contained the majority of tenants' data as well as the internal organization routing tables. The crash therefore had maximum impact: tenants hosted on other, unaffected RDS instances were still unable to access their organizations because of the unexpected routing outage.

Memory graph: [figure omitted]

The out-of-memory crash was caused by the large number of organization databases held in memory, which leads to gradual memory growth. This is MySQL behavior that will be addressed by the migration to PostgreSQL, which is currently in progress. Until then, the Space team runs scheduled maintenance reboots of the RDS instance to free up memory; in this particular case, the reboot had been scheduled for the night after the incident.
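
As a rough illustration of the scheduled-reboot mitigation mentioned above, the freeable memory of an instance can be watched through the CloudWatch FreeableMemory metric so that a maintenance reboot is scheduled before memory gets too low. The instance identifier and threshold below are assumptions, not values from this report.

    # Sketch: check the RDS FreeableMemory metric via CloudWatch and flag when
    # it falls below a threshold so a maintenance reboot can be scheduled.
    # Instance identifier and threshold are assumptions.
    from datetime import datetime, timedelta, timezone
    import boto3

    INSTANCE_ID = "space-cloud-primary"   # hypothetical identifier
    THRESHOLD_BYTES = 2 * 1024 ** 3       # e.g. alert below 2 GiB of freeable memory

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)

    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeableMemory",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )

    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    if datapoints and datapoints[-1]["Average"] < THRESHOLD_BYTES:
        print("FreeableMemory is low -- schedule a maintenance reboot")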

Surprisingly, the Multi-AZ failover did not work during the InnoDB recovery. As a result, the service restore time was more than an hour. The recovery could not be avoided or cancelled, even from the AWS side.

High ReadIOPS consumption: [figure omitted]

LESSONS LEARNED

  1. Letting the RDS freeable memory go below 10% is risky. It's better to schedule a maintenance reboot as soon as possible until a proper solution is applied.
  2. Manually restarting RDS during a recovery procedure can extend the outage time.
  3. Don't trust the RDS Available state: the instance may still refuse incoming connections (see the sketch after this list).
  4. Don't let the availability of a single internal service become a single point of failure.
  5. Don't run such massive shared RDS databases; a single failure has too wide an impact.
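
A minimal sketch of the check implied by lesson 3: compare the status reported by the RDS API with an actual connection attempt to the instance endpoint. The instance identifier is hypothetical, and the probe is only TCP-level; a driver-level login would be a stricter check.

    # Sketch for lesson 3: the RDS API status and real connectivity can disagree,
    # so probe both. Instance identifier is hypothetical.
    import socket
    import boto3

    INSTANCE_ID = "space-cloud-primary"  # hypothetical identifier

    rds = boto3.client("rds")
    instance = rds.describe_db_instances(DBInstanceIdentifier=INSTANCE_ID)["DBInstances"][0]
    endpoint = instance["Endpoint"]

    print("API status:", instance["DBInstanceStatus"])  # may read "available"

    try:
        # TCP-level probe of the actual endpoint; during this incident the
        # instance read Available while refusing all incoming connections.
        with socket.create_connection((endpoint["Address"], endpoint["Port"]), timeout=5):
            print("Endpoint accepts connections")
    except OSError as err:
        print("Endpoint refuses connections:", err)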

ACTION POINTS

  1. Migrate the Cloud Supervisor table to a dedicated database - Done!
  2. Migrate organizations' schemas from the shared RDS instance to smaller instances running PostgreSQL.