WellnessLiving server outage analysis: July 19, 2021

On July 18, 2021, at 08:00 UTC, AWS automatically upgraded our databases, which caused instability and unexpected restarts within one of our 30 database servers. These issues resulted in errors within the WellnessLiving site, causing it to be temporarily unavailable. With guidance from AWS Support, we determined that the cause was a bug in the new, upgraded version of the database engine. Once the root cause was identified, we promptly began restoring data to new database instances running an engine version that predates the bug. This involved stopping all services until the majority of the data was successfully transferred and restored.

Next steps

Now that the databases have been restored, any missing data has been retrieved. All queued background tasks have also been completed, and any new tasks submitted (such as auto-payments, state updates, and automated emails) will be completed normally.

Note: If you continue to experience issues using WellnessLiving, we recommend clearing your cache and refreshing your browser.

Incident timeline

Time (UTC) | Summary
Sun Jul 18 08:00 UTC | An automatic database upgrade was started by AWS.
Sun Jul 18 12:27 UTC | Numerous 5xx errors were detected on our website and an investigation was started.
Sun Jul 18 12:33 UTC | A support ticket was opened with AWS as database servers were found to be restarting sporadically.
Sun Jul 18 13:07 UTC | After the initial investigation with AWS, Amazon’s support team advised that escalation was required for their internal teams to investigate further.
Sun Jul 18 15:38 UTC | The incident was further escalated with AWS as database servers began restarting multiple times every hour.
Sun Jul 18 19:50 UTC | The cause of the issue was determined to be an unresolved bug in the new version of the database engine. Because the issue was with the underlying database engine, restoring from database backups was not a possible fix. With guidance from AWS, we promptly began the best immediate solution: a database dump and restore to a new database instance running an engine version that predates the bug.
Mon Jul 19 11:25 UTC | The issue was escalated with AWS as the database dump was taking far longer than discussed.
Mon Jul 19 12:10 UTC | After a review with AWS Support, we were able to decrease the estimated time of completion for the dump process from days to hours.
Mon Jul 19 20:15 UTC | The database dump was completed and database services were restored.
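
For readers who want the elapsed times between the key entries above, the following is a minimal Python sketch that derives them from the timeline. The constant and function names are illustrative only and do not come from the incident record. The intervals are simple differences between timeline entries, not an official measure of customer-facing downtime.

```python
from datetime import datetime, timezone

# Timestamps copied from the incident timeline (all UTC).
UPGRADE_STARTED = datetime(2021, 7, 18, 8, 0, tzinfo=timezone.utc)
ERRORS_DETECTED = datetime(2021, 7, 18, 12, 27, tzinfo=timezone.utc)
ROOT_CAUSE_FOUND = datetime(2021, 7, 18, 19, 50, tzinfo=timezone.utc)
SERVICES_RESTORED = datetime(2021, 7, 19, 20, 15, tzinfo=timezone.utc)


def format_duration(start: datetime, end: datetime) -> str:
    """Return the elapsed time between two timestamps as 'Hh MMm'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Upgrade start to first detected 5xx errors.
time_to_detect = format_duration(UPGRADE_STARTED, ERRORS_DETECTED)
# First detected errors to root cause determination.
time_to_diagnose = format_duration(ERRORS_DETECTED, ROOT_CAUSE_FOUND)
# First detected errors to restored database services.
time_to_restore = format_duration(ERRORS_DETECTED, SERVICES_RESTORED)

print(f"Upgrade to detection:     {time_to_detect}")
print(f"Detection to root cause:  {time_to_diagnose}")
print(f"Detection to restoration: {time_to_restore}")
```

Timezone-aware datetimes are used so that the subtraction stays correct across the date change between July 18 and July 19.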

Root cause

  • A bug in the new version of the database engine, installed by the automatic upgrade on July 18, 2021, at 08:00 UTC, caused our database servers to restart every few minutes, which in turn made the WellnessLiving site unavailable.

Resolution

  • The issue was resolved by migrating all data to new database servers running an engine version that predates the bug.

Preventative measures

  • We have disabled automatic version upgrades managed by AWS for our database servers.
  • We have integrated database engine upgrades into our testing process, and upgrades will be performed manually going forward.
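
As an illustration of the first measure, here is a minimal Python sketch of how automatic minor version upgrades could be switched off for a set of instances. It rests on assumptions not stated in this report: that the databases run on Amazon RDS and that the change is made through the boto3 `modify_db_instance` call with `AutoMinorVersionUpgrade=False`. The function name and the instance identifiers are hypothetical, and this is not a description of the procedure WellnessLiving actually followed.

```python
def disable_auto_minor_upgrades(rds_client, instance_ids):
    """Turn off automatic minor version upgrades for each given instance.

    Assumption: rds_client is a boto3 RDS client (or any object exposing a
    compatible modify_db_instance method). Returns the list of instance
    identifiers that were updated.
    """
    updated = []
    for instance_id in instance_ids:
        rds_client.modify_db_instance(
            DBInstanceIdentifier=instance_id,
            AutoMinorVersionUpgrade=False,  # stop AWS-managed minor upgrades
            ApplyImmediately=True,          # do not wait for the maintenance window
        )
        updated.append(instance_id)
    return updated
```

With boto3 installed and AWS credentials configured, a call might look like `disable_auto_minor_upgrades(boto3.client("rds"), ["example-db-1", "example-db-2"])`, where the identifiers are placeholders. Passing the client in as a parameter keeps the function easy to exercise against a stub before it is pointed at real infrastructure.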