This is a more detailed timeline and post mortem for the incident occurring Feb 22, 2017. The incident manifested as a failure to publish new predictions on incoming tickets to customers' Salesforce and Zendesk orgs. The problem arose when our main production database became largely unresponsive, with CPU pegged near maximum. The root cause is related to automated DB maintenance that AWS kicks off for managed Postgres DBs. These activities (such as autovacuuming) occur all the time without incident; today, that was not the case. For most clients, the total time we were not predicting on new tickets was about 1 hour 50 minutes. We apologize for any disruptions this may have caused.
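For readers curious how a runaway autovacuum might be spotted, here is a minimal sketch. It filters rows shaped like the output of Postgres's pg_stat_activity view (the simplified tuple layout, column order, and the 300-second threshold are our assumptions for illustration, not the query we actually ran during the incident):

```python
# Minimal sketch: flag long-running autovacuum workers from rows
# resembling pg_stat_activity output. Each row is a simplified
# (pid, query, runtime_seconds) tuple -- an assumption for this example.

def long_autovacuums(rows, threshold_s=300):
    """Return rows whose query is an autovacuum running past threshold_s."""
    return [r for r in rows
            if r[1].startswith("autovacuum:") and r[2] > threshold_s]

# Hypothetical rows as they might look during an incident like this one.
sample = [
    (4311, "autovacuum: VACUUM public.tickets", 5400),
    (4322, "SELECT id FROM predictions LIMIT 10", 2),
    (4330, "autovacuum: VACUUM ANALYZE public.events", 120),
]

for pid, query, secs in long_autovacuums(sample):
    print(f"pid={pid} running {secs}s: {query}")
```

In practice the same filter can be expressed directly in SQL against pg_stat_activity; the point here is only that autovacuum backends are identifiable by their query text and runtime.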
Timeline (all times Pacific):
11:17am Initial internal alerts of slowness for some customers sent to Wise Ops channels
11:18am Investigation begins
11:28am Communication to some affected clients
11:32am Issue localized to problem with production database
11:50am Issue identified as connected to AWS maintenance routines
11:55am Mitigation steps fail to bring down CPU/load to normal levels
11:56am DB backup restore process initiated
1:05pm DB backup restore complete
1:08pm New DB connected to running app/middleware
1:09pm New tickets fetched and predicted for all customers
1:25pm Backfill of older tickets completed for the majority of customers
1:30pm Update email sent to all clients
We continue to monitor the health of the new DB and the backfill operations still in progress for our larger customers. This incident will be set to resolved once all backfills are complete.
Feb 22, 2017 - 18:12 PST
We are recovering from a problem that arose as our transactional database became largely unresponsive. Predictions (and actions) for new tickets for all customers should be working as normal. Older tickets (before noon today Pacific) are currently getting backfilled and some predictions for those tickets will not be available until the backfill jobs have completed. We will post more detailed information about the outage to this incident.