At 8:00 UTC we detected an issue with scheduled backups not triggering. We quickly determined that one of the vendor libraries we use to process scheduled backups was crashing. At 10:30 UTC we rolled out a workaround, followed soon after by a permanent fix.
Users with scheduled backups set to run on March 15 between roughly midnight UTC and 10:30 UTC were affected. The scheduler skipped these events, so we recommend triggering them manually if you need them.
Manual backups have not been affected.
What we are doing to prevent this from happening again
This issue was an edge case stemming from the RabbitMQ failure on March 12. Now that we’ve propagated the update across the whole server infrastructure (and triggered this edge case, where the update is incompatible with the vendor library), we can finally say that the RabbitMQ issue is over, just like John McClane’s marriage.
At 6:25 UTC we started getting user feedback about manual backups getting stuck in the queue. Our developers isolated the cause: a piece of vendor software called RabbitMQ had stopped consuming messages, effectively freezing the job queue globally.
After several reboots and a client update we finally got RabbitMQ working. But then a plot twist hit: as we spun up another 40 instances to deal with the backup backlog, RabbitMQ crashed under the sheer number of requests. At that point we decided to put ManageWP into maintenance mode until we could stabilize the service.
At 14:07 UTC we rebooted RabbitMQ again, throttled the requests, and brought the ManageWP dashboard back online.
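The post doesn’t say how the throttling was implemented, but the general idea when draining a backlog without overwhelming a broker is to rate-limit the producers. A minimal sketch of that idea, as a token-bucket limiter (the class and parameter names are illustrative, not from ManageWP’s actual code):

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allow at most `rate` requests per
    second on average, with bursts of up to `capacity` requests."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start with a full bucket
        self.clock = clock               # injectable clock for testing
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, consuming one token."""
        now = self.clock()
        # Refill tokens proportionally to the time elapsed since last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A worker draining the backup backlog would call `allow()` before publishing each job and back off briefly when it returns False, keeping the request rate to the broker bounded no matter how many instances are spun up.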
The bug was global and affected everyone who logged into the service in that time frame. Scheduled tasks that should have run during the outage will be requeued and will run in the next few hours. Manual tasks that were frozen in the queue have been deleted; you will need to run them again.
Unfortunately, there’s not much else we could have done to affect the outcome. The vendor software that failed had run reliably for the past four years. When it started failing, we got a notification and reacted accordingly.
At 11:20 UTC an application bug that started as a random background-service failure began exhausting PHP’s connection resources. Web servers and data servers were unable to accept connections or create new ones, effectively blocking the ManageWP service. At 13:30 UTC the bug was fixed.
Everyone who logged into their dashboard in that time frame was affected. No data was lost, apart from some queued jobs we couldn’t process (updates, syncs, etc.).
Aside from enforcing a “make love, not bugs” policy, we’re adding a New Relic alert that will notify us about a bug before it gets to a stage where it can take the service down.
At 20:00 GMT we detected a bug in the Safe Updates queue. In certain scenarios the system delays an update by 5 seconds, reducing load and preventing the update from triggering while a backup is being made (among other things). The bug prevented the delayed update from re-entering the queue, leaving it in limbo. This in turn resulted in an endless queue in the front end of the dashboard.
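The limbo described above happens when a delayed job has no guaranteed path back into the queue. One common way to avoid it is to keep delayed jobs in a single time-ordered structure, so a delayed job can only become due, never lost. A small sketch of that pattern (the `DelayQueue` class and job names are hypothetical, not ManageWP’s actual implementation):

```python
import heapq


class DelayQueue:
    """Minimal delay queue: jobs pushed with a delay stay in a single
    time-ordered heap and become visible once their due time passes,
    so a delayed job cannot be dropped into limbo."""

    def __init__(self):
        self._heap = []  # entries are (due_time, sequence, job)
        self._seq = 0    # tie-breaker preserves insertion order

    def push(self, job, now, delay=0.0):
        """Schedule `job` to become due `delay` seconds after `now`."""
        heapq.heappush(self._heap, (now + delay, self._seq, job))
        self._seq += 1

    def pop_due(self, now):
        """Return all jobs whose delay has elapsed, in due order."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```

With this shape, the 5-second delay is just a later due time on the same queue; a worker polling `pop_due()` picks the update up automatically once the delay elapses.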
The exact number is unknown, since some updates were affected while others worked as usual. A rough estimate is that around 10% of users experienced at least one delayed update.
While we are still not certain what caused this bug, we’ve built a workaround that ensures it does not happen again. We will not rest until we get to the bottom of this, no matter what it takes, Dana. The truth is out there.
At 8:30 GMT a bug in the ManageWP server back end triggered a high volume of notifications sent to the server database. This in turn caused the server to become unstable. By 10:00 GMT we had fixed the bug and restored the service.
People logging in between 8:30 and 10:00 GMT experienced intermittent glitches: some could not log in at all, getting a 502 error; others saw an occasional error message on the dashboard, but were otherwise able to manage their websites.
Diligence, diligence, diligence. We’re constantly ramping up our efforts to test the code we push live. As a result, we’re catching bugs that would otherwise go undetected. Some bugs will inevitably sneak into production, though. It’s up to us to fix them ASAP, unless we want to inadvertently cause a machine uprising. And I’m not talking about the good kind, like The Matrix, but the Maximum Overdrive kind, with Emilio Estevez.