We built a sophisticated system to update the Worker plugin, and it’s been running for over a year. We basically redesigned the way WordPress updates are done: our server verifies each file one by one, and updates the outdated files. We also have failsafes in place that rebuild the latest stable version if something happens during the update. It’s incremental, and it is very robust, making it virtually impossible for a Worker update to cause a permanent sync loss.
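To make the per-file idea concrete, here is a minimal Python sketch. It is an illustration only, not the actual Worker update code: the `manifest` format (relative path mapped to a SHA-256 digest) and the function names are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_outdated(plugin_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that are missing or whose digest differs from the manifest.

    `manifest` (hypothetical format) maps a relative file path to the
    SHA-256 digest that file should have in the latest stable version.
    """
    outdated = []
    for rel_path, expected in sorted(manifest.items()):
        target = plugin_dir / rel_path
        if not target.is_file() or file_digest(target) != expected:
            outdated.append(rel_path)
    return outdated
```

Only the files returned by `find_outdated` would need to be replaced, which is what makes an update of this kind incremental.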
The reason we built this system is out of necessity: a failed Worker update puts you in panic mode, scrambling to log into that website and reconnect your website manually. We will go into more details in a separate article, and here we will focus on the mess we made.
On Thursday, Oct 26, 14:00 GMT we pushed an update for the websites hosted on the Pro Managed WordPress hosting platform – they needed a quick update. The update was strictly related to the Pro Managed WordPress hosting platform, and did not affect anyone who isn’t hosted on it. Unfortunately, we soon noticed that the automated update was not getting all websites to the latest version. This, coupled with the dashboard being hardcoded to not show Worker updates, caused a lot of confusion for you. Hiding the Worker update was a change we made recently, so that we can be fully responsible for making sure all websites are connected to the dashboard, secure, and running the latest stable version of the Worker plugin.
We got the update system back online after a while, but to keep the load down it took a few days to get everyone to the latest version.
On Sunday, Oct 30, we realized that the digest emails were also reporting Worker updates, further adding to the confusion.
On Monday, Oct 31, we resolved the update issue, and almost all users have been automatically updated to the latest version. The few remaining users will be automatically updated a couple of hours after they log in and sync their websites.
It was a global issue. However, it’s worth noting that, apart from all of the confusion and uncertainty, there was no loss in connectivity or degradation of service in any shape or form.
Frankly, this was a stupid move on our part. We knew that there was no threat, but we weren’t transparent enough to keep you in the loop, and make sure you don’t freak out. The lack of information caused you to worry about this more than you needed to. That’s why we will soon publish an article that walks you through the technology powering the Worker plugin update. We will also make sure that if something does go wrong in the future, you get a timely notification.
We consider you our partners, and you deserve nothing less.
A bug in the code used by the Collaborate feature caused a 45-minute downtime, between 13:30 and 14:15 GMT. Users who are also someone else’s collaborators could not log in during that time. We partially resolved the login issue, so these users could log in as collaborators, but not into their own accounts. Two hours later we resolved that issue as well.
Or so we thought.
At 06:05 GMT the next morning the traffic increased, and a couple of instances were activated to handle it. Unfortunately, these instances had the same bug we had fixed the day before, and logged out some of the collaborators that hit them. We went through each of these instances and fixed them.
Only users who have the option of logging in as collaborators on other people’s accounts were affected. Regular users did not experience any loss of connectivity.
We are putting up safeguards to prevent creation of new instances with bad configuration. We are also doing a deep dive into the issue to find out how the bug initially appeared and escaped detection so far.
As part of our ongoing fight to keep our code clean and performant, we are constantly refactoring old code to keep it up to standards.
This particular issue happened during the refactor of the authentication logic that is run when a ManageWP Classic user is authenticated in Orion, the current version of our dashboard. Instead of authenticating the user only once and then reusing the cookie for subsequent requests, the regression introduced as part of the refactor triggered the re-authentication code in ManageWP Classic on subsequent requests as well.
This in turn increased the load on the ManageWP Classic database.
Because no MySQL connections were left, the ManageWP Classic authentication endpoint started returning 502 status codes, which triggered an exception in Orion that logs the user out.
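The difference between authenticating once and authenticating on every request can be sketched in a few lines of Python. This is a simplified illustration, not the Orion code: `SessionCache`, `token_for` and the `authenticate` callback are hypothetical names, and a real session store would also need expiry and invalidation.

```python
from typing import Callable, Dict


class SessionCache:
    """Authenticate a user once, then reuse the stored session token."""

    def __init__(self, authenticate: Callable[[str], str]) -> None:
        # `authenticate` stands in for the expensive call to the legacy backend.
        self._authenticate = authenticate
        self._tokens: Dict[str, str] = {}
        self.backend_calls = 0  # counts how often the backend was actually hit

    def token_for(self, user_id: str) -> str:
        # Only the first request per user reaches the backend;
        # later requests are served from the stored token.
        if user_id not in self._tokens:
            self.backend_calls += 1
            self._tokens[user_id] = self._authenticate(user_id)
        return self._tokens[user_id]
```

If the `if user_id not in self._tokens` check is lost, every request turns into a backend call, and the load on the backend database grows with the number of requests rather than the number of users.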
Users who were on ManageWP Classic before transferring to Orion and who logged in on August 10, 2017, between 18:11h and 18:30h GMT experienced random logouts from their dashboard.
We have added additional metrics that we are tracking and using for alerting so we can react faster. We have also improved our code review process in order to catch mistakes like this before they are deployed to production.
In order to better track the Safe Updates progress, we launched the progress bar on the dashboard. It uses a polling endpoint that reports progress to the backend. Unfortunately, this went live with suboptimal performance; it created too many queries that overloaded our servers. As soon as we pinpointed the issue, we rolled out a hotfix and the service was restored to normal.
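One common way to keep a polling endpoint from hammering the database is to cache each progress value for a few seconds. The Python sketch below shows that general technique; it is not a description of the hotfix we shipped, and `ProgressCache`, `load_progress` and the five-second default are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Tuple


class ProgressCache:
    """Serve repeated progress polls from memory for a short time window."""

    def __init__(
        self,
        load_progress: Callable[[str], int],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_progress  # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get(self, job_id: str) -> int:
        now = self._clock()
        cached = self._entries.get(job_id)
        # Reuse the stored value while it is younger than the TTL.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._load(job_id)
        self._entries[job_id] = (now, value)
        return value
```

With a cache like this, the number of database queries depends on the number of running jobs and the TTL, not on how many browser tabs are polling.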
Users logging in on July 20, 2017, between 13h and 15h GMT, experienced 502 errors in the first 30 minutes, followed by a period when users could log in, but would receive intermittent error popups.
Additional checks are being added to make sure this kind of issue is noticed and prevented before it goes live. Next time we will be ready, because winter is coming.
ManageWP rotates the keys used for synchronization (communication with websites added to the ManageWP dashboard) at regular intervals. This is done primarily for security purposes, so even if an unauthorized person somehow gained access to those keys, the keys would likely expire before any malicious or unauthorized action could be performed against those websites.
On April 11, one of the queries that writes new key data to our database failed, and the exception wasn’t handled properly. As a consequence, the entity manager instance associated with the persistence context shut down, effectively preventing any further write commands.
This issue was rectified as soon as it was discovered, but unfortunately it caused 4,200 websites under our management to go out of sync, i.e. use new keys, while old keys remained in our databases. As a result, communication to them was disrupted, which manifested as those sites being disconnected from users’ dashboards. To fix this, you have to deactivate/reactivate the ManageWP Worker plugin, and click the Reconnect website button on your ManageWP dashboard. And if you’re hiding the Worker plugin, you have to log in via FTP and rename the wp-content/plugins/worker folder, in order to deactivate the plugin and unhide it.
We are making several changes that will prevent this kind of behavior in the future: failsafes that will limit the damage to a single website, and better logging that will allow us to resolve the sync issue without your involvement.
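The idea of limiting the damage to a single website can be sketched as a rotation loop that isolates each site’s failure. This Python example is an assumption-laden illustration, not our production code: `rotate_keys` and the `new_key`, `save_key` and `push_key` callbacks are hypothetical, and the ordering (save first, then push) is one possible choice.

```python
from typing import Callable, Dict, List


def rotate_keys(
    site_ids: List[str],
    new_key: Callable[[str], str],
    save_key: Callable[[str, str], None],
    push_key: Callable[[str, str], None],
) -> Dict[str, str]:
    """Rotate keys site by site; return a map of failed site id -> error text."""
    failures: Dict[str, str] = {}
    for site_id in site_ids:
        try:
            key = new_key(site_id)
            save_key(site_id, key)  # persist first, so the site never gets an unsaved key
            push_key(site_id, key)  # only reached if the write succeeded
        except Exception as exc:  # broad on purpose: isolate any per-site failure
            failures[site_id] = str(exc)
    return failures
```

A failed write for one website is recorded and the loop moves on, so the remaining websites still get rotated. A push that fails after a successful save would still need its own handling, for example by keeping the previous key valid until the website confirms the new one.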
We apologize deeply for the service disruption you experienced. We are aware of how important this service is to you – our customers and your businesses. We will do everything we can to learn from this event and use it to improve our stability even further.