A bug in the code used by the Collaborate feature caused 45 minutes of downtime, between 13:30 and 14:15 GMT. Users who are also someone else’s collaborators could not log in during that time. We then partially resolved the login issue: these users could log in as collaborators, but not into their own accounts. Two hours later we resolved that issue as well.
Or so we thought.
At 06:05 GMT the next morning, traffic increased and a couple of new instances were activated to handle it. Unfortunately, these instances still had the bug we had fixed the day before, and they logged out some of the collaborators who hit them. We went through each of these instances and fixed them.
Only users who have the option of logging in as collaborators on other people’s accounts were affected. Regular users did not experience any loss of connectivity.
We are putting up safeguards to prevent creation of new instances with bad configuration. We are also doing a deep dive into the issue to find out how the bug initially appeared and escaped detection so far.
As part of our ongoing fight to keep our code clean and performant, we are constantly refactoring old code to keep it up to standards.
This particular issue happened during a refactor of the authentication logic that runs when a ManageWP Classic user is authenticated in Orion, the current version of our dashboard. Instead of authenticating the user only once and then reusing the cookie for subsequent requests, the regression introduced by the refactor triggered re-authentication in ManageWP Classic on subsequent requests as well.
This, in turn, increased the load on the ManageWP Classic database.
Because no MySQL connections were left available, the ManageWP Classic authentication endpoint started returning 502 status codes, which triggered an exception in Orion that logs the user out.
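The intended behavior described above (authenticate once, reuse the cookie, and do not log the user out on a transient upstream error) can be illustrated with a minimal Python sketch. This is not the actual Orion code: `SessionCache`, `UpstreamUnavailable`, the `authenticate` callable, and the TTL value are all hypothetical names and assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class UpstreamUnavailable(Exception):
    """Hypothetical error: the legacy auth endpoint answered with a 5xx status."""


@dataclass
class SessionCache:
    # Stand-in for the call to the legacy authentication endpoint (assumption).
    authenticate: Callable[[str], str]
    ttl_seconds: float = 900.0
    clock: Callable[[], float] = time.monotonic
    _cookies: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def cookie_for(self, user_id: str) -> str:
        now = self.clock()
        cached = self._cookies.get(user_id)
        # Reuse the cookie while it is still fresh: no upstream call at all.
        if cached is not None and cached[1] > now:
            return cached[0]
        try:
            cookie = self.authenticate(user_id)
        except UpstreamUnavailable:
            # A transient 5xx is not proof that the session is invalid,
            # so fall back to the last known cookie instead of logging out.
            if cached is not None:
                return cached[0]
            raise
        self._cookies[user_id] = (cookie, now + self.ttl_seconds)
        return cookie
```

Serving a stale cookie during an outage is a design trade-off, not a universal rule; a real system would likely cap how long a stale session may be reused.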
Users who were on ManageWP Classic before transferring to Orion and who logged in on August 10, 2017, between 18:11 and 18:30 GMT, experienced random logouts from their dashboard.
We have added additional metrics that we are tracking and using for alerting so we can react faster. We have also improved our code review process in order to catch mistakes like this before they are deployed to production.
To better track Safe Updates progress, we launched a progress bar on the dashboard. It used a polling endpoint that reported progress to the backend. Unfortunately, it went live with suboptimal performance: it created too many queries, which overloaded our servers. As soon as we pinpointed the issue, we rolled out a hotfix and the service was restored to normal.
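One common way to keep a polling endpoint from multiplying database queries is to cache each progress reading for a short time, so that many polls share one query. The sketch below shows that general technique in Python; it is not the hotfix that was actually deployed, and `ProgressCache`, `load_progress`, and the two-second window are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ProgressCache:
    """Serve repeated polls for the same job from memory for a short window."""

    def __init__(
        self,
        load_progress: Callable[[str], int],  # hypothetical: runs the real query
        max_age_seconds: float = 2.0,         # assumed freshness window
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_progress
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[int, float]] = {}

    def get(self, job_id: str) -> int:
        now = self._clock()
        hit = self._entries.get(job_id)
        # A recent reading is good enough for a progress bar.
        if hit is not None and now - hit[1] < self._max_age:
            return hit[0]
        value = self._load(job_id)
        self._entries[job_id] = (value, now)
        return value
```

With this in place, a burst of polls inside the freshness window costs a single query per job, at the price of a progress bar that may lag by up to that window.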
Users logging in on July 20, 2017, between 13:00 and 15:00 GMT, experienced 502 errors in the first 30 minutes, followed by a period when they could log in but would receive intermittent error popups.
Additional checks are being added to make sure this kind of issue is noticed and prevented before it goes live. Next time we will be ready, because winter is coming.
ManageWP rotates the keys used for synchronization (communication with websites added to the ManageWP dashboard) at regular intervals. This is done primarily for security purposes: even if an unauthorized person somehow gained access to those keys, the keys would likely expire before any malicious or unauthorized action could be performed against those websites.
On April 11, one of the queries that writes new key data to our database failed, and the exception wasn’t handled properly. As a consequence, the entity manager instance associated with the persistence context shut down, effectively preventing any further write commands.
This issue was rectified as soon as it was discovered, but unfortunately it caused 4,200 websites under our management to go out of sync: they started using new keys while the old keys remained in our database. As a result, communication with them was disrupted, which manifested as those sites being disconnected from users’ dashboards. To fix this, you have to deactivate and reactivate the ManageWP Worker plugin, and then click the Reconnect website button on your ManageWP dashboard. If you’re hiding the Worker plugin, you have to log in via FTP and rename the wp-content/plugins/worker folder in order to deactivate the plugin and unhide it.
We are making several changes that will prevent this kind of behavior in the future: failsafes that will limit the damage to a single website, and better logging that will allow us to resolve the sync issue without your involvement.
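The idea of limiting the damage to a single website can be sketched as a loop that handles each site's key write in isolation, so one failed write does not stop the writes that follow. This is a simplified Python illustration under stated assumptions, not the actual ManageWP implementation: `rotate_keys` and the `write_key` callable are hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def rotate_keys(
    site_ids: Iterable[str],
    write_key: Callable[[str], None],  # hypothetical: persists one site's new key
) -> Dict[str, List[str]]:
    """Rotate keys site by site, isolating failures to the site that caused them."""
    rotated: List[str] = []
    failed: List[str] = []
    for site_id in site_ids:
        try:
            write_key(site_id)
        except Exception:
            # Record the failure and continue; the remaining sites
            # still get their keys written.
            failed.append(site_id)
        else:
            rotated.append(site_id)
    return {"rotated": rotated, "failed": failed}
```

The returned `failed` list is what makes follow-up possible: the affected sites are known by name and can be retried or repaired without guessing.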
We deeply apologize for the service disruption you experienced. We are aware of how important this service is to you, our customers, and to your businesses. We will do everything we can to learn from this event and use it to improve our stability even further.