Our queueing system was under siege once again. The strain of too many requests and queued processes finally slowed the system to a halt. Since emptying the queue and increasing resources was only a stopgap measure, our dev team built a completely separate queueing instance for the services that are generating the increased number of requests.
As a consequence, some of the scheduled backups were performed after the situation was resolved.
Only some users were affected initially, but after the system slowed down significantly it was put into maintenance mode until the new solution was deployed.
As a short-term solution, we separated the queueing server instances for some of our services. In parallel, we are starting work on a long-term solution that involves completely rebuilding how some of our services communicate with the scheduling system.
A glitch in the system prevented some processes from finishing, which caused them to return to the wait queue, where they opened a new process. After a while, this opened many connections and ‘clogged’ one of the servers in the cluster. The glitch was resolved, the server was restarted, and the system started behaving normally.
No data was lost and all the processes that were scheduled to run were successfully run after the server was restored to normal working order.
Only some of the users on the affected server were impacted by this disruption, but the system was down while we restarted the server to return it to working order.
Our dev team identified a bug in the code that created the glitch and fixed it before deploying the solution and restarting the server instance.
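The failure mode described in this incident, where unfinished processes return to the wait queue and start again, can be illustrated with a bounded-retry guard. This is a minimal sketch and not the actual fix our dev team deployed; the names `drain` and `MAX_ATTEMPTS` and the limit of three attempts are assumptions made for the example.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical limit, not a value taken from the incident


def drain(queue, run, max_attempts=MAX_ATTEMPTS):
    """Run every job in the queue, re-queueing failures a bounded number of times.

    `queue` is a deque of (job, attempts) pairs; `run(job)` returns True on success.
    Returns the jobs that were abandoned after `max_attempts` failed tries.
    """
    abandoned = []
    while queue:
        job, attempts = queue.popleft()
        if run(job):
            continue
        if attempts + 1 >= max_attempts:
            # Stop retrying instead of returning to the queue forever.
            abandoned.append(job)
        else:
            queue.append((job, attempts + 1))
    return abandoned
```

With a cap like this, a job that can never finish is tried a fixed number of times and then set aside, so it cannot keep opening new processes and connections.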
We had some small issues with our services that caused a bit of a slowdown. After some adjustments, things should be back to normal. Data integrity is not compromised.
Our dev team investigated and found a large increase in requests on one of our shards that impacted system stability for the users on that particular shard. They scaled up the system resources and that stabilized the system. It was only a temporary measure, as the number of requests kept increasing. In parallel, another team traced the root of the problem. It took us a couple of hours due to the huge amount of incoming requests, but ultimately they identified a set of IP addresses from which the requests were coming. After all of the identified IPs were banned, the system slowly stabilized.
Only some of the users on the affected shard were impacted by this disruption.
We have our 24/7 on-call teams ready to jump in if a similar situation arises in the future.
At 16:00 CEST (06:00 PST), we will have a short maintenance period in order to upgrade parts of our architecture. We expect the maintenance period to last around 30 minutes, during which the ManageWP dashboard will not be accessible.
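Tracing a request spike back to a set of IP addresses usually comes down to counting requests per source. The sketch below shows the idea only; it is not the tooling our team used, and the function name `noisy_ips` and the threshold are assumptions for the example.

```python
from collections import Counter


def noisy_ips(request_ips, threshold):
    """Return the IPs that appear more than `threshold` times, busiest first."""
    counts = Counter(request_ips)
    return [ip for ip, n in counts.most_common() if n > threshold]
```

For example, given a list with one source IP per logged request, `noisy_ips(ips, 1000)` would list every address that sent more than 1000 requests, which can then be reviewed before banning.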
We observed communication issues between the newly upgraded component and the rest of the system. A fix has been deployed that resolved the issue. All systems are working normally now.
Essentially, we had to deal with more than one issue.
With modifications and increased resources, everything was put back to working order on Friday/Saturday (June 28/29), with our team closely monitoring the situation.
All of our users experienced some system slowness and instabilities (some actions could not be performed) as well as system downtime while we deployed the changes.
We’ve increased the number of secondary instances and database slaves so we can cope better in case of similar outages. But there is very little we can do if our main platform providers have issues.
We’ve also identified several potential areas for improvement, and we will address them in the next couple of weeks in order to further improve our performance.
Most of the US Northeast was impacted by issues that originated from Verizon and partially disrupted the AWS and Cloudflare networks. This in turn affected our services. Our dev teams responded quickly and started working on restoring the system to working condition, but it took a couple of hours to restore everything once the disruption was resolved.
How many users were affected
Service was unavailable during the affected hours.
Since this was an issue beyond our control, all we can do is have a team in place to respond if something unexpected like this happens again.
We had an unexpected increase in load on one of our database shards. This happened due to a change we made to the new Worker plugin. We had to go into maintenance mode to resolve this extra load and get the system back into working order. The entire operation lasted less than 35 minutes.
How many users were affected
Service was unavailable during the maintenance.
This is the reason we have a team on standby: to handle unexpected situations just like this one.
On Friday, November 30, our primary server shard went down. This led to dashboard sync issues when our secondary systems took over. Needless to say, redundancy systems performed their backup function well in this case.
However, over the weekend we detected that the system struggled to keep up with the increased workload. That is why on Monday morning we decided to shut down the system for maintenance and a server upgrade.
During the 30-minute downtime, the server resources were scaled up to handle the increased workload. The system was fully restored, and after a few hours the queue was empty of scheduled tasks.
How many users were affected
All users that used the dashboard over the weekend were affected.
We are increasing the resources on our primary and secondary systems to make sure they perform well in case of a similar occurrence in the future.
On Saturday, November 3, the producer that is responsible for sending the schedule list to the queue went down. The issue was detected on Sunday, November 4. The producer was restored and the system returned to normal working order. The issue has been resolved, and the backups will continue to run normally now.
The backups that were scheduled to run on those two days were delayed, but were all run successfully once the producer was brought back up.
Only the users that had scheduled backups for November 3 and 4 were affected.
We are increasing redundancy for the secondary systems and changing the logic so something like this doesn’t happen again.
On Wednesday, August 15, the machine that is responsible for starting scheduled backups went down. The issue was detected on August 17; the machine was restarted and the backups were run.
Less than 1% of websites were affected, which is why it took us so long to detect the issue.
We allocated additional machines for this task and improved tracking of these machines to detect outages.
On Saturday, August 11, we had an issue with the worker machines that are responsible for processing our SEO Ranking results. The majority of the machines ended up in an infinite loop, so the huge spike in processing we have during the weekends wasn’t being processed fast enough. Unfortunately, our alerting thresholds weren’t set up correctly, so we didn’t notice any of this until it was too late. Because we use a 3rd party API, we couldn’t just throw more machines at the load, so we had to start shedding it. This means that some of the keywords weren’t processed, so some results might be missing. This weekend might have additional keywords missing because of late processing.
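One common way to track machines like these is a heartbeat check: each machine reports in periodically, and anything silent for too long is flagged. The sketch below illustrates that idea under stated assumptions; it is not a description of our actual monitoring, and `stale_machines` and the 15-minute window are hypothetical.

```python
from datetime import datetime, timedelta


def stale_machines(last_heartbeat, now, max_silence=timedelta(minutes=15)):
    """Return the names of machines whose last heartbeat is older than `max_silence`.

    `last_heartbeat` maps a machine name to the datetime it last reported in.
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > max_silence
    )
```

A check like this, run on a schedule, would raise an alert within minutes of a scheduler going quiet, regardless of how few websites depend on it.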
Almost all of the users that use the SEO Ranking addon.
Alerting thresholds have been adjusted and the rework of this part of the system has been scheduled.
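Load shedding, as used in this incident, means processing only as much work as a fixed budget allows and dropping the rest. A minimal sketch, assuming a simple per-run budget imposed by the 3rd party API; the function name `shed_load` and the first-come ordering are assumptions, not a description of how our workers choose keywords.

```python
def shed_load(pending, budget):
    """Split `pending` into (to_process, shed) given a fixed processing budget."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return pending[:budget], pending[budget:]
```

Keeping the shed items, rather than discarding them silently, makes it possible to report which results are missing or to process them later.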