We had an unexpected increase in load on one of our database shards, caused by a change we made to the new Worker plugin. We had to go into maintenance mode to resolve the extra load and get the system back into working order. The entire operation lasted less than 35 minutes.
How many users were affected
Service was unavailable during the maintenance.
This is the reason we have a team on standby: to handle unexpected situations just like this one.
On Friday, November 30, our primary server shard went down. This led to dashboard sync issues when our secondary systems took over. The redundancy systems performed their backup function well in this case.
However, over the weekend we detected that the system was struggling to keep up with the increased workload. That is why on Monday morning we decided to shut down the system for maintenance and a server upgrade.
During the 30-minute downtime, the server resources were scaled up to handle the increased workload. The system was fully restored, and after a few hours the queue of scheduled tasks was empty.
How many users were affected
All users who used the dashboard over the weekend were affected.
We are increasing the resources on our primary and secondary systems to make sure they perform well if a similar situation occurs in the future.
On Saturday, November 3, the producer that is responsible for sending the schedule list to the queue went down. The issue was detected on Sunday, November 4. The producer was restored and the system returned to normal working order. The issue has been resolved, and the backups will continue to run normally now.
The backups that were scheduled to run on those two days were delayed, but all ran successfully once the producer was brought back up.
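A minimal sketch of this kind of catch-up, in Python, assuming a simple mapping of sites to scheduled times. The function name `overdue_backups` and the data shapes are hypothetical and are not taken from the actual producer.

```python
from datetime import datetime


def overdue_backups(scheduled, completed, now):
    """Return sites whose scheduled time has passed but whose backup has not run.

    scheduled: dict mapping site name -> scheduled datetime (hypothetical shape)
    completed: set of site names whose backup has already run
    now:       the current datetime
    """
    due = [
        (when, site)
        for site, when in scheduled.items()
        if when <= now and site not in completed
    ]
    # Oldest scheduled time first, so the longest-delayed backups run first.
    return [site for when, site in sorted(due)]
```

After the producer comes back, something like this could be called once to enqueue everything that was missed while it was down.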
Only the users that had scheduled backups for November 3 and 4 were affected.
We are increasing redundancy for the secondary systems and changing the logic so something like this doesn’t happen again.
On Wednesday, August 15, the machine that is responsible for starting scheduled backups went down. The issue was detected on August 17; the machine was restarted and the backups were run.
Less than 1% of websites were affected, which is why it took us so long to detect the issue.
We allocated additional machines for this task and improved tracking of these machines to detect outages.
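One common way to track machines like these is a heartbeat check: each machine reports in periodically, and any machine that stays silent for too long is flagged. The sketch below is illustrative only. The function name, the machine names and the 15-minute threshold are assumptions, not a description of the tracking actually put in place.

```python
from datetime import datetime, timedelta


def silent_machines(last_heartbeat, now, max_silence):
    """Return the names of machines that have not reported within max_silence.

    last_heartbeat: dict mapping machine name -> datetime of its last report
    """
    return sorted(
        name
        for name, seen in last_heartbeat.items()
        if now - seen > max_silence
    )


# Example with made-up machine names and times.
now = datetime(2000, 1, 1, 12, 0)
last_heartbeat = {
    "scheduler-1": now - timedelta(minutes=2),
    "scheduler-2": now - timedelta(hours=3),
}
print(silent_machines(last_heartbeat, now, timedelta(minutes=15)))  # ['scheduler-2']
```

A check like this depends on the heartbeats themselves rather than on how many websites are affected, so it can catch an outage even when only a small share of sites is hit.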
On Saturday, August 11, we had an issue with the worker machines that are responsible for processing our SEO Ranking results. The majority of the machines ended up in an infinite loop, so the huge spike in processing we have during weekends wasn’t being handled fast enough. Unfortunately, our alerting thresholds weren’t set up correctly, so we didn’t notice this until it was too late. Because we use a third-party API, we couldn’t just throw more machines at the load, so we had to start shedding it. This means that some of the keywords weren’t processed, so some results might be missing. This weekend might have additional keywords missing because of late processing.
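Load shedding in this sense means deliberately dropping part of the backlog when a fixed external limit makes it impossible to process everything. A minimal sketch, assuming a hypothetical per-window budget of third-party API calls; the function name and the budget parameter are illustrative and do not describe the actual worker code.

```python
def shed_load(pending, api_budget):
    """Split pending keywords into (to_process, shed) for one processing window.

    pending:    list of keywords waiting to be processed, oldest first
    api_budget: hypothetical cap on third-party API calls in this window
    """
    if api_budget < 0:
        raise ValueError("api_budget must be non-negative")
    to_process = pending[:api_budget]  # fits within the API budget
    shed = pending[api_budget:]        # dropped; these results will be missing
    return to_process, shed
```

Adding more workers would not change `api_budget` in this model, which is why extra machines do not help once the external limit is reached.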
Almost all of the users who use the SEO Ranking addon were affected.
Alerting thresholds have been adjusted, and a rework of this part of the system has been scheduled.