On Saturday, November 3, the producer responsible for sending the schedule list to the queue went down. The issue was detected on Sunday, November 4. The producer was restored and the system returned to normal working order. The issue has been resolved, and backups will continue to run normally.
The backups scheduled to run on those two days were delayed, but all of them ran successfully once the producer was brought back up.
Only users who had backups scheduled for November 3 and 4 were affected.
We are increasing redundancy for the secondary systems and changing the logic so that something like this doesn’t happen again.
On Wednesday, August 15, the machine responsible for starting scheduled backups went down. The issue was detected on August 17; the machine was restarted and the backups were run.
Fewer than 1% of websites were affected, which is why it took us so long to detect the issue.
We allocated additional machines for this task and improved tracking of these machines to detect outages.
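A minimal sketch of how this kind of outage tracking could work, assuming each machine reports a periodic heartbeat timestamp. The function name, the machine names, and the 15-minute limit are all hypothetical and only illustrate the idea of flagging a machine that has gone silent.

```python
from datetime import datetime, timedelta


def find_stale_machines(last_heartbeats, now, max_silence=timedelta(minutes=15)):
    """Return the names of machines whose last heartbeat is older than max_silence.

    last_heartbeats maps a (hypothetical) machine name to the time it last
    reported in. max_silence is an assumed threshold, not a real setting.
    """
    return sorted(
        name
        for name, last_seen in last_heartbeats.items()
        if now - last_seen > max_silence
    )


# Illustrative data only: one machine reported recently, one has been silent.
now = datetime(2000, 1, 1, 12, 0)
heartbeats = {
    "scheduler-1": now - timedelta(minutes=2),
    "scheduler-2": now - timedelta(hours=3),
}
stale = find_stale_machines(heartbeats, now)
```

A check like this depends on elapsed time rather than on how many websites are affected, so an outage that touches only a small share of websites would still be flagged.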
On Saturday, August 11, we had an issue with the worker machines responsible for processing our SEO Ranking results. The majority of the machines ended up in an infinite loop, so the large spike in processing we have during weekends wasn’t handled fast enough. Unfortunately, our alerting thresholds weren’t set up correctly, so we didn’t notice this until it was too late. Because we rely on a third-party API, we couldn’t simply add more machines to handle the load, so we had to start shedding load. This means that some keywords weren’t processed, so some results might be missing. This weekend might have additional keywords missing because of late processing.
Almost all users of the SEO Ranking addon were affected.
Alerting thresholds have been adjusted, and a rework of this part of the system has been scheduled.
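To illustrate the two ideas in this incident, here is a minimal sketch of load shedding under a fixed third-party API budget, together with a backlog-based alert check. It assumes the budget is a simple count of calls available; the function names, the numbers, and the six-hour drain limit are hypothetical and do not describe the actual system.

```python
def shed_load(pending_keywords, api_budget):
    """Split pending keywords into those that fit the API budget and those shed.

    api_budget is the assumed number of third-party API calls still available.
    Keywords are kept in queue order; anything past the budget is dropped.
    """
    if api_budget < 0:
        raise ValueError("api_budget must not be negative")
    return pending_keywords[:api_budget], pending_keywords[api_budget:]


def should_alert(backlog, processed_per_hour, max_drain_hours):
    """Return True when the backlog would take longer than max_drain_hours to clear."""
    if processed_per_hour <= 0:
        # Nothing is being processed (for example, workers stuck in a loop).
        return backlog > 0
    return backlog / processed_per_hour > max_drain_hours


# Illustrative values only.
processed, shed = shed_load(["kw1", "kw2", "kw3", "kw4", "kw5"], api_budget=3)
alert = should_alert(backlog=1000, processed_per_hour=100, max_drain_hours=6)
```

Alerting on the estimated time to drain the backlog, rather than on raw queue size, is one way to catch workers that have stopped making progress during a weekend spike.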