On Friday, 21st March 2025, we received a report and confirmed that the submission queue had slowed down in processing both XML deposits and XML queries.
Over the course of a few days, the backlog grew to nearly 100,000 queued deposits and over 15,000 queued queries, and members were experiencing delays of up to five days in processing. Typically, even at peak times, the backlog is no more than 10,000 deposits and 5,000 queries, and the processing delay is usually less than an hour. We also normally cap the queue at 10,000 pending files per member, but some members were going far over this cap, and it was unclear how.
We opened an incident on 25th March. Many of our technical, program, and support staff were diverted to this issue, and other work (chiefly the cloud migration project) was paused.
We had recently made changes in three areas: 1) the database, 2) the ingest schema, and 3) our regular weekly code deployments. So we began a process of elimination to determine the root cause. This took several days, during which we tried several tactics: adding capacity, rolling back changes, and taking a 3-hour period of emergency downtime on 29th March to perform database maintenance. While overall database performance improved after that, throughput was still slow, and some larger queries remained particularly stuck.
Having eliminated schema changes, code changes, and database changes, we turned to a subsystem called ‘misspell’, which performs title matching during the processing of submissions. This service had been lagging silently, with intermittent delays, but it was not failing outright and was producing no errors. It had insufficient logging and health metrics, and because it had been consistent and reliable over its 20+ year history, it was not an immediate suspect. However, we found that it had become overloaded by the high volume of traffic and was handling some unusually heavy matching queries from one member in particular.
Between 7th and 9th April, we focused on adding logging and monitoring to the misspell system, and brought up additional servers to mitigate the increased load. We monitored metrics such as throughput time, the age of queued deposits and queries, and file sizes, and observed significant improvement as the deposit and query queues finally drained. On 9th April, we agreed that all previous symptoms of the problem had cleared (slow transactions, intermittent network traffic, and backend failures for misspell).
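For illustration only, here is a minimal sketch (in Python, not taken from our codebase) of the kind of health metrics described above: the age of the oldest queued item and recent throughput. The QueuedItem and QueueMonitor names are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QueuedItem:
    item_id: str
    member_id: str
    enqueued_at: float  # UNIX timestamp when the file entered the queue
    size_bytes: int

@dataclass
class QueueMonitor:
    pending: List[QueuedItem] = field(default_factory=list)
    completed_at: List[float] = field(default_factory=list)  # completion timestamps

    def oldest_age_seconds(self, now: Optional[float] = None) -> float:
        """Age of the oldest pending item; 0 if the queue is empty."""
        now = now if now is not None else time.time()
        if not self.pending:
            return 0.0
        return now - min(item.enqueued_at for item in self.pending)

    def throughput_last_hour(self, now: Optional[float] = None) -> int:
        """Number of items completed in the last hour."""
        now = now if now is not None else time.time()
        return sum(1 for t in self.completed_at if now - t <= 3600)
```

Tracking even these two numbers over time would have surfaced the silent lag in misspell much earlier than it was actually noticed.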
We’re still working out whether query volumes will continue to increase and/or can be managed better, and why the system’s limits (capped at 10,000 pending submissions per member) were not enforced. But we now have better logging and health metrics set up on the misspell subsystem and will add the same for other subsystems, documenting them better as we go. As planned, all subsystems are being carefully considered as we recommence the data migration work, migrating multiple subsystems and codebases to the cloud. We also plan to discuss with members who make extra-large queries whether these can be made more efficient (either on their side or ours), and we’ll investigate the queue limits and how they can be enforced.
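As an illustration of the kind of enforcement we’ll be investigating, here is a minimal, hypothetical sketch of a per-member cap checked at enqueue time. The SubmissionQueue and QueueCapExceeded names exist only for this example and don’t reflect the actual implementation, which is still being reviewed.

```python
from collections import defaultdict

PER_MEMBER_CAP = 10_000  # pending files allowed per member

class QueueCapExceeded(Exception):
    pass

class SubmissionQueue:
    def __init__(self, cap: int = PER_MEMBER_CAP):
        self.cap = cap
        self.pending_by_member = defaultdict(int)

    def enqueue(self, member_id: str, submission) -> None:
        """Reject the submission if the member already has `cap` files pending."""
        if self.pending_by_member[member_id] >= self.cap:
            raise QueueCapExceeded(
                f"member {member_id} already has {self.cap} pending submissions")
        self.pending_by_member[member_id] += 1
        # ... hand the submission to the actual processing pipeline here ...

    def mark_done(self, member_id: str) -> None:
        """Decrement the member's pending count when processing completes."""
        self.pending_by_member[member_id] = max(
            0, self.pending_by_member[member_id] - 1)
```

Checking the cap at the point of entry, rather than somewhere downstream, is one way to ensure a single member’s burst of submissions can’t crowd out everyone else’s.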