Slower submission processing at doi.crossref.org

Incident Report for Crossref

Postmortem

Incident Review

Summary of incident and impact

On Friday, 21st March 2025, we received a report, and confirmed, that the submission queue had slowed in processing both XML deposits and XML queries.

Over the course of a few days, the backlog grew to nearly 100,000 queued deposits and over 15,000 queued queries, and members were experiencing processing delays of up to five days. Typically, even at peak times, the backlog stays under 10,000 deposits and 5,000 queries, and the processing delay is often less than an hour. We also normally cap the queue at 10,000 pending files per member, but some members were able to exceed this cap by a wide margin, and it was unclear how.
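For illustration, the kind of per-member cap described here might look like the following minimal sketch. The names, the cap constant, and the in-memory store are all assumptions for this example; this is not Crossref's actual implementation.

```python
# Hypothetical sketch of a per-member pending-file cap.
MEMBER_CAP = 10_000  # assumed cap on pending files per member

class SubmissionQueue:
    def __init__(self):
        self._pending = {}  # member_id -> list of queued submissions

    def enqueue(self, member_id: str, submission: bytes) -> bool:
        """Accept a submission unless the member is already at the cap."""
        queue = self._pending.setdefault(member_id, [])
        if len(queue) >= MEMBER_CAP:
            # Reject (or defer) rather than letting one member's
            # backlog crowd out everyone else's processing.
            return False
        queue.append(submission)
        return True
```

Part of the incident, as described above, is that a check like this was evidently not being enforced for some members; see the next steps below.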

We opened an incident on 25th March. Many of our technical, program, and support staff were diverted to this issue, and other work (chiefly the cloud migration project) was paused. 

Investigation

We had recently made three kinds of changes: 1) to the database, 2) to the ingest schema, and 3) through our regular weekly code deployments. We therefore began a process of elimination to determine the root cause. This took several days as we tried several tactics, including adding more capacity, rolling back changes, and a three-hour period of emergency downtime on 29th March to perform database maintenance. While overall database performance improved after that, throughput remained slow, and some larger queries were still particularly stuck.

Root cause

Having eliminated schema changes, code changes, and database changes, we turned to 'misspell', a subsystem that performs title matching during submission processing. This service had been lagging silently, with intermittent delays, but it was not failing outright and was producing no errors. It had insufficient logging and health metrics, and because it had been consistently reliable over its 20+ year history, it was not an early suspect. On closer inspection, however, we found it had become overloaded by the high volume of traffic and was handling some unusually heavy matching queries from one member in particular.
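One lesson here is that a call that is slow but not failing produces no errors to alert on, so it needs explicit timing instrumentation. A minimal sketch of the kind of logging that would have surfaced the lag; the `match_title` callable and the threshold are hypothetical, not misspell's real interface:

```python
import logging
import time

logger = logging.getLogger("misspell")
SLOW_CALL_THRESHOLD = 2.0  # seconds; an assumed alerting threshold

def timed_match(match_title, title: str):
    """Wrap a title-matching call so slow responses are logged,
    even when the call eventually succeeds without error."""
    start = time.monotonic()
    result = match_title(title)
    elapsed = time.monotonic() - start
    if elapsed > SLOW_CALL_THRESHOLD:
        logger.warning("slow title match: %.2fs for %r", elapsed, title)
    return result
```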

Resolution

Between 7th and 9th April, we focused on adding logging and monitoring to the misspell system, and added more servers to absorb the increased load. We watched metrics such as throughput time, the age of queued deposits and queries, and file sizes, and observed significant improvement as the deposit and query queues finally drained. On 9th April, we agreed that all previous symptoms of the problem (slow transactions, intermittent network traffic, backend failures in misspell) had cleared.
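The "age of queued deposits/queries" metric mentioned above can be computed directly from enqueue timestamps. A sketch, assuming a hypothetical record structure in which each queued item carries a timezone-aware `enqueued_at` datetime:

```python
from datetime import datetime, timezone

def oldest_item_age_seconds(queue) -> float:
    """Age of the oldest queued item: a simple indicator of whether
    a backlog is draining (falling) or growing (rising)."""
    if not queue:
        return 0.0
    now = datetime.now(timezone.utc)
    oldest = min(item["enqueued_at"] for item in queue)
    return (now - oldest).total_seconds()
```

Watching this value fall steadily toward zero is what "the queues finally drained" looks like on a dashboard.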

Next steps 

We're still working out whether query volumes will continue to increase and/or can be managed better, and why the system's limits (capped at 10,000 pending submissions per member) were not enforced. But we now have better logging and health metrics on the misspell subsystem, and we will add the same to other subsystems, documenting them better as we go. As planned, all subsystems are being carefully reviewed as we recommence the migration work, moving multiple subsystems and codebases to the cloud. We also plan to discuss with members who make extra-large queries whether these can be made more efficient (on their side or ours), and we'll investigate the queue limits and how they can be enforced.

Posted Apr 16, 2025 - 17:04 UTC

Resolved

This incident has been resolved. We continue to work on a post-mortem, and will share it when it's available. -IF
Posted Apr 10, 2025 - 13:54 UTC

Update

We are no longer seeing lags in the processing of submissions and queries in our queue (doi.crossref.org). That said, we continue to monitor the queue's performance while we work on the post-mortem. -IF
Posted Apr 09, 2025 - 19:55 UTC

Monitoring

We believe a fix has been found and implemented; everything is flowing far faster now. We will monitor the results for a bit longer and share a post-mortem soon.
-GH
Posted Apr 09, 2025 - 11:29 UTC

Identified

We believe we have identified the underlying issue that has been causing submission processing lags. We're working on a fix. -IF
Posted Apr 08, 2025 - 20:30 UTC

Update

We are continuing to investigate this issue.
Posted Apr 07, 2025 - 21:45 UTC

Update

We're still focused on speeding up slow deposit and query processing in the submission queue. We made some progress over the weekend, so things are looking better; more details soon.
-GH
Posted Apr 06, 2025 - 10:14 UTC

Update

We are continuing to investigate this issue.
Posted Apr 04, 2025 - 19:08 UTC

Update

Our technical team continues to investigate the cause of these performance lags. -IF
Posted Apr 03, 2025 - 21:31 UTC

Update

We are continuing to investigate this issue.
Posted Apr 02, 2025 - 20:43 UTC

Investigating

We reverted additional code earlier today during our weekly release in an attempt to pinpoint the origin of the performance lags in the submission and query queues at doi.crossref.org. Unfortunately, this did not result in a return to the processing speeds that we saw prior to 18 March. We have enabled additional processing threads for submissions and queries and continue to investigate the source of the lags. -IF
Posted Apr 01, 2025 - 19:02 UTC

Update

We are continuing to monitor for any further issues.
Posted Mar 31, 2025 - 17:07 UTC

Update

The emergency maintenance has now been completed. We have dedicated additional resources to drive down the backlog of pending submissions and queries in the queue at doi.crossref.org. -IF

Our changes have improved performance. We continue to monitor the results.
Posted Mar 29, 2025 - 14:37 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 29, 2025 - 13:58 UTC

Identified

We will be conducting emergency, unplanned maintenance tomorrow, Saturday, 29 March, from 11:00 to 14:00 UTC in order to address the slow processing times and a growing submission and query queue size. More details here: https://status.crossref.org/incidents/nxp3kkx4xxs3
Posted Mar 28, 2025 - 17:51 UTC

Investigating

Unfortunately, today's special release has not resolved the issue. We're continuing to investigate the cause of the performance lags in the submission queue - doi.crossref.org. Resolving these lags is our technical team's highest priority. -IF
Posted Mar 27, 2025 - 17:42 UTC

Update

We will be performing special maintenance - release v0.231.0 - tomorrow, 27 March, to try to further improve submission processing performance. -IF
Posted Mar 26, 2025 - 19:56 UTC

Update

Our change yesterday improved performance in the submission queue - doi.crossref.org - for several hours, but we're seeing lags again today. We're investigating other potential causes of this issue. -IF
Posted Mar 26, 2025 - 13:20 UTC

Update

The deployment earlier today did not fix the performance lags in our submission queue - doi.crossref.org. Our technical team has reverted a change that they believe is responsible for the lags. We're monitoring the results. -IF
Posted Mar 25, 2025 - 17:38 UTC

Monitoring

We have observed slower processing times in our submission queue - doi.crossref.org - during the last few days. We currently have more than 50,000 pending submissions. We're still processing your submissions. We're hopeful that today's deployment will fix the performance lags. We're monitoring the issue. -IF
Posted Mar 25, 2025 - 14:13 UTC
This incident affected: Content Registration (Admin tool, Test admin tool, Web deposit form, Record registration form) and Beta (Metadata Manager).