Networking Issues at Our Data Centre

Incident Report for Crossref

Postmortem

Summary of incident and impact

On 30 April at 11:30 UTC, our monitoring tools alerted us that our physical data centre was down, meaning all services routed through the data centre were unavailable. Essentially all Crossref services were affected: all content registration and helper tools (web deposit form, record registration form, STQ, etc.), the REST API, OAI-PMH, reports, and our website. Members who tried to deposit metadata during this time received a network error and will need to retry their metadata submissions. Existing DOIs continued to resolve during this time.

Because the REST API already runs in the cloud (though traffic to it is routed through the data centre first), we updated the routing for the REST API to bypass the data centre, restoring REST API service at approximately 15:00 UTC. The rest of the services were restored at approximately 23:00 UTC.

Updating the routing of the REST API had a knock-on effect of disrupting deposits for our members using the Crossref OJS plugin, beginning when the rest of the services were restored. The issue was resolved with additional routing changes on 1 May at approximately 16:00 UTC. OJS users who use the Crossref plugin and attempted deposits during this time received a failure notification and will need to resubmit. 

Root cause

Once our staff arrived at the data centre, we determined the primary firewall hardware had failed. The secondary firewall had also failed previously, but that failure had gone unacknowledged. 

Resolution

We obtained and configured a new firewall and restored services.

Next Steps

We’ll obtain additional backup firewalls to have on hand in the event of another failure. We are already in the process of moving all of our services to the cloud and out of the physical data centre, so this incident is a great reminder (if we needed one!) of the importance of this project.

Posted May 01, 2025 - 23:34 UTC

Resolved

This incident has been resolved.
Posted May 01, 2025 - 02:54 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 30, 2025 - 23:16 UTC

Update

Our infrastructure team continues to work on a fix for this issue. -IF
Posted Apr 30, 2025 - 21:41 UTC

Update

We are continuing to work on a fix for this issue.
Posted Apr 30, 2025 - 20:12 UTC

Update

We are continuing to work on a fix for this issue. Any metadata registration attempts sent to us are failing with a network timeout. Unfortunately, that means anything submitted to us during this downtime will need to be resubmitted when the system is restored. -IF
Posted Apr 30, 2025 - 18:02 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 30, 2025 - 16:00 UTC

Update

We are continuing to investigate this issue.
Posted Apr 30, 2025 - 15:22 UTC

Update

We are continuing to investigate this issue.
Posted Apr 30, 2025 - 14:11 UTC

Investigating

We are currently experiencing network problems at our data centre, which have caused both our doi and api domains to be inaccessible.

We are working on this as a matter of urgency and will post more as soon as we know more.
Posted Apr 30, 2025 - 12:35 UTC
This incident affected: APIs (Public REST API, Polite REST API, OAI-PMH, XML API, Event Data Query API, OpenURL, Crossmark dialog server, Public content negotiation, Polite content negotiation), Sites (Crossref website, Crossref support, Metadata search, Participation reports), Meta (deliberately-unreliable server, Demo Auth), Metadata Plus (Plus REST API, Plus OAI-PMH, XML Snapshots, JSON Snapshots, Plus content negotiation, Key Manager), Content Registration (Admin tool, Test admin tool, Web deposit form, Record registration form), Integrations and external dependencies (Handle servers, Monitoring of iThenticate can be seen at https://turnitin.statuspage.io., AWS cloudfront, ORCID Auto-update), and Beta (Metadata Manager).