On 30th April at 11:30 UTC, we were alerted by our monitoring tools that our physical data centre was down, meaning all services that go through the data centre were out of action. Essentially all Crossref services were affected: all content registration and helper tools (web deposit form, record registration form, STQ, etc.), the REST API, OAI-PMH, reports, and our website. Members who tried to deposit metadata during this time received a network error and will now need to re-try their metadata submissions. Existing DOIs still resolved during this time.
Because the REST API already runs in the cloud (though traffic to it is routed through the data centre first), we updated the routing for the REST API to bypass the data centre, restoring REST API service at approximately 15:00 UTC. The rest of the services were restored at approximately 23:00 UTC.
Updating the routing of the REST API had a knock-on effect of disrupting deposits for our members using the Crossref OJS plugin, beginning when the rest of the services were restored. The issue was resolved with additional routing changes on 1 May at approximately 16:00 UTC. OJS users who use the Crossref plugin and attempted deposits during this time received a failure notification and will need to resubmit.
Once our staff arrived at the data centre, we determined the primary firewall hardware had failed. The secondary firewall had also failed previously, but that failure had gone unacknowledged.
We obtained and configured a new firewall and restored services.
We’ll obtain additional backup firewalls to have on hand in the event of another failure. We are already in the process of moving all of our services to the cloud and out of the physical data centre, so this incident is a great reminder (if we needed one!) of the importance of this project.