On 13 July 2022, between 15:30 and 16:50 UTC (all times below in UTC), all our WebRTC clusters were unable to scale. This affected exam takers trying to start a system check and those starting an exam session. All clusters were affected, but only eu4 was scaling out at that time, so only exam takers on eu4 were actually affected. On all our other clusters, new exam takers were accommodated by existing nodes that did not have the issue.
At 15:30, new infrastructure was deployed to our production environments to help process the video recordings. This infrastructure had previously been deployed to our staging environment and tested there with no issues. It was also designed to be independent of the existing infrastructure and does not modify it.
At 15:40, customers started reporting long waits for exam takers entering the system check or the exam session. No other alarms were raised.
The engineering team responded immediately and escalated the issue to a P1 at 15:45. It was clear that the eu4 WebRTC cluster was not allowing new sessions to start correctly, but no cause could be identified immediately.
At 16:20, the root cause of the problem was identified: new nodes in our WebRTC clusters were unable to pull the Docker images they needed to run, failing with an I/O timeout. We then tested our other WebRTC clusters (eu1, ca2 and us2) and confirmed that the issue was present on all of them, prompting us to update our customer-facing status page accordingly.
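A lightweight connectivity probe on newly provisioned nodes could have surfaced this failure mode before exam takers did, since no other alarms were raised. Below is a minimal sketch of such a probe; the registry hostname is a placeholder, not our actual registry:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A False result on a node that is about to pull Docker images indicates
    a networking problem (e.g. DNS or routing broken by a VPC change)
    before the pull itself times out.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage at node startup; substitute the real registry host.
# if not can_reach("registry.example.com", 443):
#     raise RuntimeError("cannot reach Docker registry; check VPC endpoints/DNS")
```

Running this check during node bootstrap, and alerting on failure, turns a silent scaling stall into an explicit, attributable alarm.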
We then investigated whether the deployment of the new infrastructure could be related and found that it had added endpoints to the VPCs (Virtual Private Clouds) in which the WebRTC clusters are located. For reasons still unknown, these new endpoints prevented our nodes from pulling Docker images from the registry, something that does not happen in our staging VPC.
As soon as we removed the endpoints, the nodes started working again. The eu4 cluster was fixed at 16:45, and all other clusters were working again by 16:50.
As a follow-up: the endpoints added by the new infrastructure have been removed from all production VPCs, so there is no risk of this specific issue happening again.